GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/22328
[SPARK-22666][ML][SQL] Spark datasource for image format

## What changes were proposed in this pull request?

Implement an image schema datasource.

This image datasource supports:
- partition discovery (loading partitioned images)
- `dropImageFailures` (same behavior as `ImageSchema.readImages`)
- path wildcard matching (same behavior as `ImageSchema.readImages`)
- loading recursively from a directory (unlike `ImageSchema.readImages`; use a path such as `/path/to/dir/**`)

This datasource does **NOT** support:
- specifying `numPartitions` (the datasource determines it automatically)
- sampling (you can call `df.sample` afterwards, but the sampling operator won't be pushed down to the datasource)

## How was this patch tested?

Unit tests.

## Benchmark

I benchmarked the old `ImageSchema.readImages` API against my image datasource and compared the elapsed times.

**cluster**: 4 nodes, each with 64GB memory and an 8-core CPU

**test dataset**: Flickr8k_Dataset (about 8091 images)

**time cost**:
- My image datasource (automatically generates 258 partitions): 38.04s
- `ImageSchema.readImages` (set to 16 partitions): 68.4s
- `ImageSchema.readImages` (set to 258 partitions): 90.6s

**time cost when doubling the number of images** (cloning Flickr8k_Dataset and loading twice as many images):
- My image datasource (automatically generates 515 partitions): 95.4s
- `ImageSchema.readImages` (set to 32 partitions): 109s
- `ImageSchema.readImages` (set to 515 partitions): 105s

So we can see that my image datasource implementation (this PR) brings some performance improvement over the old `ImageSchema.readImages` API.
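
To put the benchmark figures above in relative terms, here is a minimal sketch that turns the reported timings into speedup ratios (old API time divided by datasource time). The timings are copied from the benchmark section. The dictionary layout, the dataset labels, and the `speedups` helper are illustrative names chosen for this example and are not part of the PR.

```python
# Timings in seconds, copied from the benchmark section of this description.
# The structure and key names below are hypothetical, chosen for illustration.
timings = {
    "Flickr8k": {
        "datasource": 38.04,                 # 258 partitions, chosen automatically
        "old_api": {16: 68.4, 258: 90.6},    # partitions -> seconds
    },
    "Flickr8k doubled": {
        "datasource": 95.4,                  # 515 partitions, chosen automatically
        "old_api": {32: 109.0, 515: 105.0},  # partitions -> seconds
    },
}


def speedups(entry):
    """Return {partitions: old_api_time / datasource_time} for one dataset."""
    return {
        parts: seconds / entry["datasource"]
        for parts, seconds in entry["old_api"].items()
    }


results = {name: speedups(entry) for name, entry in timings.items()}

for name, ratios in results.items():
    for parts, ratio in ratios.items():
        print(f"{name}: datasource is {ratio:.2f}x faster "
              f"than the old API with {parts} partitions")
```

On these numbers the gain is roughly 1.8x to 2.4x on the original dataset and roughly 1.1x on the doubled dataset, so the advantage narrows as the image count grows in this particular run.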
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark image_datasource

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22328.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22328

----
commit 5b5aee66b2ea819341b624164298f0700ee07ddf
Author: WeichenXu <weichen.xu@...>
Date:   2018-09-04T09:48:50Z

    init pr

----