GitHub user WeichenXu123 opened a pull request:

    https://github.com/apache/spark/pull/22328

    [SPARK-22666][ML][SQL] Spark datasource for image format

    ## What changes were proposed in this pull request?
    
    Implement an image schema datasource.
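
    For context, a minimal sketch of what loading through this datasource is expected to look like. The short name `"image"` and the reuse of the existing `ImageSchema` struct are assumptions based on this PR's description, not a confirmed final API:

    ```scala
    // Assumes an active SparkSession `spark` (e.g. in spark-shell) and that the
    // datasource is registered under the short name "image".
    val df = spark.read.format("image").load("/path/to/images")
    df.printSchema()
    // Expected to reuse the existing ImageSchema columns:
    // root
    //  |-- image: struct (nullable = true)
    //  |    |-- origin: string
    //  |    |-- height: integer
    //  |    |-- width: integer
    //  |    |-- nChannels: integer
    //  |    |-- mode: integer
    //  |    |-- data: binary
    ```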
    
    This image datasource supports (see the usage sketch after this list):
      - partition discovery (loading partitioned images)
      - dropImageFailures (the same behavior as `ImageSchema.readImages`)
      - path wildcard matching (the same behavior as `ImageSchema.readImages`)
      - recursive loading from a directory (unlike `ImageSchema.readImages`, this uses a path such as `/path/to/dir/**`)
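
    A hedged usage sketch of these features; the `"image"` format name and the `dropImageFailures` option spelling are taken from this PR's description and may differ in the final API:

    ```scala
    // Assumes an active SparkSession `spark`.
    // Partition discovery: directories such as
    //   /path/to/images/label=cat/..., /path/to/images/label=dog/...
    // yield an extra "label" partition column.
    val partitioned = spark.read.format("image")
      .option("dropImageFailures", "true")   // skip unreadable files, like readImages
      .load("/path/to/images")

    // Path wildcard matching, as with ImageSchema.readImages
    val wildcard = spark.read.format("image").load("/path/to/images/*.jpg")

    // Recursive loading from a directory via a ** path
    val recursive = spark.read.format("image").load("/path/to/dir/**")
    ```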
    
    This datasource does **NOT** support (see the workaround sketch after this list):
      - specifying `numPartitions` (the datasource determines the partitioning automatically)
      - sampling (you can call `df.sample` afterwards, but the sampling operator won't be pushed down to the datasource)
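
    Both can still be expressed on the resulting DataFrame; a sketch (format name assumed as above):

    ```scala
    // Repartitioning and sampling happen after the load; neither is pushed
    // down into the datasource itself.
    val images = spark.read.format("image")
      .load("/path/to/images")
      .repartition(64)                                   // choose parallelism explicitly
      .sample(withReplacement = false, fraction = 0.1)   // plain DataFrame sampling
    ```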
    
    ## How was this patch tested?
    Unit tests.
    
    ## Benchmark
    I benchmarked the load time of the old `ImageSchema.readImages` API against the new image datasource.
    
    **cluster**: 4 nodes, each with 64GB memory, 8 cores CPU
    **test dataset**: Flickr8k_Dataset (8091 images)
    
    **time cost**:

    | Loader                     | Partitions | Time    |
    |----------------------------|------------|---------|
    | Image datasource (this PR) | 258 (auto) | 38.04 s |
    | `ImageSchema.readImages`   | 16         | 68.4 s  |
    | `ImageSchema.readImages`   | 258        | 90.6 s  |

    **time cost with the image count doubled** (Flickr8k_Dataset cloned so that twice as many images are loaded):

    | Loader                     | Partitions | Time    |
    |----------------------------|------------|---------|
    | Image datasource (this PR) | 515 (auto) | 95.4 s  |
    | `ImageSchema.readImages`   | 32         | 109 s   |
    | `ImageSchema.readImages`   | 515        | 105 s   |
    
    So the image datasource implementation in this PR brings a clear performance improvement over the old `ImageSchema.readImages` API.
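
    For reference, a rough sketch of how the two load paths can be timed; the dataset path is a placeholder and the datasource short name is assumed, while `readImages` is the existing `ImageSchema` API:

    ```scala
    import org.apache.spark.ml.image.ImageSchema

    // Simple wall-clock timing helper (assumes an active SparkSession `spark`).
    def time[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - start) / 1e9}%.1f s")
      result
    }

    // count() forces a full read so the comparison measures actual loading.
    time("image datasource") {
      spark.read.format("image").load("/data/Flickr8k_Dataset").count()
    }
    time("ImageSchema.readImages (258 partitions)") {
      // (path, session, recursive, numPartitions, dropImageFailures, sampleRatio, seed)
      ImageSchema.readImages("/data/Flickr8k_Dataset", spark, false, 258, false, 1.0, 0).count()
    }
    ```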
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark image_datasource

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22328.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22328
    
----
commit 5b5aee66b2ea819341b624164298f0700ee07ddf
Author: WeichenXu <weichen.xu@...>
Date:   2018-09-04T09:48:50Z

    init pr

----

