GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/23052

    [SPARK-26081][SQL] Prevent empty files for empty partitions in Text 
datasources

    ## What changes were proposed in this pull request?
    
    This PR postpones creation of the
`OutputStream`/`Univocity`/`JacksonGenerator` until the first row is
written, which prevents empty files from being created for empty partitions.
As a result, such files no longer need to be opened and read back when data
is loaded from the location.
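
The idea of deferring stream creation until the first write can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not the code from this PR: `LazyTextWriter`, `write_partition`, and the `part-0000N.txt` file names are hypothetical and exist only for this example.

```python
import os
import tempfile


class LazyTextWriter:
    """Hypothetical writer that opens its output file only on the first write."""

    def __init__(self, path):
        self.path = path
        self._stream = None  # nothing opened yet, so no file exists on disk

    def write(self, row):
        if self._stream is None:
            # First row for this partition: create the file now.
            self._stream = open(self.path, "w", encoding="utf-8")
        self._stream.write(row + "\n")

    def close(self):
        # If no row was ever written there is nothing to close.
        if self._stream is not None:
            self._stream.close()
            self._stream = None


def write_partition(rows, path):
    """Write all rows of one partition; report whether a file was created."""
    writer = LazyTextWriter(path)
    try:
        for row in rows:
            writer.write(row)
    finally:
        writer.close()
    return os.path.exists(path)


with tempfile.TemporaryDirectory() as tmp:
    # An empty partition leaves no file behind; a non-empty one does.
    empty_created = write_partition([], os.path.join(tmp, "part-00000.txt"))
    full_created = write_partition(["a", "b"], os.path.join(tmp, "part-00001.txt"))
```

With an eager writer that opens the file in its constructor, the first call would leave a zero-byte file behind. The lazy variant leaves the directory untouched for empty partitions.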
    
    ## How was this patch tested?
    
    Added tests for the Text, JSON and CSV datasources in which an empty
dataset is written and must not produce any files.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 text-empty-files

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23052.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23052
    
----
commit 3efa7b615f7c37538edb0afca30d4f300ac07aee
Author: Maxim Gekk <max.gekk@...>
Date:   2018-11-15T19:44:47Z

    Added a test for text datasource

commit 80aadf645ab63885ce6f43ac74b0c02871e10883
Author: Maxim Gekk <max.gekk@...>
Date:   2018-11-15T20:11:00Z

    Creating output stream on the first write

commit 0a774ef9e4de987c9f3073b90396215b9f04ca16
Author: Maxim Gekk <max.gekk@...>
Date:   2018-11-15T20:20:27Z

    Test for csv

commit 47b71b7a235ffcdfa79753307f1afcb377a17977
Author: Maxim Gekk <max.gekk@...>
Date:   2018-11-15T20:21:06Z

    Don't produce empty CSV files

commit 040c71f8ea49ca10160cfa242095d6ebd2d76a8d
Author: Maxim Gekk <max.gekk@...>
Date:   2018-11-15T20:22:23Z

    Test for JSON

commit 6f3cb18d5a863f6aded763bdeb5395f6622876ff
Author: Maxim Gekk <max.gekk@...>
Date:   2018-11-15T20:32:32Z

    Do not produce empty JSON files

----


