GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/20959

    [SPARK-23846][SQL] The samplingRatio option for CSV datasource

    ## What changes were proposed in this pull request?
    
    I propose supporting the `samplingRatio` option for schema inference in the CSV
    datasource, similar to the same option in the JSON datasource:
    
    https://github.com/apache/spark/blob/b14993e1fcb68e1c946a671c6048605ab4afdf58/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala#L49-L50
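
As a rough illustration of the idea behind a sampling ratio (not Spark's actual implementation), schema inference over a sample can be sketched in plain Python. The names `infer_type` and `infer_schema`, the three-type widening order, and the per-row random sampling are all assumptions made for this sketch.

```python
import csv
import io
import random

# Widening order: a column takes the widest type seen among sampled rows.
_RANK = {"int": 0, "double": 1, "string": 2}


def infer_type(value):
    """Return the narrowest type name ('int', 'double', 'string') that parses value."""
    for name, cast in (("int", int), ("double", float)):
        try:
            cast(value)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text, sampling_ratio=1.0, seed=0):
    """Infer (column, type) pairs from about sampling_ratio of the data rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    rng = random.Random(seed)
    # Hypothetical sampling step: keep each row independently
    # with probability sampling_ratio.
    sampled = [row for row in data if rng.random() < sampling_ratio]
    types = [None] * len(header)
    for row in sampled:
        for i, value in enumerate(row[:len(header)]):
            seen = infer_type(value)
            if types[i] is None or _RANK[seen] > _RANK[types[i]]:
                types[i] = seen
    # Columns with no sampled values fall back to string.
    return [(name, t or "string") for name, t in zip(header, types)]
```

With `sampling_ratio=1.0` every row is inspected. Lower values reduce the work done during inference, at the cost that a rare wider value (for example a single non-numeric cell) may fall outside the sample and be missed.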
    
    ## How was this patch tested?
    
    Added 2 tests for the JSON datasource and 2 tests for the CSV datasource. The tests check
    that only a subset of the input dataset is used for schema inference.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 csv-sampling

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20959.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20959
    
----
commit 78160368a94651a1c9d92660f6f00e0a12d8e324
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-02T13:28:32Z

    Adding samplingRation tests for json

commit d79976feb001606e8103813984ecaddf8a66f847
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-02T13:34:06Z

    Removing debug code

commit bb5cfeee18c6ce17efd99642757a1c9b6e2c1145
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-02T13:43:16Z

    Porting samplingRatio test from JSONSuite

commit da9813b83e3c9bd35520c82e868a2493af0315a4
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-02T14:00:21Z

    Support the samplingRatio option for CSV datasource

commit a0966d0302261c499dc73df7fc2b16c508934e96
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-02T14:27:16Z

    Adding description of samplingRation to the csv and json methods of DataFrameReader

commit ba12fca8514a2cb620224e859e8a8b5cc208bf31
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-02T14:41:08Z

    Adding ticket number to test titles

----
