GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/20959

[SPARK-23846][SQL] The samplingRatio option for CSV datasource

## What changes were proposed in this pull request?

I propose to support the `samplingRatio` option for schema inference in the CSV datasource, similar to the same option of the JSON datasource: https://github.com/apache/spark/blob/b14993e1fcb68e1c946a671c6048605ab4afdf58/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala#L49-L50

## How was this patch tested?

Added 2 tests for the JSON datasource and 2 tests for the CSV datasource. The tests check that only a subset of the input dataset is used for schema inference.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 csv-sampling

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20959.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20959

----
commit 78160368a94651a1c9d92660f6f00e0a12d8e324
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-02T13:28:32Z

    Adding samplingRation tests for json

commit d79976feb001606e8103813984ecaddf8a66f847
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-02T13:34:06Z

    Removing debug code

commit bb5cfeee18c6ce17efd99642757a1c9b6e2c1145
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-02T13:43:16Z

    Porting samplingRatio test from JSONSuite

commit da9813b83e3c9bd35520c82e868a2493af0315a4
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-02T14:00:21Z

    Support the samplingRatio option for CSV datasource

commit a0966d0302261c499dc73df7fc2b16c508934e96
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-02T14:27:16Z

    Adding description of samplingRation to the csv and json methods of DataFrameReader

commit ba12fca8514a2cb620224e859e8a8b5cc208bf31
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-02T14:41:08Z

    Adding ticket number to test titles

----
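
For readers unfamiliar with the idea behind `samplingRatio`, the following is a minimal, Spark-free sketch of schema inference over a random fraction of CSV rows. It is an illustration only and not the code in this pull request: the names `infer_type`, `infer_schema`, `sampling_ratio`, and `seed`, the three-type lattice (`int`, `double`, `string`), and the fallback to `string` for columns with no sampled value are all assumptions made for this example.

```python
import csv
import io
import random


def infer_type(value):
    """Return the narrowest type name (hypothetical lattice) that parses value."""
    for name, parse in (("int", int), ("double", float)):
        try:
            parse(value)
            return name
        except ValueError:
            pass
    return "string"


# Widening order: a column is as wide as its widest sampled value.
_RANK = {"int": 0, "double": 1, "string": 2}


def infer_schema(text, sampling_ratio=1.0, seed=0):
    """Infer one type per column from a random fraction of the data rows."""
    if not 0.0 < sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be in (0, 1]")
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rng = random.Random(seed)
    types = [None] * len(header)
    for row in reader:
        # Keep each row with probability sampling_ratio; 1.0 scans everything.
        if sampling_ratio < 1.0 and rng.random() >= sampling_ratio:
            continue
        for i, value in enumerate(row[:len(header)]):
            t = infer_type(value)
            if types[i] is None or _RANK[t] > _RANK[types[i]]:
                types[i] = t
    # Assumption for this sketch: columns with no sampled value become "string".
    return {name: (t or "string") for name, t in zip(header, types)}


if __name__ == "__main__":
    # 99 integer rows followed by one row whose score is fractional.
    data = "id,score\n" + "\n".join(f"{i},{i}" for i in range(99)) + "\n99,0.5\n"
    print(infer_schema(data))                      # {'id': 'int', 'score': 'double'}
    print(infer_schema(data, sampling_ratio=0.1))  # may miss the fractional row
```

The trade-off the sketch shows is the one the tests in this pull request exercise: only a subset of the input is inspected, so inference is cheaper, but a rare wider value can be missed. With the option proposed here, the corresponding reader call would presumably look like `spark.read.option("inferSchema", "true").option("samplingRatio", "0.1").csv(path)`, where `path` is a placeholder.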