GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/21292
[SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader's options to Text datasource on schema inferring

## What changes were proposed in this pull request?

While reading CSV or JSON files, DataFrameReader's options are converted to Hadoop's parameters, for example here:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L302

but the options are not propagated to the Text datasource during schema inference, for instance:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L184-L188

This PR proposes propagating the user's options to the Text datasource during schema inference, in the same way that the user's options are converted to Hadoop parameters when a schema is specified.

## How was this patch tested?

The changes were tested manually by using https://github.com/twitter/hadoop-lzo:

```
hadoop-lzo> mvn clean package
hadoop-lzo> ln -s ./target/hadoop-lzo-0.4.21-SNAPSHOT.jar ./hadoop-lzo.jar
```

Create 2 test files in JSON and CSV format and compress them:

```shell
$ cat test.csv
col1|col2
a|1
$ lzop test.csv
$ cat test.json
{"col1":"a","col2":1}
$ lzop test.json
```

Run `spark-shell` with hadoop-lzo:

```
bin/spark-shell --jars ~/hadoop-lzo/hadoop-lzo.jar
```

Read the compressed CSV and JSON files without a schema:

```scala
spark.read.option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").option("inferSchema",true).option("header",true).option("sep","|").csv("test.csv.lzo").show()
+----+----+
|col1|col2|
+----+----+
|   a|   1|
+----+----+
```

```scala
spark.read.option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").option("multiLine", true).json("test.json.lzo").printSchema
root
 |-- col1: string (nullable = true)
 |-- col2: long (nullable = true)
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 text-options-backport-v2.3

Alternatively you can
review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21292.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21292

----
commit 9092faa573cf39faa7171f1335d612309452b644
Author: Maxim Gekk <max.gekk@...>
Date:   2018-04-27T13:23:40Z

    Propagating DataFrameReader's options to the text datasource on schema inferring

commit 7b4a6b40625028c7c367090f0fe48e0ec26bc79a
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-28T07:58:31Z

    Make textOptions serializable

commit fe6c3c2cc9a113f7cd38185a0315484e3a3c99cc
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-05T09:16:44Z

    Adding @transient to textOptions because they shouldn't be serialized

commit 831441b292c67c8de93eb25894df02579cbc0fd3
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-06T08:09:37Z

    Removing the separate val for textOptions

commit f6ab928c1abcac239f9d857d86d2e2a966f8e091
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-06T08:53:25Z

    Removing unused imports

----
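
The idea behind the change described in this pull request, copying the reader's options into the Hadoop-level configuration so that a setting such as `io.compression.codecs` is also visible during schema inference, can be illustrated with a minimal sketch. This is not the code from the patch: plain Scala maps stand in for Hadoop's `Configuration`, the names `OptionPropagationSketch` and `confWithOptions` are hypothetical, and skipping `path`-like keys and null values is an assumption of the sketch.

```scala
// Illustrative only: immutable maps stand in for Hadoop's Configuration,
// and all names in this object are hypothetical.
object OptionPropagationSketch {
  // Overlay the reader options on a base configuration.
  // Assumption: null values and the "path"/"paths" keys are skipped,
  // since they name the input rather than configure how it is read.
  def confWithOptions(
      base: Map[String, String],
      options: Map[String, String]): Map[String, String] = {
    options.foldLeft(base) { case (conf, (key, value)) =>
      if (value != null && key != "path" && key != "paths") conf + (key -> value)
      else conf
    }
  }

  def main(args: Array[String]): Unit = {
    val base = Map("io.file.buffer.size" -> "4096")
    val readerOptions = Map(
      "io.compression.codecs" -> "com.hadoop.compression.lzo.LzopCodec",
      "sep" -> "|",
      "path" -> "test.csv.lzo")

    val conf = confWithOptions(base, readerOptions)
    // The codec setting is now part of the configuration used for reading.
    println(conf.get("io.compression.codecs"))
    // The path key is not copied.
    println(conf.contains("path"))
  }
}
```

In the sketch, the same merged configuration would be used both when a schema is supplied and when it has to be inferred, which is the symmetry the pull request aims for.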