[GitHub] spark pull request #21292: [SPARK-24068][BACKPORT-2.3] Propagating DataFrame...

MaxGekk Thu, 10 May 2018 06:00:08 -0700

GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/21292


    [SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader's options to Text datasource on schema inferring

    ## What changes were proposed in this pull request?
    
    When CSV or JSON files are read, DataFrameReader's options are converted to 
Hadoop parameters, for example here:
    
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L302
    
    However, the options are not propagated to the Text datasource during schema 
inference, for instance:
    
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L184-L188
    
    This PR propagates the user's options to the Text datasource during schema 
inference, in the same way that they are converted to Hadoop parameters 
when a schema is specified.
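
The idea can be illustrated with a minimal, Spark-free sketch. It is only a model under stated assumptions: plain `Map`s stand in for Hadoop's configuration and for the reader's options, and the names `OptionPropagationSketch`, `withOptions`, and `codecFor` are hypothetical, not part of Spark or Hadoop.

```scala
// Hypothetical model, not Spark code: plain Maps stand in for Hadoop's
// Configuration and for DataFrameReader's options.
object OptionPropagationSketch {
  type Conf = Map[String, String]

  // Overlay user-supplied reader options on a base configuration.
  // User options win on key conflicts; null values are skipped.
  def withOptions(base: Conf, options: Map[String, String]): Conf =
    base ++ options.filter { case (_, v) => v != null }

  // Stand-in for codec lookup: an .lzo file can only be decoded if the
  // configuration lists a codec whose class name ends with "LzopCodec".
  def codecFor(conf: Conf, path: String): Option[String] = {
    val codecs = conf.get("io.compression.codecs").toSeq
      .flatMap(_.split(",").toSeq)
      .map(_.trim)
    if (path.endsWith(".lzo")) codecs.find(_.endsWith("LzopCodec")) else None
  }
}
```

In this model, looking up a codec against the base configuration alone finds nothing for `test.csv.lzo`, which mirrors schema inference ignoring the user's options. Looking it up against `withOptions(base, options)` finds the codec, which mirrors the behaviour this PR aims for.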
    
    ## How was this patch tested?
    The changes were tested manually using 
https://github.com/twitter/hadoop-lzo:
    
    ```
    hadoop-lzo> mvn clean package
    hadoop-lzo> ln -s ./target/hadoop-lzo-0.4.21-SNAPSHOT.jar ./hadoop-lzo.jar
    ```
    Create 2 test files in JSON and CSV format and compress them:
    ```shell
    $ cat test.csv
    col1|col2
    a|1
    $ lzop test.csv
    $ cat test.json
    {"col1":"a","col2":1}
    $ lzop test.json
    ```
    Run `spark-shell` with hadoop-lzo:
    ```
    bin/spark-shell --jars ~/hadoop-lzo/hadoop-lzo.jar
    ```
    Read the compressed CSV and JSON files without specifying a schema:
    ```scala
    spark.read.option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").option("inferSchema", true).option("header", true).option("sep", "|").csv("test.csv.lzo").show()
    +----+----+
    |col1|col2|
    +----+----+
    |   a|   1|
    +----+----+
    ```
    ```scala
    spark.read.option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").option("multiLine", true).json("test.json.lzo").printSchema
    root
     |-- col1: string (nullable = true)
     |-- col2: long (nullable = true)
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 text-options-backport-v2.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21292.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21292
    
----
commit 9092faa573cf39faa7171f1335d612309452b644
Author: Maxim Gekk <max.gekk@...>
Date:   2018-04-27T13:23:40Z

    Propagating DataFrameReader's options to the text datasource on schema inferring

commit 7b4a6b40625028c7c367090f0fe48e0ec26bc79a
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-28T07:58:31Z

    Make textOptions serializable

commit fe6c3c2cc9a113f7cd38185a0315484e3a3c99cc
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-05T09:16:44Z

    Adding @transient to textOptions because they shouldn't be serialized

commit 831441b292c67c8de93eb25894df02579cbc0fd3
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-06T08:09:37Z

    Removing the separate val for textOptions

commit f6ab928c1abcac239f9d857d86d2e2a966f8e091
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-06T08:53:25Z

    Removing unused imports

----

