[jira] [Resolved] (SPARK-28505) Add data source option for omitting partitioned columns when saving to file

Hyukjin Kwon (JIRA) Fri, 26 Jul 2019 03:00:19 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-28505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hyukjin Kwon resolved SPARK-28505.
----------------------------------
    Resolution: Invalid

> Add data source option for omitting partitioned columns when saving to file
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-28505
>                 URL: https://issues.apache.org/jira/browse/SPARK-28505
>             Project: Spark
>          Issue Type: Wish
>          Components: Input/Output, Spark Core
>    Affects Versions: 2.4.4, 3.0.0
>            Reporter: Juarez Rudsatz
>            Priority: Minor
>
> It would be very useful to have an option for omitting the partitioning 
> columns from the output when writing to a file data source such as CSV, 
> Avro, Parquet, ORC or Excel.
> Consider the following code:
> {{Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);}}
>  {{myDF.select("value1", "value2", "year","month","day")}}
>  {{.write().format("csv")}}
>  {{.option("header", "true")}}
>  {{.partitionBy("year","month","day")}}
>  {{.save("hdfs://user/spark/warehouse/csv_output_dir");}}
> This will output many files in separate folders, in a structure like:
> {{csv_output_dir/_SUCCESS}}
>  
> {{csv_output_dir/year=2019/month=7/day=10/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv}}
>  
> {{csv_output_dir/year=2019/month=7/day=11/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv}}
>  {{...}}
> And the content of each file will be something like:
> {{┌──────┬──────┬──────┬───────┬─────┐}}
>  {{│ val1 │ val2 │ year │ month │ day │}}
>  {{├──────┼──────┼──────┼───────┼─────┤}}
>  {{│ 3673 │ 2345 │ 2019 │     7 │ 10  │}}
>  {{│ 2345 │ 3423 │ 2019 │     7 │ 10  │}}
>  {{│ 8765 │ 2423 │ 2019 │     7 │ 10  │}}
>  {{└──────┴──────┴──────┴───────┴─────┘}}
> When using partitioning in Hive, the output from the same source data will be 
> something like:
> {{┌──────┬──────┐}}
>  {{│ val1 │ val2 │}}
>  {{├──────┼──────┤}}
>  {{│ 3673 │ 2345 │}}
>  {{│ 2345 │ 3423 │}}
>  {{│ 8765 │ 2423 │}}
>  {{└──────┴──────┘}}
> In this case the partitioning columns are not present in the CSV 
> files. However, the output files follow the same folder/path structure 
> that exists today.
> Please consider adding an opt-in config for DataFrameWriter for leaving out 
> the partitioning columns, as in the second example.
> The code could be something like:
> {{Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);}}
>  {{myDF.select("value1", "value2", "year","month","day")}}
>  {{.write().format("csv")}}
>  {{.option("header", "true")}}
>  *{{.option("partition.omit.cols", "true")}}*
>  {{.partitionBy("year","month","day")}}
>  {{.save("hdfs://user/spark/warehouse/csv_output_dir");}}
> Thanks.
>   
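
The Hive-style layout requested in the issue (partition values carried by the directory path, only the remaining columns written to each file) can be illustrated with a small sketch. This is plain Java with no Spark dependency, so it only models the layout and says nothing about how Spark's writers are implemented. The class name PartitionLayoutSketch and the method layout are hypothetical names chosen for illustration; the sample rows are the three rows shown in the issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark: models a Hive-style partitioned layout.
class PartitionLayoutSketch {

    // Groups rows by the partition columns and returns, for each partition
    // directory, the CSV lines that remain once the partition columns are dropped.
    static Map<String, List<String>> layout(List<String> header,
                                            List<List<String>> rows,
                                            List<String> partitionCols) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        for (List<String> row : rows) {
            // Build the directory name, e.g. year=2019/month=7/day=10.
            StringBuilder dir = new StringBuilder();
            for (String col : partitionCols) {
                int idx = header.indexOf(col);
                if (idx < 0) {
                    throw new IllegalArgumentException("Unknown partition column: " + col);
                }
                if (dir.length() > 0) {
                    dir.append('/');
                }
                dir.append(col).append('=').append(row.get(idx));
            }
            // Keep only the values of the non-partition columns.
            List<String> kept = new ArrayList<>();
            for (int i = 0; i < header.size(); i++) {
                if (!partitionCols.contains(header.get(i))) {
                    kept.add(row.get(i));
                }
            }
            files.computeIfAbsent(dir.toString(), k -> new ArrayList<>())
                 .add(String.join(",", kept));
        }
        return files;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("val1", "val2", "year", "month", "day");
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("3673", "2345", "2019", "7", "10"),
            Arrays.asList("2345", "3423", "2019", "7", "10"),
            Arrays.asList("8765", "2423", "2019", "7", "10"));
        // Print each partition directory with the lines that would land in it.
        layout(header, rows, Arrays.asList("year", "month", "day"))
            .forEach((dir, lines) -> System.out.println(dir + " -> " + lines));
    }
}
```

Because the partition values are recoverable from the directory names, a reader of such a layout can rebuild the full rows without the columns being repeated in every file.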



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
