[ https://issues.apache.org/jira/browse/SPARK-28505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-28505.
----------------------------------
    Resolution: Invalid

> Add data source option for omitting partitioned columns when saving to file
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-28505
>                 URL: https://issues.apache.org/jira/browse/SPARK-28505
>             Project: Spark
>          Issue Type: Wish
>          Components: Input/Output, Spark Core
>    Affects Versions: 2.4.4, 3.0.0
>            Reporter: Juarez Rudsatz
>            Priority: Minor
>
> It would be very useful to have an option for omitting the columns used in
> partitioning from the output when writing to a file data source such as CSV,
> Avro, Parquet, ORC or Excel.
>
> Consider the following code:
>
> {{Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);}}
> {{myDF.select("value1", "value2", "year","month","day")}}
> {{.write().format("csv")}}
> {{.option("header", "true")}}
> {{.partitionBy("year","month","day")}}
> {{.save("hdfs://user/spark/warehouse/csv_output_dir");}}
>
> This will output many files in separate folders, in a structure like:
>
> {{csv_output_dir/_SUCCESS}}
> {{csv_output_dir/year=2019/month=7/day=10/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv}}
> {{csv_output_dir/year=2019/month=7/day=11/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv}}
> {{...}}
>
> And the file contents will be something like:
>
> {{┌──────┬──────┬──────┬───────┬─────┐}}
> {{│ val1 │ val2 │ year │ month │ day │}}
> {{├──────┼──────┼──────┼───────┼─────┤}}
> {{│ 3673 │ 2345 │ 2019 │ 7     │ 10  │}}
> {{│ 2345 │ 3423 │ 2019 │ 7     │ 10  │}}
> {{│ 8765 │ 2423 │ 2019 │ 7     │ 10  │}}
> {{└──────┴──────┴──────┴───────┴─────┘}}
>
> When using partitioning in Hive, the output from the same source data will
> be something like:
>
> {{┌──────┬──────┐}}
> {{│ val1 │ val2 │}}
> {{├──────┼──────┤}}
> {{│ 3673 │ 2345 │}}
> {{│ 2345 │ 3423 │}}
> {{│ 8765 │ 2423 │}}
> {{└──────┴──────┘}}
>
> In this case the partitioning columns are not present in the CSV files.
> However, the output files follow the same folder/path structure that exists
> today.
>
> Please consider adding an opt-in config for DataFrameWriter that leaves out
> the partitioning columns, as in the second example.
>
> The code could be something like:
>
> {{Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);}}
> {{myDF.select("value1", "value2", "year","month","day")}}
> {{.write().format("csv")}}
> {{.option("header", "true")}}
> *{{.option("partition.omit.cols", "true")}}*
> {{.partitionBy("year","month","day")}}
> {{.save("hdfs://user/spark/warehouse/csv_output_dir");}}
>
> Thanks.
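
To make the layout in the second (Hive-style) example concrete, here is a minimal sketch in plain Java that does not use Spark at all. It only shows how the partition columns of one row can be moved into the directory path and dropped from the data written to the file. The class name `PartitionLayoutSketch` and both helper methods are hypothetical names chosen for illustration; they are not a Spark API, and the sample row is taken from the tables in the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Spark.
class PartitionLayoutSketch {

    // Builds the Hive-style directory path for a row,
    // one "column=value" segment per partition column.
    static String partitionPath(Map<String, String> row, List<String> partitionCols) {
        List<String> parts = new ArrayList<>();
        for (String col : partitionCols) {
            parts.add(col + "=" + row.get(col));
        }
        return String.join("/", parts);
    }

    // Returns a copy of the row without its partition columns,
    // keeping the remaining columns in their original order.
    static Map<String, String> dataColumns(Map<String, String> row, List<String> partitionCols) {
        Map<String, String> out = new LinkedHashMap<>(row);
        out.keySet().removeAll(partitionCols);
        return out;
    }

    public static void main(String[] args) {
        // First row of the example table in the description.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("value1", "3673");
        row.put("value2", "2345");
        row.put("year", "2019");
        row.put("month", "7");
        row.put("day", "10");

        List<String> partitionCols = Arrays.asList("year", "month", "day");

        // Directory the row belongs to: year=2019/month=7/day=10
        System.out.println(partitionPath(row, partitionCols));
        // CSV line stored in that directory: 3673,2345
        System.out.println(String.join(",", dataColumns(row, partitionCols).values()));
    }
}
```

With this layout the partition values are stored once, in the path, and a reader has to parse them back out of the directory names to rebuild the full row.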