[ https://issues.apache.org/jira/browse/SPARK-28505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893698#comment-16893698 ]
Hyukjin Kwon commented on SPARK-28505:
--------------------------------------

I don't quite understand. Hive reads partitioned columns from the directory structure as well.

{code}
scala> val myDF = spark.range(10).selectExpr("id as value1", "id as value2", "id as year", "id as month", "id as day")
myDF: org.apache.spark.sql.DataFrame = [value1: bigint, value2: bigint ... 3 more fields]

scala> myDF.select("value1", "value2", "year", "month", "day").write.format("csv").option("header", "true").partitionBy("year", "month", "day").save("/tmp/foo")
{code}

{code}
➜  ~ cd /tmp/foo
➜  foo ls
_SUCCESS year=0 year=1 year=2 year=3 year=4 year=5 year=6 year=7 year=8 year=9
➜  foo tree .
.
├── _SUCCESS
├── year=0
│   └── month=0
│       └── day=0
│           └── part-00001-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=1
│   └── month=1
│       └── day=1
│           └── part-00002-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=2
│   └── month=2
│       └── day=2
│           └── part-00003-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=3
│   └── month=3
│       └── day=3
│           └── part-00004-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=4
│   └── month=4
│       └── day=4
│           └── part-00005-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=5
│   └── month=5
│       └── day=5
│           └── part-00007-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=6
│   └── month=6
│       └── day=6
│           └── part-00008-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=7
│   └── month=7
│       └── day=7
│           └── part-00009-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
├── year=8
│   └── month=8
│       └── day=8
│           └── part-00010-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
└── year=9
    └── month=9
        └── day=9
            └── part-00011-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv

30 directories, 11 files
➜  foo cat part-00001-517c30de-3c23-4ed0-bb44-f5729cd05fec.c000.csv
value1,value2
0,0
{code}

So Spark doesn't save partitioned columns in its output files either.
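The partition values omitted from the file bodies are not lost: they are encoded in the Hive-style `key=value` directory names shown above, which is how Spark and Hive recover them on read. A minimal plain-Java sketch of both directions, under the assumption of string-valued columns; the class and method names (`HivePartitionPaths`, `toPartitionPath`, `fromPartitionPath`) are illustrative, not Spark APIs:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class HivePartitionPaths {

    /** Build a Hive-style partition path like "year=2019/month=7/day=10"
     *  from a row, and drop those columns from the row itself, so only
     *  the remaining data columns would be written to the file body. */
    static String toPartitionPath(Map<String, String> row, List<String> partitionCols) {
        String path = partitionCols.stream()
                .map(c -> c + "=" + row.get(c))
                .collect(Collectors.joining("/"));
        row.keySet().removeAll(partitionCols); // removing keys from keySet removes the map entries
        return path;
    }

    /** Recover the partition column values back from a Hive-style file path. */
    static Map<String, String> fromPartitionPath(String path) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) { // file names like "part-00001-...csv" have no '=' and are skipped
                values.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("value1", "3673");
        row.put("value2", "2345");
        row.put("year", "2019");
        row.put("month", "7");
        row.put("day", "10");

        String path = toPartitionPath(row, List.of("year", "month", "day"));
        System.out.println(path); // year=2019/month=7/day=10
        System.out.println(row);  // {value1=3673, value2=2345}
        System.out.println(fromPartitionPath(path + "/part-00001.csv"));
        // {year=2019, month=7, day=10}
    }
}
```

This is only the path-encoding mechanics; real Spark additionally URL-escapes special characters in partition values and infers partition column types during partition discovery.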
> Add data source option for omitting partitioned columns when saving to file
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-28505
>                 URL: https://issues.apache.org/jira/browse/SPARK-28505
>             Project: Spark
>          Issue Type: Wish
>          Components: Input/Output, Spark Core
>    Affects Versions: 2.4.4, 3.0.0
>            Reporter: Juarez Rudsatz
>            Priority: Minor
>
> It would be very useful to have an option for omitting the columns used in
> partitioning from the output while writing to a file data source like csv,
> avro, parquet, orc or excel.
> Consider the following code:
> {{Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);}}
> {{myDF.select("value1", "value2", "year", "month", "day")}}
> {{.write().format("csv")}}
> {{.option("header", "true")}}
> {{.partitionBy("year", "month", "day")}}
> {{.save("hdfs://user/spark/warehouse/csv_output_dir");}}
> This will output many files in separate folders in a structure like:
> {{csv_output_dir/_SUCCESS}}
> {{csv_output_dir/year=2019/month=7/day=10/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv}}
> {{csv_output_dir/year=2019/month=7/day=11/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv}}
> {{...}}
> And the output will be something like:
> {{┌──────┬──────┬──────┬───────┬─────┐}}
> {{│ val1 │ val2 │ year │ month │ day │}}
> {{├──────┼──────┼──────┼───────┼─────┤}}
> {{│ 3673 │ 2345 │ 2019 │ 7     │ 10  │}}
> {{│ 2345 │ 3423 │ 2019 │ 7     │ 10  │}}
> {{│ 8765 │ 2423 │ 2019 │ 7     │ 10  │}}
> {{└──────┴──────┴──────┴───────┴─────┘}}
> When using partitioning in Hive, the output from the same source data will be
> something like:
> {{┌──────┬──────┐}}
> {{│ val1 │ val2 │}}
> {{├──────┼──────┤}}
> {{│ 3673 │ 2345 │}}
> {{│ 2345 │ 3423 │}}
> {{│ 8765 │ 2423 │}}
> {{└──────┴──────┘}}
> In this case the partitioning columns are not present in the CSV files,
> while the output files follow the same folder/path structure as exists today.
> Please consider adding an opt-in config for DataFrameWriter for leaving out
> the partitioning columns as in the second example.
> The code could be something like:
> {{Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);}}
> {{myDF.select("value1", "value2", "year", "month", "day")}}
> {{.write().format("csv")}}
> {{.option("header", "true")}}
> *{{.option("partition.omit.cols", "true")}}*
> {{.partitionBy("year", "month", "day")}}
> {{.save("hdfs://user/spark/warehouse/csv_output_dir");}}
> Thanks.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org