Re: [SQL] Write parquet files under partition directories?
Almost all DataFrame work is tracked by this umbrella ticket: https://issues.apache.org/jira/browse/SPARK-6116

For the reader/writer interface, it's here:
https://issues.apache.org/jira/browse/SPARK-7654
https://github.com/apache/spark/pull/6175

On Tue, Jun 2, 2015 at 3:57 PM, Matt Cheah mch...@palantir.com wrote:

> Excellent! Where can I find the code, pull request, and Spark ticket where
> this was introduced?
>
> Thanks,
> -Matt Cheah
>
> From: Reynold Xin r...@databricks.com
> Date: Monday, June 1, 2015 at 10:25 PM
> To: Matt Cheah mch...@palantir.com
> Cc: dev@spark.apache.org, Mingyu Kim m...@palantir.com, Andrew Ash a...@palantir.com
> Subject: Re: [SQL] Write parquet files under partition directories?
>
> There will be in 1.4.
>
> df.write.partitionBy("year", "month", "day").parquet("/path/to/output")
>
> On Mon, Jun 1, 2015 at 10:21 PM, Matt Cheah mch...@palantir.com wrote:
>
>> Hi there,
>>
>> I noticed in the latest Spark SQL programming guide
>> https://spark.apache.org/docs/latest/sql-programming-guide.html, there is
>> support for optimized reading of partitioned Parquet files that have a
>> particular directory structure (year=1/month=10/day=3, for example).
>>
>> However, I see no analogous way to write DataFrames as Parquet files with
>> similar directory structures based on user-provided partitioning.
>> Generally, is it possible to write DataFrames as partitioned Parquet files
>> that downstream partition discovery can take advantage of later?
>>
>> I considered extending the Parquet output format, but it looks like
>> ParquetTableOperations.scala has fixed the output format to
>> AppendingParquetOutputFormat. Also, I was wondering if it would be
>> valuable to contribute writing Parquet in partition directories as a PR.
>>
>> Thanks,
>> -Matt Cheah
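The `partitionBy` call quoted above writes one directory per distinct combination of partition-column values, using Hive-style `col=value` path segments. As a minimal stdlib-only sketch of that layout (not Spark code; `partition_path` and the sample row are illustrative):

```python
def partition_path(base, partition_cols, row):
    """Build a Hive-style partition directory path for one row,
    mirroring the layout df.write.partitionBy(...) produces,
    e.g. year=2015/month=6/day=2 under the base output path."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([base] + parts)

row = {"year": 2015, "month": 6, "day": 2, "value": 42}
print(partition_path("/path/to/output", ["year", "month", "day"], row))
# -> /path/to/output/year=2015/month=6/day=2
```

The partition columns themselves are encoded in the directory names rather than stored in the Parquet data files, which is what lets the reader recover them by path inspection alone.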
Re: [SQL] Write parquet files under partition directories?
Excellent! Where can I find the code, pull request, and Spark ticket where this was introduced?

Thanks,
-Matt Cheah

From: Reynold Xin r...@databricks.com
Date: Monday, June 1, 2015 at 10:25 PM
To: Matt Cheah mch...@palantir.com
Cc: dev@spark.apache.org, Mingyu Kim m...@palantir.com, Andrew Ash a...@palantir.com
Subject: Re: [SQL] Write parquet files under partition directories?

> There will be in 1.4.
>
> df.write.partitionBy("year", "month", "day").parquet("/path/to/output")
>
> On Mon, Jun 1, 2015 at 10:21 PM, Matt Cheah mch...@palantir.com wrote:
>
>> Hi there,
>>
>> I noticed in the latest Spark SQL programming guide
>> https://spark.apache.org/docs/latest/sql-programming-guide.html, there is
>> support for optimized reading of partitioned Parquet files that have a
>> particular directory structure (year=1/month=10/day=3, for example).
>>
>> However, I see no analogous way to write DataFrames as Parquet files with
>> similar directory structures based on user-provided partitioning.
>> Generally, is it possible to write DataFrames as partitioned Parquet files
>> that downstream partition discovery can take advantage of later?
>>
>> I considered extending the Parquet output format, but it looks like
>> ParquetTableOperations.scala has fixed the output format to
>> AppendingParquetOutputFormat. Also, I was wondering if it would be
>> valuable to contribute writing Parquet in partition directories as a PR.
>>
>> Thanks,
>> -Matt Cheah
Re: [SQL] Write parquet files under partition directories?
There will be in 1.4.

df.write.partitionBy("year", "month", "day").parquet("/path/to/output")

On Mon, Jun 1, 2015 at 10:21 PM, Matt Cheah mch...@palantir.com wrote:

> Hi there,
>
> I noticed in the latest Spark SQL programming guide
> https://spark.apache.org/docs/latest/sql-programming-guide.html, there is
> support for optimized reading of partitioned Parquet files that have a
> particular directory structure (year=1/month=10/day=3, for example).
>
> However, I see no analogous way to write DataFrames as Parquet files with
> similar directory structures based on user-provided partitioning.
> Generally, is it possible to write DataFrames as partitioned Parquet files
> that downstream partition discovery can take advantage of later?
>
> I considered extending the Parquet output format, but it looks like
> ParquetTableOperations.scala has fixed the output format to
> AppendingParquetOutputFormat. Also, I was wondering if it would be valuable
> to contribute writing Parquet in partition directories as a PR.
>
> Thanks,
> -Matt Cheah