Re: [SQL] Write parquet files under partition directories?

2015-06-02 Thread Reynold Xin
Almost all of the DataFrame work is tracked by this umbrella ticket:
https://issues.apache.org/jira/browse/SPARK-6116

For the reader/writer interface, it's here:

https://issues.apache.org/jira/browse/SPARK-7654

https://github.com/apache/spark/pull/6175
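
In short, the writer/reader interface from that PR looks roughly like this (a sketch against the 1.4 API, assuming a DataFrame df and a SQLContext sqlContext already in scope; the columns and path are illustrative):

  import org.apache.spark.sql.SaveMode

  // Write through the generic DataFrameWriter interface; partitionBy
  // lays files out under year=/month=/day= subdirectories of the path.
  df.write
    .format("parquet")
    .mode(SaveMode.Append)
    .partitionBy("year", "month", "day")
    .save("/path/to/output")

  // Read back through the matching DataFrameReader interface;
  // partition discovery restores year, month, and day as columns.
  val loaded = sqlContext.read.format("parquet").load("/path/to/output")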

On Tue, Jun 2, 2015 at 3:57 PM, Matt Cheah mch...@palantir.com wrote:

 Excellent! Where can I find the code, pull request, and Spark ticket where
 this was introduced?

 Thanks,

 -Matt Cheah

 From: Reynold Xin r...@databricks.com
 Date: Monday, June 1, 2015 at 10:25 PM
 To: Matt Cheah mch...@palantir.com
 Cc: dev@spark.apache.org dev@spark.apache.org, Mingyu Kim 
 m...@palantir.com, Andrew Ash a...@palantir.com
 Subject: Re: [SQL] Write parquet files under partition directories?

 There will be support for this in 1.4:

 df.write.partitionBy("year", "month", "day").parquet("/path/to/output")

 On Mon, Jun 1, 2015 at 10:21 PM, Matt Cheah mch...@palantir.com wrote:

 Hi there,

 I noticed in the latest Spark SQL programming guide
 https://spark.apache.org/docs/latest/sql-programming-guide.html,
 there is support for optimized reading of partitioned Parquet files that
 have a particular directory structure (year=1/month=10/day=3, for example).
 However, I see no analogous way to write DataFrames as Parquet files with
 similar directory structures based on user-provided partitioning.

 Generally, is it possible to write DataFrames as partitioned Parquet
 files that downstream partition discovery can take advantage of later? I
 considered extending the Parquet output format, but it looks like
 ParquetTableOperations.scala hard-codes the output format to
 AppendingParquetOutputFormat.

 Also, I was wondering if it would be valuable to contribute writing
 Parquet in partition directories as a PR.

 Thanks,

 -Matt Cheah





Re: [SQL] Write parquet files under partition directories?

2015-06-02 Thread Matt Cheah
Excellent! Where can I find the code, pull request, and Spark ticket where
this was introduced?

Thanks,

-Matt Cheah

From:  Reynold Xin r...@databricks.com
Date:  Monday, June 1, 2015 at 10:25 PM
To:  Matt Cheah mch...@palantir.com
Cc:  dev@spark.apache.org dev@spark.apache.org, Mingyu Kim
m...@palantir.com, Andrew Ash a...@palantir.com
Subject:  Re: [SQL] Write parquet files under partition directories?

There will be support for this in 1.4:

df.write.partitionBy("year", "month", "day").parquet("/path/to/output")

On Mon, Jun 1, 2015 at 10:21 PM, Matt Cheah mch...@palantir.com wrote:
 Hi there,
 
 I noticed in the latest Spark SQL programming guide
 https://spark.apache.org/docs/latest/sql-programming-guide.html,
 there is support for optimized reading of partitioned Parquet files that have
 a particular directory structure (year=1/month=10/day=3, for example).
 However, I see no analogous way to write DataFrames as Parquet files with
 similar directory structures based on user-provided partitioning.
 
 Generally, is it possible to write DataFrames as partitioned Parquet files
 that downstream partition discovery can take advantage of later? I considered
 extending the Parquet output format, but it looks like
 ParquetTableOperations.scala hard-codes the output format to
 AppendingParquetOutputFormat.
 
 Also, I was wondering if it would be valuable to contribute writing Parquet in
 partition directories as a PR.
 
 Thanks,
 
 -Matt Cheah







Re: [SQL] Write parquet files under partition directories?

2015-06-01 Thread Reynold Xin
There will be support for this in 1.4:

df.write.partitionBy("year", "month", "day").parquet("/path/to/output")
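
Spelled out, here is a minimal end-to-end sketch of that 1.4 API (the toy rows, column names, and output path are illustrative, not from this thread):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(
    new SparkConf().setAppName("partitioned-write").setMaster("local[*]"))
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // Toy rows; the partition columns live in the DataFrame itself.
  val df = Seq(
    (2015, 6, 1, "a"),
    (2015, 6, 2, "b"),
    (2014, 12, 31, "c")
  ).toDF("year", "month", "day", "value")

  // Produces /tmp/output/year=2015/month=6/day=1/part-*.parquet, etc.
  df.write.partitionBy("year", "month", "day").parquet("/tmp/output")

  // Partition discovery turns the directory names back into columns.
  val readBack = sqlContext.read.parquet("/tmp/output")
  readBack.printSchema()  // schema includes year, month, day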

On Mon, Jun 1, 2015 at 10:21 PM, Matt Cheah mch...@palantir.com wrote:

 Hi there,

 I noticed in the latest Spark SQL programming guide
 https://spark.apache.org/docs/latest/sql-programming-guide.html, there
 is support for optimized reading of partitioned Parquet files that have a
 particular directory structure (year=1/month=10/day=3, for example).
 However, I see no analogous way to write DataFrames as Parquet files with
 similar directory structures based on user-provided partitioning.

 Generally, is it possible to write DataFrames as partitioned Parquet files
 that downstream partition discovery can take advantage of later? I
 considered extending the Parquet output format, but it looks like
 ParquetTableOperations.scala hard-codes the output format to
 AppendingParquetOutputFormat.

 Also, I was wondering if it would be valuable to contribute writing
 Parquet in partition directories as a PR.

 Thanks,

 -Matt Cheah