Re: Parquet Multiple Output

2015-06-12 Thread Cheng Lian
Spark 1.4 supports dynamic partitioning: you can first convert your RDD 
to a DataFrame and then save its contents partitioned by the date column. 
Say you have a DataFrame df containing three columns a, b, and c; you 
could write something like this:


df.write.partitionBy("a", "b").mode("overwrite").parquet("path/to/file")


Cheng

On 6/13/15 5:31 AM, Xin Liu wrote:

Hi,

I have a scenario where I'd like to store an RDD in Parquet format as 
many files, each corresponding to a day, such as 2015/01/01, 2015/02/02, 
etc.


So far I have used this method

http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job

to store text files (then I have to read the text files back, convert 
them to Parquet, and store them again). Has anyone tried writing many 
Parquet files from one RDD?
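
For reference, the approach from that Stack Overflow answer looks roughly 
like this (a sketch; the RDDMultipleTextOutputFormat class and the 
(String, String) pair RDD are illustrative):

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Routes each record into a subdirectory named after its key,
// e.g. path/to/output/2015/01/01/part-00000
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String] + "/" + name
}

// rdd: RDD[(String, String)] keyed by a date string such as "2015/01/01"
rdd.saveAsHadoopFile("path/to/output", classOf[String], classOf[String],
  classOf[RDDMultipleTextOutputFormat])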


Thanks,
Xin






