Parquet Multiple Output

2015-06-12 Thread Xin Liu
Hi, I have a scenario where I'd like to store an RDD in Parquet format across many output directories, one per day, such as 2015/01/01, 2015/02/02, etc. So far I have used this method http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job to store text files.
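For reference, the linked Stack Overflow approach routes records to per-key subdirectories via a custom Hadoop output format. A minimal sketch of that technique (class name, keys, and paths are illustrative) might look like this; note it only works for text output, which is what motivates the question:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    // Routes each record to a subdirectory named after its key,
    // e.g. /tmp/output/2015/01/01/part-00000
    class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // Drop the key from the written records; only the value is emitted
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
      // Use the key (e.g. "2015/01/01") as the subdirectory for this record
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String] + "/" + name
    }

    object MultiOutputExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("multi-output"))
        val rdd = sc.parallelize(Seq(("2015/01/01", "row1"), ("2015/02/02", "row2")))
        rdd.saveAsHadoopFile("/tmp/output", classOf[String], classOf[String],
          classOf[RDDMultipleTextOutputFormat])
        sc.stop()
      }
    }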

Re: Parquet Multiple Output

2015-06-12 Thread Cheng Lian
Spark 1.4 supports dynamic partitioning: you can first convert your RDD to a DataFrame and then save its contents partitioned by a date column. Say you have a DataFrame df containing three columns a, b, and c; you may have something like this: df.write.partitionBy("a", "b").parquet("path")
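Fleshing that out, a minimal self-contained sketch of the Spark 1.4 approach might look like the following (the Record case class, the date column, and the output path are illustrative, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Illustrative record type; "date" is the partition column
    case class Record(date: String, b: String, c: Long)

    object PartitionedParquetExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partitioned-parquet"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val rdd = sc.parallelize(Seq(
          Record("2015-01-01", "x", 1L),
          Record("2015-02-02", "y", 2L)))

        // Dynamic partitioning: one Parquet subdirectory per distinct date value,
        // e.g. /tmp/output/date=2015-01-01/ and /tmp/output/date=2015-02-02/
        rdd.toDF().write.partitionBy("date").parquet("/tmp/output")

        sc.stop()
      }
    }

A side benefit of this layout is partition pruning on read: a query such as sqlContext.read.parquet("/tmp/output").filter($"date" === "2015-01-01") only touches the matching date= directory.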