Spark 1.4 supports dynamic partitioning: you can first convert your RDD to a DataFrame and then save its contents partitioned by the date column. Say you have a DataFrame df containing three columns a, b, and c; you can do something like this:

df.write.partitionBy("a", "b").mode("overwrite").parquet("path/to/file")
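If you are starting from an RDD, a minimal sketch of the whole flow might look like the following (the Event case class, column names, and output path are only illustrative assumptions, not part of your setup):

import org.apache.spark.sql.SQLContext

case class Event(date: String, value: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// A toy RDD of records carrying a date column.
val rdd = sc.parallelize(Seq(
  Event("2015-01-01", "x"),
  Event("2015-01-02", "y")))

// toDF() converts the RDD to a DataFrame; partitionBy("date") makes Spark
// write one subdirectory per date value, e.g. date=2015-01-01/, under the
// output path, each containing Parquet files for that day.
rdd.toDF().write.partitionBy("date").mode("overwrite").parquet("/tmp/events")

Reading the output path back with sqlContext.read.parquet will also recover the date column from the directory names.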

Cheng

On 6/13/15 5:31 AM, Xin Liu wrote:
Hi,

I have a scenario where I'd like to store an RDD in Parquet format as many files that correspond to days, such as 2015/01/01, 2015/02/02, etc.

So far I have used this method

http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job

to store text files (which I then have to read back, convert to Parquet, and store again). Has anyone tried to store many Parquet files from one RDD?

Thanks,
Xin



