Spark 1.4 supports dynamic partitioning: you can first convert your RDD
to a DataFrame and then save its contents partitioned by the date column.
Say you have a DataFrame df containing three columns a, b, and c; you
could write something like this:
df.write.partitionBy("a",
"b").mode("overwrite").parquet("path/to/file")
Cheng
On 6/13/15 5:31 AM, Xin Liu wrote:
Hi,
I have a scenario where I'd like to store an RDD in Parquet format
across many files that correspond to days, such as 2015/01/01,
2015/02/02, etc.
So far I have used the method from
http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
to store text files (then I have to read the text files back, convert
them to Parquet, and store them again). Has anyone tried to write many
Parquet files from one RDD?
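
Roughly, what I have been doing looks like this (simplified sketch; the
key/value types and paths are placeholders):

  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

  // Routes each record to a subdirectory named after its key (the date).
  class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
    override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
      key.asInstanceOf[String] + "/" + name
  }

  // Pair RDD keyed by date, e.g. ("2015/01/01", line).
  val pairs = sc.parallelize(Seq(("2015/01/01", "x"), ("2015/02/02", "y")))
  pairs.saveAsHadoopFile("path/to/text", classOf[String], classOf[String],
    classOf[RDDMultipleTextOutputFormat])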
Thanks,
Xin