subject:"Partition RDD by key to create DataFrames"

Re: Partition RDD by key to create DataFrames

2016-03-15 Thread Davies Liu

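I think you could create a DataFrame with the schema (mykey, value1, value2), then partition it by mykey when saving as Parquet:

    // Flatten each (key, (value1, value2)) pair into a three-column Row.
    val r2 = rdd.map { case (k, v) => Row(k, v._1, v._2) }
    // schema is a StructType with the fields mykey, value1, value2.
    val df = sqlContext.createDataFrame(r2, schema)
    // Writes one sub-directory per distinct value of mykey under path.
    df.write.partitionBy("mykey").parquet(path)

On Tue, Mar 15, 2016 at 10:33 AM,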
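To make the effect of partitionBy concrete, here is a Spark-free sketch in plain Python of the directory layout such a write produces. It assumes the Hive-style naming that Spark uses for partitioned output (one `mykey=<value>` directory per distinct key, with the key column kept in the path rather than in the stored rows). The helper name `partition_layout`, the sample pairs, and the `/tmp/out` path are hypothetical and only there for illustration.

```python
from collections import defaultdict


def partition_layout(pairs, base_path, key_column="mykey"):
    """Group (key, (value1, value2)) pairs by the directory a partitioned write would use.

    Hypothetical helper: it only models the layout and writes no files.
    """
    layout = defaultdict(list)
    for key, (value1, value2) in pairs:
        # Hive-style partition directory: <base_path>/<column>=<value>
        directory = f"{base_path}/{key_column}={key}"
        # The key lives in the directory name, so only the values are kept per row.
        layout[directory].append((value1, value2))
    return dict(layout)


# Sample pair RDD contents, in the (mykey, (value1, value2)) shape from the question.
pairs = [("a", ("x1", "y1")), ("b", ("x2", "y2")), ("a", ("x3", "y3"))]

for directory, rows in sorted(partition_layout(pairs, "/tmp/out").items()):
    print(directory, rows)
```

Note that this gives one directory per key under a single output path, not a separate table literally named after each key. Pointing a Parquet reader at one of those directories should return just the [value1, value2] rows for that key.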

Partition RDD by key to create DataFrames

2016-03-15 Thread Mohamed Nadjib MAMI

Hi,

I have a pair RDD of the form (mykey, (value1, value2)).

How can I create a DataFrame with the schema [V1 String, V2 String] to store [value1, value2], and save it into a Parquet table named "mykey"?

The createDataFrame() method takes an RDD and a schema (StructType) as parameters. The