Re: pyspark split pair rdd to multiple

2016-04-20 Thread Gourav Sengupta
Hi,

You do not need to do anything with the RDD at all. Just follow the instructions in this repository: https://github.com/databricks/spark-csv, and everything will be fast and smooth. Remember that if the data is large, converting an RDD to a DataFrame can take a very long time.

Re: pyspark split pair rdd to multiple

2016-04-20 Thread Wei Chen
Let's assume K is String and V is Integer:

    from pyspark.sql.types import (StructType, StructField,
                                   StringType, IntegerType, ArrayType)
    from pyspark.sql.functions import udf

    schema = StructType([StructField("K", StringType(), True),
                         StructField("V", IntegerType(), True)])
    df = sqlContext.createDataFrame(rdd, schema=schema)

    udf1 = udf(lambda x: [x], ArrayType(IntegerType()))
    df1 = df.select("K", udf1("V").alias("arrayV"))

Re: pyspark split pair rdd to multiple

2016-04-20 Thread patcharee
I can also use a DataFrame. Any suggestions?

Best,
Patcharee

On 20 April 2016 10:43, Gourav Sengupta wrote:
> Is there any reason why you are not using data frames?
>
> Regards,
> Gourav

Re: pyspark split pair rdd to multiple

2016-04-20 Thread Gourav Sengupta
Is there any reason why you are not using data frames?

Regards,
Gourav

On Tue, Apr 19, 2016 at 8:51 PM, pth001 wrote:
> Hi,
>
> How can I split a pair RDD [K, V] to a map [K, Array(V)] efficiently in
> PySpark?
>
> Best,
> Patcharee

pyspark split pair rdd to multiple

2016-04-19 Thread pth001
Hi,

How can I split a pair RDD [K, V] to a map [K, Array(V)] efficiently in PySpark?

Best,
Patcharee