Let's assume K is String and V is Integer:

schema = StructType([StructField("K", StringType(), True), StructField("V", IntegerType(), True)])
df = sqlContext.createDataFrame(rdd, schema=schema)
udf1 = udf(lambda x: [x], ArrayType(IntegerType()))
df1 = df.select("K", udf1("V").alias("arrayV"))
df1.show()
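Note that the UDF above wraps each single V in a one-element array. If the goal is instead to collect *all* values for each key into one array, `rdd.groupByKey().mapValues(list)` gives that [K, Array(V)] shape directly at the RDD level. The grouping semantics can be sketched in plain Python (no Spark required; `group_by_key` and the sample `pairs` are illustrative names, not Spark API):

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect every value for each key, mimicking
    rdd.groupByKey().mapValues(list).collectAsMap()."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return dict(grouped)

pairs = [("a", 1), ("b", 2), ("a", 3)]
print(group_by_key(pairs))  # → {'a': [1, 3], 'b': [2]}
```

Keep in mind that `groupByKey` shuffles all values for a key to one executor, so for large value lists per key an aggregation (e.g. `aggregateByKey`) may scale better.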
On Tue, Apr 19, 2016 at 12:51 PM, pth001 <patcharee.thong...@uni.no> wrote:
> Hi,
>
> How can I split a pair RDD [K, V] to a map [K, Array(V)] efficiently in
> PySpark?
>
> Best,
> Patcharee

--
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984