Let's assume K is String and V is Integer:

schema = StructType([StructField("K", StringType(), True), StructField("V", IntegerType(), True)])
df = sqlContext.createDataFrame(rdd, schema=schema)
udf1 = udf(lambda x: [x], ArrayType(IntegerType()))
df1 = df.select("K", udf1("V").alias("arrayV"))
df1.show()
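Note that the UDF above wraps each single V in a one-element array. If the goal is instead to collect *all* values for each key into one array, `rdd.groupByKey().mapValues(list)` gives that [K, Array(V)] shape directly at the RDD level. The grouping semantics can be sketched in plain Python (no Spark required; `group_by_key` and the sample `pairs` are illustrative names, not Spark API):

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect every value for each key, mimicking
    rdd.groupByKey().mapValues(list).collectAsMap()."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return dict(grouped)

pairs = [("a", 1), ("b", 2), ("a", 3)]
print(group_by_key(pairs))  # → {'a': [1, 3], 'b': [2]}
```

Keep in mind that `groupByKey` shuffles all values for a key to one executor, so for large value lists per key an aggregation (e.g. `aggregateByKey`) may scale better.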
On Tue, Apr 19, 2016 at 12:51 PM, pth001 <patcharee.thong...@uni.no> wrote:
> Hi,
>
> How can I split a pair RDD [K, V] to a map [K, Array(V)] efficiently in
> PySpark?
>
> Best,
> Patcharee

--
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984