Hi,
you do not need to do anything with the RDD at all. Just follow the
instructions at https://github.com/databricks/spark-csv and everything
will be super fast and smooth.
Remember that if the data is large, converting an RDD to a DataFrame
takes a very long time.
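For reference, the spark-csv README describes loading the CSV straight into a DataFrame, so the RDD step is skipped entirely. A minimal sketch (the path 'cars.csv' and the options are placeholders along the lines of that README):

# load the CSV directly as a DataFrame via the spark-csv package
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('cars.csv')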
Let's assume K is a String and V is an Integer:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
from pyspark.sql.functions import udf

schema = StructType([StructField("K", StringType(), True),
                     StructField("V", IntegerType(), True)])
df = sqlContext.createDataFrame(rdd, schema=schema)

# wrap each V in a one-element array
udf1 = udf(lambda x: [x], ArrayType(IntegerType()))
df1 = df.select("K", udf1("V").alias("arrayV"))

I can also use DataFrames. Any suggestions?
Best,
Patcharee
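For the [K, Array(V)] shape specifically, one possible DataFrame route is groupBy plus collect_list. A minimal sketch, assuming collect_list is available in your version (on Spark 1.x it may require a HiveContext):

from pyspark.sql.functions import collect_list

# collect every V for a given K into one array column
df_grouped = df.groupBy("K").agg(collect_list("V").alias("arrayV"))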
On 20 April 2016 10:43, Gourav Sengupta wrote:
Is there any reason why you are not using data frames?
Regards,
Gourav
On Tue, Apr 19, 2016 at 8:51 PM, pth001 wrote:
> Hi,
>
> How can I split a pair RDD [K, V] into a map [K, Array(V)] efficiently in
> PySpark?
>
> Best,
> Patcharee
>
>
Hi,
How can I split a pair RDD [K, V] into a map [K, Array(V)] efficiently in PySpark?
Best,
Patcharee
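For the plain-RDD route, the usual approach is groupByKey followed by mapValues(list). A minimal sketch (the sample data is made up for illustration; sc is an existing SparkContext):

# pair RDD of [K, V]
rdd = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])

# collect all values per key into a list, then bring it back as a dict
grouped = rdd.groupByKey().mapValues(list)
result = grouped.collectAsMap()   # {"a": [1, 2], "b": [3]}

If a further aggregation per key is the real goal, reduceByKey or aggregateByKey avoids materialising the full value lists.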