Hello,

I'm migrating some RDD-based code to DataFrames. We've seen massive
speedups so far!

One of the operations in the old code creates an array of the values for
each key, as follows:

val collatedRDD = valuesRDD
  .mapValues(value => Array(value))
  .reduceByKey((array1, array2) => array1 ++ array2)

I was wondering if there is a similar way to achieve this directly via
the DataFrame API, or whether we need to fall back to RDD operations on
the DataFrame to get this functionality?

From what I've seen, all the SQL aggregations output a single value, and
slices output a single array of rows. To rephrase my question: I'm
wondering whether there is some way to use aggregation or slicing on a
DataFrame to output a collection (RDD, array, etc.) of arrays, with one
array for each distinct value in a given column of the DataFrame.
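To make this concrete, here's a rough sketch of the kind of thing I'm
hoping is possible, based on the collect_list aggregation I've seen
mentioned (I'm not sure which Spark versions expose it in
org.apache.spark.sql.functions, and valuesDF plus the "key"/"value"
column names are just placeholders for our actual data):

import org.apache.spark.sql.functions.collect_list

// Group by the key column and gather all the values for each key
// into a single array column, analogous to the reduceByKey above.
val collatedDF = valuesDF
  .groupBy("key")
  .agg(collect_list("value").as("values"))

If something along those lines works, it would save us dropping back to
RDD operations.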
