Hello everybody,
I have two questions in one. I upgraded from Spark 1.1 to 1.3, and some parts
of my code using groupBy became really slow.
*1/* Why is groupBy on an RDD so slow compared to groupBy on a DataFrame?
// DataFrame: runs in a few seconds
val result = table.groupBy("col1").count

// RDD: takes hours, with a lot of "spilling in-memory"
val schemaOriginel = table.schema
val result = table.rdd.groupBy { r =>
  val rs = RowSchema(r, schemaOriginel)
  rs.getValueByName("col1")
}.map(l => (l._1, l._2.size)).count()
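For comparison, the per-key count can also be written without materializing the groups at all, which I understand is closer to what the DataFrame version does with partial aggregation. A sketch, assuming the same RowSchema helper as above:

```scala
val counts = table.rdd
  .map { r =>
    val rs = RowSchema(r, schemaOriginel)
    (rs.getValueByName("col1"), 1L)
  }
  .reduceByKey(_ + _) // combines locally on each partition before the shuffle
  .count()
```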
*2/* My goal is to group by a key, then order each group over a column,
and finally add the row number within each group. I had this code running
before moving to Spark 1.3 and it worked fine, but since I changed to
DataFrame it is really slow.
val schemaOriginel = table.schema
val result = table.rdd.groupBy { r =>
  val rs = RowSchema(r, schemaOriginel)
  rs.getValueByName("col1")
}.flatMap { l =>
  l._2.toList
    .sortBy { u =>
      val rs = RowSchema(u, schemaOriginel)
      (rs.getValueByName("col1"), rs.getValueByName("col2"))
    }
    .zipWithIndex
}
/I think the SQL equivalent of what I am trying to do is:/
SELECT a,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY b) AS num
FROM table
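The numbering step itself is just an in-group sort followed by zipWithIndex. As a toy illustration in plain Scala collections (hypothetical data, not Spark-specific):

```scala
// Row numbers within one group, ordered by the second field
val group = List(("a", 3), ("a", 1), ("a", 2))
val numbered = group.sortBy(_._2).zipWithIndex
// numbered == List((("a",1),0), (("a",2),1), (("a",3),2))
```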
I don't think I can do this with a GroupedData (the result of df.groupBy). Any
ideas on how I can speed this up?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-groupBy-vs-RDD-groupBy-tp22995.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]