Hi Michael
This is great info. I am currently using repartitionAndSortWithinPartitions
to achieve the same. Is this the recommended way on 1.3 (until window
functions arrive in 1.4), or is there a better way?
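
For context, roughly what I do today with repartitionAndSortWithinPartitions
(a sketch only; the column names, key types, and partition count here are
illustrative, not my real schema):

import org.apache.spark.Partitioner

// Composite key: (groupKey, sortCol). Partition on groupKey only, so
// every row of a group lands in one partition, while the full composite
// key drives the per-partition sort.
val pairs = rdd.map { case (key: String, sortCol: String, value: Int) =>
  ((key, sortCol), value)
}

val byGroup = new Partitioner {
  def numPartitions = 200
  def getPartition(k: Any) = {
    val h = k.asInstanceOf[(String, String)]._1.hashCode
    (h % numPartitions + numPartitions) % numPartitions  // always non-negative
  }
}

val numbered = pairs
  .repartitionAndSortWithinPartitions(byGroup)
  .mapPartitions { iter =>
    var current: String = null
    var n = 0
    iter.map { case ((key, sortCol), value) =>
      if (key != current) { current = key; n = 0 }
      n += 1
      (key, sortCol, value, n) // row number within the group
    }
  }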
On 23 May 2015 07:38, "Michael Armbrust" <[email protected]> wrote:
> DataFrames have a lot more information about the data, so there is a whole
> class of optimizations that are possible there that we cannot do in RDDs.
> This is why we are focusing a lot of effort on this part of the project.
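> On the first question below: rdd.groupBy materializes every group in
> memory, which is what causes the spilling you see. For a plain count per
> key, reduceByKey is much cheaper. A sketch, assuming the RowSchema helper
> and schemaOriginel from your mail:
>
> val counts = table.rdd
>   .map(r => (RowSchema(r, schemaOriginel).getValueByName("col1"), 1))
>   .reduceByKey(_ + _)
>   .collect()
>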
> In Spark 1.4 you can accomplish what you want using the new window function
> feature. This can be done with SQL as you described or directly on a
> DataFrame:
>
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.expressions._
>
> val df = Seq(("a", 1), ("b", 1), ("c", 2), ("d", 2)).toDF("x", "y")
> df.select('x, 'y,
>   rowNumber.over(Window.partitionBy("y").orderBy("x")).as("number")).show()
>
> +-+-+------+
> |x|y|number|
> +-+-+------+
> |a|1| 1|
> |b|1| 2|
> |c|2| 1|
> |d|2| 2|
> +-+-+------+
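>
> The same query can be written in SQL (a sketch; in 1.4 window functions
> require a HiveContext, and the temp table name here is just an example):
>
> df.registerTempTable("t")
> sqlContext.sql("""
>   SELECT x, y,
>          ROW_NUMBER() OVER (PARTITION BY y ORDER BY x) AS number
>   FROM t""").show()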
>
> On Fri, May 22, 2015 at 3:35 AM, gtanguy <[email protected]>
> wrote:
>
>> Hello everybody,
>>
>> I have two questions in one. I upgraded from Spark 1.1 to 1.3, and some
>> parts of my code using groupBy became really slow.
>>
>> *1/* Why is groupBy on an RDD so slow compared to groupBy on a
>> DataFrame?
>>
>> // DataFrame: runs in a few seconds
>> val result = table.groupBy("col1").count
>>
>> // RDD: takes hours, with a lot of "spilling in-memory"
>> val schemaOriginel = table.schema
>> val result = table.rdd.groupBy { r =>
>>   val rs = RowSchema(r, schemaOriginel)
>>   rs.getValueByName("col1")
>> }.map(l => (l._1, l._2.size)).count()
>>
>>
>> *2/* My goal is to group by a key, then order each group over a column,
>> and finally add the row number within each group. This code ran fine
>> before I upgraded to Spark 1.3, but now that the table is a DataFrame it
>> is really slow.
>>
>> val schemaOriginel = table.schema
>> val result = table.rdd.groupBy { r =>
>>   val rs = RowSchema(r, schemaOriginel)
>>   rs.getValueByName("col1")
>> }.flatMap { l =>
>>   l._2.toList.sortBy { u =>
>>     val rs = RowSchema(u, schemaOriginel)
>>     val col1 = rs.getValueByName("col1")
>>     val col2 = rs.getValueByName("col2")
>>     (col1, col2)
>>   }.zipWithIndex
>> }
>>
>> I think this is the SQL equivalent of what I am trying to do:
>>
>> SELECT a,
>>        ROW_NUMBER() OVER (PARTITION BY a) AS num
>> FROM table
>>
>>
>> I don't think I can do this with a GroupedData (the result of
>> df.groupBy). Any ideas on how I can speed this up?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-groupBy-vs-RDD-groupBy-tp22995.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>