Hi Michael
This is great info. I am currently using repartitionAndSortWithinPartitions
to achieve the same. Is this the recommended way on 1.3 (until window
functions arrive in 1.4), or is there a better way?
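
For context, roughly what I do today with repartitionAndSortWithinPartitions
(a sketch only; the column names, key types, and partition count here are
illustrative, not my real schema):

import org.apache.spark.Partitioner

// Composite key: (groupKey, sortCol). Partition on groupKey only, so
// every row of a group lands in one partition, while the full composite
// key drives the per-partition sort.
val pairs = rdd.map { case (key: String, sortCol: String, value: Int) =>
  ((key, sortCol), value)
}

val byGroup = new Partitioner {
  def numPartitions = 200
  def getPartition(k: Any) = {
    val h = k.asInstanceOf[(String, String)]._1.hashCode
    (h % numPartitions + numPartitions) % numPartitions  // always non-negative
  }
}

val numbered = pairs
  .repartitionAndSortWithinPartitions(byGroup)
  .mapPartitions { iter =>
    var current: String = null
    var n = 0
    iter.map { case ((key, sortCol), value) =>
      if (key != current) { current = key; n = 0 }
      n += 1
      (key, sortCol, value, n) // row number within the group
    }
  }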
On 23 May 2015 07:38, "Michael Armbrust" <[email protected]> wrote:
> DataFrames have a lot more information about the data, so there is a whole
> class of optimizations that are possible there that we cannot do in RDDs.
> This is why we are focusing a lot of effort on this part of the project.
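> On the first question below: rdd.groupBy materializes every group in
> memory, which is what causes the spilling you see. For a plain count per
> key, reduceByKey is much cheaper. A sketch, assuming the RowSchema helper
> and schemaOriginel from your mail:
>
> val counts = table.rdd
>   .map(r => (RowSchema(r, schemaOriginel).getValueByName("col1"), 1))
>   .reduceByKey(_ + _)
>   .collect()
>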
> In Spark 1.4 you can accomplish what you want using the new window function
> feature. This can be done with SQL as you described or directly on a
> DataFrame:
>
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.expressions._
>
> val df = Seq(("a", 1), ("b", 1), ("c", 2), ("d", 2)).toDF("x", "y")
> df.select('x, 'y,
>   rowNumber.over(Window.partitionBy("y").orderBy("x")).as("number")).show()
>
> +-+-+------+
> |x|y|number|
> +-+-+------+
> |a|1| 1|
> |b|1| 2|
> |c|2| 1|
> |d|2| 2|
> +-+-+------+
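>
> The same query can be written in SQL (a sketch; in 1.4 window functions
> require a HiveContext, and the temp table name here is just an example):
>
> df.registerTempTable("t")
> sqlContext.sql("""
>   SELECT x, y,
>          ROW_NUMBER() OVER (PARTITION BY y ORDER BY x) AS number
>   FROM t""").show()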
>
> On Fri, May 22, 2015 at 3:35 AM, gtanguy <[email protected]>
> wrote:
>
>> Hello everybody,
>>
>> I have two questions in one. I upgraded from Spark 1.1 to 1.3, and some
>> parts of my code using groupBy became really slow.
>>
>> *1/* Why is groupBy on an RDD so slow compared to groupBy on a
>> DataFrame?
>>
>> // DataFrame: runs in a few seconds
>> val result = table.groupBy("col1").count
>>
>> // RDD: takes hours, with a lot of "spilling in-memory"
>> val schemaOriginel = table.schema
>> val result = table.rdd.groupBy { r =>
>>   val rs = RowSchema(r, schemaOriginel)
>>   rs.getValueByName("col1")
>> }.map(l => (l._1, l._2.size)).count()
>>
>>
>> *2/* My goal is to group by a key, then order each group over a column,
>> and finally add the row number within each group. This code ran fine
>> before I upgraded to Spark 1.3, but now that the table is a DataFrame it
>> is really slow.
>>
>> val schemaOriginel = table.schema
>> val result = table.rdd.groupBy { r =>
>>   val rs = RowSchema(r, schemaOriginel)
>>   rs.getValueByName("col1")
>> }.flatMap { l =>
>>   l._2.toList.sortBy { u =>
>>     val rs = RowSchema(u, schemaOriginel)
>>     val col1 = rs.getValueByName("col1")
>>     val col2 = rs.getValueByName("col2")
>>     (col1, col2)
>>   }.zipWithIndex
>> }
>>
>> I think this is the SQL equivalent of what I am trying to do:
>>
>> SELECT a,
>>        ROW_NUMBER() OVER (PARTITION BY a) AS num
>> FROM table
>>
>>
>> I don't think I can do this with a GroupedData (the result of
>> df.groupBy). Any ideas on how I can speed this up?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-groupBy-vs-RDD-groupBy-tp22995.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>