[jira] [Updated] (SPARK-21448) Hi dear guys, I have a question about aggregateByKey of pairrrd.

2017-07-19 Thread qihuagao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qihuagao updated SPARK-21448:
-
Description: 
The Java pair RDD has aggregateByKey, which can avoid a full shuffle and so has 
impressive performance.
The aggregateByKey function requires 3 parameters:
# An initial ‘zero’ value that will not affect the total values to be collected.
# A combining function accepting two parameters. The second parameter is merged 
into the first parameter. This function combines/merges values within a 
partition.
# A merging function accepting two parameters. In this case the 
parameters are merged into one. This step merges values across partitions.
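
As a rough illustration of the three parameters listed above, here is a plain-Java sketch that simulates the two-phase aggregation without Spark. The class AggregateByKeySketch and its methods (aggregateByKey, entry, demo) are hypothetical names made up for this example and are not the Spark API; partitions are modelled as ordinary lists, and the zero value is assumed to be immutable.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

class AggregateByKeySketch {

    // Hypothetical helper (not a Spark API): simulates the semantics of
    // aggregateByKey over "partitions" modelled as lists of (key, value) pairs.
    static <K, V, U> Map<K, U> aggregateByKey(
            List<List<Map.Entry<K, V>>> partitions,
            U zero,                      // parameter 1: initial 'zero' value
            BiFunction<U, V, U> seqOp,   // parameter 2: combines within a partition
            BinaryOperator<U> combOp) {  // parameter 3: merges across partitions
        // Phase 1: inside each partition, fold every value into a per-key
        // accumulator that starts from the zero value.
        List<Map<K, U>> partials = new ArrayList<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            Map<K, U> local = new HashMap<>();
            for (Map.Entry<K, V> e : partition) {
                U acc = local.containsKey(e.getKey()) ? local.get(e.getKey()) : zero;
                local.put(e.getKey(), seqOp.apply(acc, e.getValue()));
            }
            partials.add(local);
        }
        // Phase 2: merge the per-key partial results of all partitions.
        Map<K, U> result = new HashMap<>();
        for (Map<K, U> local : partials) {
            for (Map.Entry<K, U> e : local.entrySet()) {
                result.merge(e.getKey(), e.getValue(), combOp);
            }
        }
        return result;
    }

    // Small convenience for building (key, value) pairs.
    static <K, V> Map.Entry<K, V> entry(K key, V value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Sums the values per key over two made-up partitions.
    static Map<String, Integer> demo() {
        List<Map.Entry<String, Integer>> p0 =
                Arrays.asList(entry("a", 1), entry("b", 2), entry("a", 3));
        List<Map.Entry<String, Integer>> p1 =
                Arrays.asList(entry("a", 4), entry("b", 5));
        List<List<Map.Entry<String, Integer>>> partitions = Arrays.asList(p0, p1);
        return aggregateByKey(partitions, 0,
                (acc, v) -> acc + v,   // within a partition
                (x, y) -> x + y);      // across partitions
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch each partition hands over only one partial result per key to the second phase, instead of every individual value.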

For DataFrames, I noticed groupByKey, which can do the same job as 
aggregateByKey but takes no merge functions, so I assume it triggers a 
shuffle operation. Is this true? If so, should we have a function for 
DataFrames with performance like that of aggregateByKey?

Thanks.


> Hi dear guys,  I have a question about aggregateByKey of pairrrd.
> -
>
> Key: SPARK-21448
> URL: https://issues.apache.org/jira/browse/SPARK-21448
> Project: Spark
> Issue Type: Question
> Components: Java API
> Affects Versions: 2.0.0
> Environment: Spark 2.0
> Reporter: qihuagao



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21448) Hi dear guys, I have a question about aggregateByKey of pairrrd.

2017-07-17 Thread qihuagao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qihuagao updated SPARK-21448:
-