[jira] [Updated] (SPARK-21448) Hi dear guys, I have a question about aggregateByKey of pairrrd.
[ https://issues.apache.org/jira/browse/SPARK-21448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

qihuagao updated SPARK-21448:
-----------------------------
    Description:
The Java pair RDD has aggregateByKey, which can avoid a full shuffle and therefore performs impressively.
The aggregateByKey function requires 3 parameters:
# An initial ‘zero’ value that will not affect the total values to be collected.
# A combining function accepting two parameters. The second parameter is merged into the first parameter. This function combines/merges values within a partition.
# A merging function accepting two parameters. In this case the parameters are merged into one. This step merges values across partitions.

For DataFrame, I noticed groupByKey, which can do the same job as aggregateByKey but takes no merge functions, so I assume it triggers a shuffle operation. Is this true? If so, should DataFrame have a function with performance like aggregateByKey?
Thanks.

  was:
The Java pair rrd has aggregateByKey, which can avoid a full shuffle and therefore performs impressively.
The aggregateByKey function requires 3 parameters:
# An initial ‘zero’ value that will not affect the total values to be collected.
# A combining function accepting two parameters. The second parameter is merged into the first parameter. This function combines/merges values within a partition.
# A merging function accepting two parameters. In this case the parameters are merged into one. This step merges values across partitions.

For DataFrame, I noticed groupByKey, which can do the same job as aggregateByKey but takes no merge functions, so I assume it triggers a shuffle operation. Is this true? If so, should DataFrame have a function with performance like aggregateByKey?
Thanks.

> Hi dear guys, I have a question about aggregateByKey of pairrrd.
> ----------------------------------------------------------------
>
>                 Key: SPARK-21448
>                 URL: https://issues.apache.org/jira/browse/SPARK-21448
>             Project: Spark
>          Issue Type: Question
>          Components: Java API
>    Affects Versions: 2.0.0
>         Environment: Spark 2.0
>            Reporter: qihuagao
>
> The Java pair RDD has aggregateByKey, which can avoid a full shuffle and
> therefore performs impressively.
> The aggregateByKey function requires 3 parameters:
> # An initial ‘zero’ value that will not affect the total values to be
> collected.
> # A combining function accepting two parameters. The second parameter is
> merged into the first parameter. This function combines/merges values within
> a partition.
> # A merging function accepting two parameters. In this case the
> parameters are merged into one. This step merges values across partitions.
> For DataFrame, I noticed groupByKey, which can do the same job as
> aggregateByKey but takes no merge functions, so I assume it triggers a
> shuffle operation. Is this true? If so, should DataFrame have a function
> with performance like aggregateByKey?
> Thanks.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21448) Hi dear guys, I have a question about aggregateByKey of pairrrd.
[ https://issues.apache.org/jira/browse/SPARK-21448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

qihuagao updated SPARK-21448:
-----------------------------
    Description:
The Java pair RDD has aggregateByKey, which can avoid a full shuffle and therefore performs impressively.
The aggregateByKey function requires 3 parameters:
# An initial ‘zero’ value that will not affect the total values to be collected.
# A combining function accepting two parameters. The second parameter is merged into the first parameter. This function combines/merges values within a partition.
# A merging function accepting two parameters. In this case the parameters are merged into one. This step merges values across partitions.

For DataFrame, I noticed groupByKey, which can do the same job as aggregateByKey but takes no merge functions, so I assume it triggers a shuffle operation. Is this true? If so, should DataFrame have a function with performance like aggregateByKey?
Thanks.

  was:
The Java pair RDD has aggregateByKey, which can avoid a full shuffle and therefore performs impressively.
The aggregateByKey function requires 3 parameters:
# An initial ‘zero’ value that will not affect the total values to be collected.
# A combining function accepting two parameters. The second parameter is merged into the first parameter. This function combines/merges values within a partition.
# A merging function accepting two parameters. In this case the paremters are merged into one. This step merges values across partitions.

For DataFrame, I noticed groupByKey, which can do the same job as aggregateByKey but takes no merge functions, so I assume it triggers a shuffle operation. Is this true? If so, should DataFrame have a function with performance like aggregateByKey?
Thanks.

> Hi dear guys, I have a question about aggregateByKey of pairrrd.
> ----------------------------------------------------------------
>
>                 Key: SPARK-21448
>                 URL: https://issues.apache.org/jira/browse/SPARK-21448
>             Project: Spark
>          Issue Type: Question
>          Components: Java API
>    Affects Versions: 2.0.0
>         Environment: Spark 2.0
>            Reporter: qihuagao
>
> The Java pair RDD has aggregateByKey, which can avoid a full shuffle and
> therefore performs impressively.
> The aggregateByKey function requires 3 parameters:
> # An initial ‘zero’ value that will not affect the total values to be
> collected.
> # A combining function accepting two parameters. The second parameter is
> merged into the first parameter. This function combines/merges values within
> a partition.
> # A merging function accepting two parameters. In this case the
> parameters are merged into one. This step merges values across partitions.
> For DataFrame, I noticed groupByKey, which can do the same job as
> aggregateByKey but takes no merge functions, so I assume it triggers a
> shuffle operation. Is this true? If so, should DataFrame have a function
> with performance like aggregateByKey?
> Thanks.
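
The three parameters described in the issue can be illustrated without a cluster. The following is only a minimal plain-Java sketch of the contract, not Spark code: the class name `AggregateByKeySketch`, the parameter names `seqOp` and `combOp`, and the representation of partitions as nested lists are all assumptions made for illustration. It also assumes immutable accumulator values, so the same zero value can be reused for every key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

// Plain-Java simulation of the aggregateByKey contract (hypothetical helper,
// NOT the Spark API). Partitions are modelled as in-memory lists.
final class AggregateByKeySketch {

    static <K, V, U> Map<K, U> aggregateByKey(
            List<List<Map.Entry<K, V>>> partitions,
            U zero,                       // parameter 1: initial 'zero' value
            BiFunction<U, V, U> seqOp,    // parameter 2: merge a value into an accumulator, within a partition
            BinaryOperator<U> combOp) {   // parameter 3: merge two accumulators, across partitions

        // Step 1: fold every partition locally. Each partition ends up with
        // at most one accumulator per key.
        List<Map<K, U>> partials = new ArrayList<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            Map<K, U> local = new HashMap<>();
            for (Map.Entry<K, V> kv : partition) {
                // Assumes U is immutable, so sharing 'zero' between keys is safe.
                U acc = local.getOrDefault(kv.getKey(), zero);
                local.put(kv.getKey(), seqOp.apply(acc, kv.getValue()));
            }
            partials.add(local);
        }

        // Step 2: merge the per-partition accumulators. Only these partial
        // results, not the raw values, have to be brought together by key.
        Map<K, U> result = new HashMap<>();
        for (Map<K, U> local : partials) {
            for (Map.Entry<K, U> e : local.entrySet()) {
                result.merge(e.getKey(), e.getValue(), combOp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two simulated partitions of (key, value) pairs.
        List<List<Map.Entry<String, Integer>>> partitions = List.of(
            List.of(Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3)),
            List.of(Map.entry("a", 4), Map.entry("b", 5)));

        // Per-key sum: key "a" maps to 8 and key "b" maps to 7.
        Map<String, Integer> sums =
            aggregateByKey(partitions, 0, Integer::sum, Integer::sum);
        System.out.println(sums);
    }
}
```

In this sketch the accumulator type can differ from the value type, for example counting values per key with a `Long` zero, `(acc, v) -> acc + 1` within a partition, and `Long::sum` across partitions.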