> On March 27, 2015, 6:40 a.m., Mohit Sabharwal wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/spark/converter/CollectedGroupConverter.java, line 43
> > <https://reviews.apache.org/r/32031/diff/2/?file=893978#file893978line43>
> >
> >     Do we necessarily need to coalesce to a single partition in order to count the number of elements in the RDD?
> >
> >     Also, we later do the processing one partition at a time, using mapPartitions:
> >     rdd.toJavaRDD().mapPartitions(collectedGroupFunction, true).rdd();
> >
> >     Isn't the computed count inaccurate? (It represents all elements in the RDD, not just the elements in one partition - and mapPartitions processes one partition at a time.)
> >
> >     Also, are we processing one partition at a time (mapPartitions vs map) here mostly for efficiency?
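The per-partition question above can be illustrated outside Spark. This is a minimal, hypothetical sketch (none of these class or method names come from the Pig patch): it simulates a mapPartitions-style call, which hands the function an iterator over a whole partition, so a collected group can emit complete groups partition-locally, provided all tuples for a key already live in one partition. A plain map-style call would see one tuple at a time and could not do this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch, not the Pig code.
public class MapPartitionsSketch {

    // mapPartitions analogue: the function is invoked once per partition
    // and receives an iterator over that entire partition.
    static <T, R> List<List<R>> mapPartitions(
            List<List<T>> partitions,
            Function<Iterator<T>, List<R>> f) {
        List<List<R>> out = new ArrayList<>();
        for (List<T> p : partitions) {
            out.add(f.apply(p.iterator()));
        }
        return out;
    }

    // Partition-local "collected group": groups consecutive tuples
    // (key, value) by key, assuming the partition is key-clustered.
    static List<String> collectedGroup(Iterator<String[]> it) {
        List<String> groups = new ArrayList<>();
        String key = null;
        List<String> vals = new ArrayList<>();
        while (it.hasNext()) {
            String[] t = it.next();
            if (!t[0].equals(key)) {
                if (key != null) groups.add(key + "=" + vals);
                key = t[0];
                vals = new ArrayList<>();
            }
            vals.add(t[1]);
        }
        if (key != null) groups.add(key + "=" + vals);
        return groups;
    }

    public static void main(String[] args) {
        // Two partitions; all tuples for a given key sit in one partition,
        // which is the precondition collected group relies on.
        List<List<String[]>> parts = Arrays.asList(
            Arrays.asList(new String[]{"a", "1"}, new String[]{"a", "2"}),
            Arrays.asList(new String[]{"b", "3"}));
        System.out.println(mapPartitions(parts, MapPartitionsSketch::collectedGroup));
        // prints [[a=[1, 2]], [b=[3]]]
    }
}
```

Because each invocation sees a full partition, no shuffle is needed; this matches the efficiency rationale for using mapPartitions rather than map.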
Ah, ignore my comment about the count being inaccurate, since it's accumulated across invocations of collectedGroupFunction.

- Mohit


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32031/#review78021
-----------------------------------------------------------


On March 13, 2015, 10:42 a.m., Praveen Rachabattuni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/32031/
> -----------------------------------------------------------
> 
> (Updated March 13, 2015, 10:42 a.m.)
> 
> 
> Review request for pig, liyun zhang and Mohit Sabharwal.
> 
> 
> Bugs: PIG-4193
>     https://issues.apache.org/jira/browse/PIG-4193
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Moved the getNextTuple(boolean proceed) method from POCollectedGroup to POCollectedGroupSpark.
> 
> Collected group, when used with MR, performs the group operation on the map side after making sure all data for the same key exists in a single map. In Spark, this behaviour is achieved with a single map function using the POCollectedGroup operator.
> 
> TODO:
> - Avoid using rdd.count() in CollectedGroupConverter.
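The point about the count being accumulated across invocations can be sketched as follows. This is a hypothetical, self-contained illustration (the class and method names are not from the Pig patch): a count built up across per-partition invocations, in the style of a Spark Accumulator, ends up equal to the total element count, so coalescing the RDD to one partition just to count it would be unnecessary.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the Pig code.
public class AccumulatedCountSketch {

    // Accumulates element counts across one invocation per partition,
    // analogous to updating a Spark Accumulator inside mapPartitions.
    static long countViaPartitions(List<List<Integer>> partitions) {
        AtomicLong count = new AtomicLong();
        for (List<Integer> partition : partitions) {
            // Each per-partition invocation adds only its own size.
            count.addAndGet(partition.size());
        }
        return count.get();
    }

    public static void main(String[] args) {
        // Three "partitions" of a simulated RDD, six elements total.
        List<List<Integer>> partitions = Arrays.asList(
            Arrays.asList(1, 2, 3),
            Arrays.asList(4, 5),
            Arrays.asList(6));
        System.out.println(countViaPartitions(partitions)); // prints 6
    }
}
```

In real Spark the same total is available directly from the rdd.count() action, which sums per-partition counts without any repartitioning; the patch's TODO is about avoiding that extra action entirely.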
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POCollectedGroup.java 7f2f18e52e083b3e8e90ba02d07f12bcbc9be859 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java ca7a45f33320064e22628b40b34be7b9f7b07c36 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/CollectedGroupConverter.java 3d04ba11855c39960e00d6f51b66654d1c70ebad 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/operator/POCollectedGroupSpark.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/plan/SparkCompiler.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/32031/diff/
> 
> 
> Testing
> -------
> 
> Ran TestCollectedGroup; there are no new successes or failures.
> 
> 
> Thanks,
> 
> Praveen Rachabattuni
> 