> On March 27, 2015, 6:40 a.m., Mohit Sabharwal wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/spark/converter/CollectedGroupConverter.java, line 43
> > <https://reviews.apache.org/r/32031/diff/2/?file=893978#file893978line43>
> >
> >     Do we necessarily need to coalesce to a single partition in order to count the number of elements in the RDD?
> >
> >     Also, we later do the processing one partition at a time, using mapPartitions:
> >     rdd.toJavaRDD().mapPartitions(collectedGroupFunction, true).rdd();
> >
> >     Isn't the computed count inaccurate? (It represents all elements in the RDD, not just the elements in one partition - and mapPartitions processes one partition at a time.)
> >
> >     Also, are we processing one partition at a time (mapPartitions vs map) here mostly for efficiency?
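The per-partition question above can be illustrated outside Spark. This is a minimal, hypothetical sketch (none of these class or method names come from the Pig patch): it simulates a mapPartitions-style call, which hands the function an iterator over a whole partition, so a collected group can emit complete groups partition-locally, provided all tuples for a key already live in one partition. A plain map-style call would see one tuple at a time and could not do this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch, not the Pig code.
public class MapPartitionsSketch {

    // mapPartitions analogue: the function is invoked once per partition
    // and receives an iterator over that entire partition.
    static <T, R> List<List<R>> mapPartitions(
            List<List<T>> partitions,
            Function<Iterator<T>, List<R>> f) {
        List<List<R>> out = new ArrayList<>();
        for (List<T> p : partitions) {
            out.add(f.apply(p.iterator()));
        }
        return out;
    }

    // Partition-local "collected group": groups consecutive tuples
    // (key, value) by key, assuming the partition is key-clustered.
    static List<String> collectedGroup(Iterator<String[]> it) {
        List<String> groups = new ArrayList<>();
        String key = null;
        List<String> vals = new ArrayList<>();
        while (it.hasNext()) {
            String[] t = it.next();
            if (!t[0].equals(key)) {
                if (key != null) groups.add(key + "=" + vals);
                key = t[0];
                vals = new ArrayList<>();
            }
            vals.add(t[1]);
        }
        if (key != null) groups.add(key + "=" + vals);
        return groups;
    }

    public static void main(String[] args) {
        // Two partitions; all tuples for a given key sit in one partition,
        // which is the precondition collected group relies on.
        List<List<String[]>> parts = Arrays.asList(
            Arrays.asList(new String[]{"a", "1"}, new String[]{"a", "2"}),
            Arrays.asList(new String[]{"b", "3"}));
        System.out.println(mapPartitions(parts, MapPartitionsSketch::collectedGroup));
        // prints [[a=[1, 2]], [b=[3]]]
    }
}
```

Because each invocation sees a full partition, no shuffle is needed; this matches the efficiency rationale for using mapPartitions rather than map.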
Ah, ignore my comment about the count being inaccurate, since it's accumulated across invocations of collectedGroupFunction.

- Mohit


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32031/#review78021
-----------------------------------------------------------


On March 13, 2015, 10:42 a.m., Praveen Rachabattuni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/32031/
> -----------------------------------------------------------
> 
> (Updated March 13, 2015, 10:42 a.m.)
> 
> 
> Review request for pig, liyun zhang and Mohit Sabharwal.
> 
> 
> Bugs: PIG-4193
>     https://issues.apache.org/jira/browse/PIG-4193
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Moved the getNextTuple(boolean proceed) method from POCollectedGroup to POCollectedGroupSpark.
> 
> Collected group, when used with MR, performs the group operation on the map side after making sure all data for the same key exists in a single map. In Spark, this behaviour is achieved with a single map function using the POCollectedGroup operator.
> 
> TODO:
> - Avoid using rdd.count() in CollectedGroupConverter.
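The point about the count being accumulated across invocations can be sketched as follows. This is a hypothetical, self-contained illustration (the class and method names are not from the Pig patch): a count built up across per-partition invocations, in the style of a Spark Accumulator, ends up equal to the total element count, so coalescing the RDD to one partition just to count it would be unnecessary.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the Pig code.
public class AccumulatedCountSketch {

    // Accumulates element counts across one invocation per partition,
    // analogous to updating a Spark Accumulator inside mapPartitions.
    static long countViaPartitions(List<List<Integer>> partitions) {
        AtomicLong count = new AtomicLong();
        for (List<Integer> partition : partitions) {
            // Each per-partition invocation adds only its own size.
            count.addAndGet(partition.size());
        }
        return count.get();
    }

    public static void main(String[] args) {
        // Three "partitions" of a simulated RDD, six elements total.
        List<List<Integer>> partitions = Arrays.asList(
            Arrays.asList(1, 2, 3),
            Arrays.asList(4, 5),
            Arrays.asList(6));
        System.out.println(countViaPartitions(partitions)); // prints 6
    }
}
```

In real Spark the same total is available directly from the rdd.count() action, which sums per-partition counts without any repartitioning; the patch's TODO is about avoiding that extra action entirely.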
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POCollectedGroup.java 7f2f18e52e083b3e8e90ba02d07f12bcbc9be859 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java ca7a45f33320064e22628b40b34be7b9f7b07c36 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/CollectedGroupConverter.java 3d04ba11855c39960e00d6f51b66654d1c70ebad 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/operator/POCollectedGroupSpark.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/plan/SparkCompiler.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/32031/diff/
> 
> 
> Testing
> -------
> 
> Ran TestCollectedGroup; there are no new successes or failures.
> 
> 
> Thanks,
> 
> Praveen Rachabattuni
> 