[ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1932:
----------------------------

    Status: Open  (was: Patch Available)

Daniel convinced me I should use the parallelism value from the cross, since 
what really matters here is how many join groups it creates.  You want to 
create enough groups to keep every reducer busy.
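
To make the relationship between the parallelism value and the number of 
join groups concrete, here is a minimal Java sketch.  It is not the code in 
the patch.  It assumes that a non-positive value means no parallel clause 
was given, and that the number of key values per input is the 
parallelism's numInputs-th root, rounded up.  The class and method names 
are made up for this illustration.

```java
class CrossGroupCount {
    // Hard-wired default quoted in the issue description.
    static final int DEFAULT_PARALLELISM = 96;

    // Assumption: a non-positive value means no parallel clause was given,
    // so we fall back to the default.
    static int effectiveParallelism(int requestedParallel) {
        return requestedParallel > 0 ? requestedParallel : DEFAULT_PARALLELISM;
    }

    // Assumption: key values per input = ceil(parallelism^(1/numInputs)),
    // so the total number of join groups is that value raised to numInputs.
    static int joinGroups(int parallelism, int numInputs) {
        int perInput = (int) Math.ceil(Math.pow(parallelism, 1.0 / numInputs));
        int groups = 1;
        for (int i = 0; i < numInputs; i++) {
            groups *= perInput;
        }
        return groups;
    }

    public static void main(String[] args) {
        int p = effectiveParallelism(40);
        System.out.println(p + " reducers, " + joinGroups(p, 2) + " join groups");
    }
}
```

Under these assumptions a two-way cross with parallel 40 gets 7 key values 
per input and 49 join groups, so there are at least as many groups as 
reducers.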

> GFCross should allow the user to set the DEFAULT_PARALLELISM value
> ------------------------------------------------------------------
>
>                 Key: PIG-1932
>                 URL: https://issues.apache.org/jira/browse/PIG-1932
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: PIG-1932.patch
>
>
> The internal UDF GFCross uses a final static int DEFAULT_PARALLELISM to 
> determine how wide to spread the records in a cross.  It is currently 
> hard-wired to 96.  There are no comments in the code on how that value was 
> settled on.  Despite the name, this value is not necessarily related to the 
> reduce parallelism controlled by the parallel clause.  It controls how many 
> artificial join key values are generated and how many times each record is 
> duplicated before going through the join.  The higher it is set, the more 
> key values there are (and thus the less likely the cross will run out of 
> memory), but also the more times each record is duplicated in the map phase 
> before being sent to the reduce.
> We should leave the default value at 96 but allow a property to override this 
> default and change the value.
> We cannot use a constructor argument here because the use of the UDF is not 
> exposed to the user, so the user has no opportunity to pass a constructor 
> argument to it.
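
The trade-off described above (more key values versus more duplication) 
and the proposed property override can be sketched as follows.  This is an 
illustration under stated assumptions, not GFCross itself: the property 
name example.cross.parallelism is hypothetical, and the key layout (one 
random coordinate for the record's own input, every value for the other 
inputs) is an assumed scheme that matches the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.Random;

class CrossKeySketch {
    // Default quoted in the issue description.
    static final int DEFAULT_PARALLELISM = 96;
    // Hypothetical property name, used only for this illustration.
    static final String PARALLELISM_PROP = "example.cross.parallelism";

    // Use the property when present, otherwise keep the default.
    static int parallelismFrom(Properties props) {
        String v = props.getProperty(PARALLELISM_PROP);
        return v == null ? DEFAULT_PARALLELISM : Integer.parseInt(v.trim());
    }

    // Artificial join keys for one record of input `inputIndex`.
    // The record gets one random coordinate in its own dimension and is
    // duplicated once for every combination of the other dimensions.
    static List<int[]> keysFor(int inputIndex, int numInputs,
                               int parallelism, Random rnd) {
        int perInput = (int) Math.ceil(Math.pow(parallelism, 1.0 / numInputs));
        int own = rnd.nextInt(perInput);
        int copies = 1;
        for (int i = 0; i < numInputs - 1; i++) {
            copies *= perInput;
        }
        List<int[]> keys = new ArrayList<>();
        for (int c = 0; c < copies; c++) {
            int[] key = new int[numInputs];
            int rest = c;
            for (int d = 0; d < numInputs; d++) {
                if (d == inputIndex) {
                    key[d] = own;
                } else {
                    key[d] = rest % perInput;
                    rest /= perInput;
                }
            }
            keys.add(key);
        }
        return keys;
    }
}
```

With the default of 96 and two inputs, this scheme gives 10 key values per 
input, so each record is emitted 10 times in the map phase.  Raising the 
value spreads records over more keys at the cost of more copies per record.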

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
