[ 
https://issues.apache.org/jira/browse/PIG-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447086#comment-13447086
 ] 

Dmitriy V. Ryaboy commented on PIG-2661:
----------------------------------------

Ok, for TestSkewedJoin, I think I know what's going on but not how to fix it.

Here's the explain plan for the sampler job after this patch:

{code}
MapReduce node scope-24
Map Plan
Local Rearrange[tuple]{tuple}(false) - scope-27
|   |
|   Constant(all) - scope-26
|
|---New For Each(true,true)[tuple] - scope-25
    |   |
    |   Project[bytearray][0] - scope-14
    |   |
    |   POUserFunc(org.apache.pig.impl.builtin.GetMemNumRows)[tuple] - scope-22
    |   |
    |   |---Project[tuple][*] - scope-21
    |
    |---A: New For Each(false,false,false)[bag] - scope-55
        |   |
        |   Project[bytearray][0] - scope-52
        |   |
        |   Project[bytearray][1] - scope-53
        |   |
        |   Project[bytearray][2] - scope-54
        |
        |---Load(hdfs://localhost:58995/user/dmitriy/SkewedJoinInput1.txt:org.apache.pig.impl.builtin.PoissonSampleLoader('org.apache.pig.builtin.PigStorage','100')) - scope-23--------
{code}

Here are the corresponding bits prior to the patch:

{code}

MapReduce node scope-18
Map Plan
Store(hdfs://localhost:59383/tmp/temp220048876/tmp99560328:org.apache.pig.impl.io.InterStorage) - scope-20
|
|---A: New For Each(false,false,false)[bag] - scope-7
    |   |
    |   Project[bytearray][0] - scope-1
    |   |
    |   Project[bytearray][1] - scope-3
    |   |
    |   Project[bytearray][2] - scope-5
    |
    |---A: Load(hdfs://localhost:59383/user/dmitriy/SkewedJoinInput1.txt:org.apache.pig.builtin.PigStorage) - scope-0--------
Global sort: false
----------------

MapReduce node scope-24
Map Plan
Local Rearrange[tuple]{tuple}(false) - scope-27
|   |
|   Constant(all) - scope-26
|
|---New For Each(true,true)[tuple] - scope-25
    |   |
    |   Project[bytearray][0] - scope-14
    |   |
    |   POUserFunc(org.apache.pig.impl.builtin.GetMemNumRows)[tuple] - scope-22
    |   |
    |   |---Project[tuple][*] - scope-21
    |
    |---Load(hdfs://localhost:59383/tmp/temp220048876/tmp99560328:org.apache.pig.impl.builtin.PoissonSampleLoader('org.apache.pig.impl.io.InterStorage','100'))

{code}

What's happening is that the foreach that generates the first 3 columns, which 
Pig now adds to make sure types etc. work, sits between the sample loader 
(PoissonSampleLoader) and the GetMemNumRows UDF. The sample loader adds a couple 
of columns to the last tuple it outputs, with some stats about the dataset it 
saw. With the projection between it and GetMemNumRows, those extra columns get 
dropped, and GetMemNumRows breaks down completely: it assumes that each sample 
occurs 0 times, and the skewed join just turns into a regular join. We have to 
either get rid of the foreach, or add the columns that PoissonSampleLoader adds 
to the foreach.
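
To make the failure mode concrete, here is a toy model of it in Python. It is 
only a sketch and not Pig code: sample_loader, project_first, read_row_count 
and STATS_MARKER are made-up names, and the (marker, row count) layout of the 
extra columns is an assumption for illustration. It only shows how a 
fixed-width projection placed after the loader hides the trailing stats from 
whatever consumes them downstream.

```python
# Toy model only: none of these names are Pig APIs, and the layout of the
# extra columns (marker, row_count) is an assumption made for illustration.
STATS_MARKER = "__stats__"  # hypothetical marker value


def sample_loader(rows):
    """Yield each row as a tuple; append (marker, row_count) to the last one."""
    rows = list(rows)
    for i, row in enumerate(rows):
        if i == len(rows) - 1:
            yield tuple(row) + (STATS_MARKER, len(rows))
        else:
            yield tuple(row)


def project_first(tuples, n):
    """Stand-in for the inserted foreach: keep only the first n columns."""
    for t in tuples:
        yield t[:n]


def read_row_count(tuples):
    """Stand-in for the stats consumer: return the marked row count, else 0."""
    count = 0
    for t in tuples:
        if len(t) >= 2 and t[-2] == STATS_MARKER:
            count = t[-1]
    return count


rows = [("a", "1", "x"), ("b", "2", "y"), ("c", "3", "z")]

# Loader feeds the consumer directly: the stats arrive intact.
direct = read_row_count(sample_loader(rows))

# A 3-column projection in between silently drops the trailing stats.
projected = read_row_count(project_first(sample_loader(rows), 3))

# Widening the projection to carry the two extra columns restores them.
widened = read_row_count(project_first(sample_loader(rows), 5))
```

In this model the second case is the post-patch plan, and the third 
corresponds to the "add the columns to the foreach" option; removing 
project_first entirely corresponds to getting rid of the foreach.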
                
> Pig uses an extra job for loading data in Pigmix L9
> ---------------------------------------------------
>
>                 Key: PIG-2661
>                 URL: https://issues.apache.org/jira/browse/PIG-2661
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Jie Li
>            Assignee: Jie Li
>         Attachments: PIG-2661.0.patch, PIG-2661.1.patch, PIG-2661.2.patch, 
> PIG-2661.3.patch, PIG-2661.4.patch, PIG-2661.5.patch, PIG-2661.6.patch, 
> PIG-2661.7.patch, PIG-2661.8.patch, PIG-2661.plan.txt
>
>
> See 
> https://issues.apache.org/jira/browse/PIG-200?focusedCommentId=13260155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13260155
