JOIN after SAMPLE appears to resample the original dataset, leads to major 
confusion
------------------------------------------------------------------------------------

                 Key: PIG-2133
                 URL: https://issues.apache.org/jira/browse/PIG-2133
             Project: Pig
          Issue Type: Bug
          Components: build
    Affects Versions: 0.8.1
         Environment: CentOS release 5.5 (Final)
            Reporter: Gabor Szabo


The following example illustrates the problem:

data1 = LOAD ...
sampled_data1 = SAMPLE data1 0.1;
STORE sampled_data1 INTO 'sampled_data1.file';

data2 = LOAD ...
joined = JOIN sampled_data1 BY field1, data2 BY field2;
STORE joined INTO 'joined.file';

What I found is that records in joined.file DO NOT appear in 
sampled_data1.file, although every record of joined comes from sampled_data1 
and should therefore be present there. The execution steps seem to indicate 
that the 2nd MR step generates a new sample from data1 before joining it with 
data2. Therefore, if the sampling rate is low, most of the records of joined 
will not be found in sampled_data1.file.
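
To make the size of the effect concrete, here is a minimal simulation sketch in 
Python. It is not Pig code and does not reflect how Pig implements SAMPLE; it 
only assumes that SAMPLE keeps each record independently with the given 
probability, and that the STORE and the JOIN each see their own independent 
sample of data1. The names sample and overlap_fraction are hypothetical helpers 
made up for this illustration.

```python
import random


def sample(records, rate, rng):
    """Keep each record independently with probability `rate`.

    Assumption: this models SAMPLE as simple Bernoulli sampling.
    """
    return [r for r in records if rng.random() < rate]


def overlap_fraction(n_records=100_000, rate=0.1, seed=0):
    """Fraction of the JOIN-side sample that also appears in the stored sample."""
    rng = random.Random(seed)
    data1 = list(range(n_records))

    # Sample written out by the first STORE.
    stored = set(sample(data1, rate, rng))
    # Independent second sample, as if the JOIN job re-ran SAMPLE on data1.
    resampled = sample(data1, rate, rng)

    if not resampled:
        return 0.0
    hits = sum(1 for r in resampled if r in stored)
    return hits / len(resampled)


if __name__ == "__main__":
    frac = overlap_fraction()
    print(f"join-side records also present in the stored sample: {frac:.3f}")
```

Under these assumptions, the fraction of join-side records that can also be 
found in the stored sample is roughly the sampling rate itself, so at 0.1 about 
nine out of ten joined records would be missing from sampled_data1.file.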

The workaround (thanks Dmitriy!): force the first STORE to execute, then reload 
the sampled data from disk, by inserting the following after the first STORE:

EXEC;
sampled_data1 = LOAD 'sampled_data1.file';


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
