JOIN after SAMPLE appears to resample the original dataset, leads to major
confusion
------------------------------------------------------------------------------------
Key: PIG-2133
URL: https://issues.apache.org/jira/browse/PIG-2133
Project: Pig
Issue Type: Bug
Components: build
Affects Versions: 0.8.1
Environment: CentOS release 5.5 (Final)
Reporter: Gabor Szabo
The following example illustrates:
data1 = LOAD ...
sampled_data1 = SAMPLE data1 0.1;
STORE sampled_data1 INTO sampled_data1.file;
data2 = LOAD ...
joined = JOIN sampled_data1 BY field1, data2 BY field2;
STORE joined INTO joined.file;
What I found is that records in joined.file DO NOT appear in
sampled_data1.file, although if they could be joined with sampled_data1 in the
first place, they should. The execution steps seem to indicate that the 2nd MR
step generates a new sample from data1 before it joins it with data2. Therefore
if the sampling rate is low we won't find most of the records of joined in
sampled_data1.file .
The fix (thanks Dmitriy!): force flushing and reload the sampled data by
inserting after the first STORE:
EXEC;
sampled_data1 = LOAD sampled_data1.file;
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira