bad data output from STREAM operator in trunk (regression from 0.1.1)
---------------------------------------------------------------------
                 Key: PIG-672
                 URL: https://issues.apache.org/jira/browse/PIG-672
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: types_branch
         Environment: Red Hat Enterprise Linux 4 & 5
                      Hadoop 0.18.2
            Reporter: Daniel Lescohier
            Priority: Critical

In the 0.1.1 release of pig, everything below works fine; the problem is in the trunk version. Here's a brief intro to the workflow (details below):

* I have 174856784 lines of input data; each line is a unique title string.
* I stream the data through `sha1.py`, which outputs the SHA-1 hash of each input line: a string of 40 hexadecimal digits.
* I group on the hash, generate a count for each group, then filter to rows having a count > 1.
* With pig 0.1.1, all of the part-* output files are 0 bytes, because all the hashes are unique.
* I've also verified entirely outside of Hadoop, using sort and uniq, that the hashes are unique.
* A pig trunk checkout with "last changed rev 737863" returns non-empty results; the 7 part-* files are 1.5MB each.
* I've tracked it down to the STREAM operation (details below).
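The expectation behind the workflow above can be sketched in a few lines of Python (my illustration, not part of the report): since the input titles are unique and SHA-1 collisions are vanishingly unlikely at this scale, the hashes must be unique too, so any duplicate hash points at the pipeline rather than the data. The sample titles here are made up for illustration.

```python
# Sketch of the workflow's invariant: unique titles in -> unique hashes out.
# hash_lines mimics what sha1.py (shown below in the report) does per line.
from hashlib import sha1

def hash_lines(lines):
    """One 40-hex-digit SHA-1 per input line, newline stripped first."""
    return [sha1(line.rstrip("\n").encode()).hexdigest() for line in lines]

# Hypothetical sample input: three unique title strings.
titles = ["alpha\n", "beta\n", "gamma\n"]
hashes = hash_lines(titles)

# Unique inputs give unique 40-character hex hashes.
assert len(set(hashes)) == len(titles)
assert all(len(h) == 40 for h in hashes)
```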
Here's the pig-svn-trunk job that produces the hashes:

    set job.name 'title hash';
    DEFINE Cmd `sha1.py` ship('sha1.py');
    row = load '/home/danl/url_title/unique_titles';
    hashes = stream row through Cmd;
    store hashes into '/home/danl/url_title/title_hash';

Here's the pig-0.1.1 job that produces the hashes:

    set job.name 'title hash 011';
    DEFINE Cmd `sha1.py` ship('sha1.py');
    row = load '/home/danl/url_title/unique_titles';
    hashes = stream row through Cmd;
    store hashes into '/home/danl/url_title/title_hash.011';

Here's sha1.py:

    #!/opt/cnet-python/default-2.5/bin/python
    from sys import stdin, stdout
    from hashlib import sha1

    for line in stdin:
        h = sha1()
        h.update(line[:-1])
        stdout.write("%s\n" % h.hexdigest())

Here's the pig-svn-trunk job for finding duplicate hashes in the hash data generated by pig-svn-trunk:

    set job.name 'h40';
    hash = load '/home/danl/url_title/title_hash';
    grouped = group hash by $0 parallel 7;
    counted = foreach grouped generate group, COUNT(hash) as cnt;
    having = filter counted by cnt > 1;
    store having into '/home/danl/url_title/title_hash_collisions/h40';

The seven part-* files in /home/danl/url_title/title_hash_collisions/h40 are 1.5MB each.

Here's the pig-0.1.1 job for finding duplicate hashes in the hash data generated by pig-0.1.1:

    set job.name 'h40.011.nh';
    hash = load '/home/danl/url_title/title_hash.011';
    grouped = group hash by $0 parallel 7;
    counted = foreach grouped generate group, COUNT(hash) as cnt;
    having = filter counted by cnt > 1;
    store having into '/home/danl/url_title/title_hash_collisions/h40.011.nh';

The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.011.nh are 0 bytes each.
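What the dupe-detection jobs above compute can be expressed as a small Python sketch (my illustration, not part of the report): group by the hash column, count each group, and keep only counts greater than 1, mirroring the GROUP / COUNT / FILTER cnt > 1 sequence.

```python
# Sketch of the dupe-finding logic: group by hash, count, filter count > 1.
from collections import Counter

def hash_collisions(hashes):
    """Return sorted (hash, count) pairs for hashes seen more than once."""
    counts = Counter(hashes)
    return sorted((h, n) for h, n in counts.items() if n > 1)

# On correct stream output every hash is unique, so the result is empty:
assert hash_collisions(["aa", "bb", "cc"]) == []
# A duplicated hash, as seen in the trunk output, shows up with its count:
assert hash_collisions(["aa", "bb", "aa"]) == [("aa", 2)]
```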
Here's the pig-0.1.1 job for finding duplicate hashes in the hash data generated by pig-svn-trunk:

    set job.name 'h40.011';
    hash = load '/home/danl/url_title/title_hash';
    grouped = group hash by $0 parallel 7;
    counted = foreach grouped generate group, COUNT(hash) as cnt;
    having = filter counted by cnt > 1;
    store having into '/home/danl/url_title/title_hash_collisions/h40.011';

The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.011 are 1.5MB each. Therefore, it's the hash data generated by pig-svn-trunk (/home/danl/url_title/title_hash) that has duplicates in it.

Here are the first six lines of /home/danl/url_title/title_hash/part-00064. You can see that lines five and six are duplicates. It looks like the stream operator read the same line twice from the Python program? The job that produces the hashes is a map-only job, with no reduces.

    8f3513136b1c8b87b8b73b9d39d96555095e9cdd
    2edb20c5a3862cc5f545ae649f1e26430a38bda4
    ca9c216629fce16b4c113c0d9fcf65f906ab5e04
    03fe80633822215a6935bcf95305bb14adf23f18
    03fe80633822215a6935bcf95305bb14adf23f18
    6d324b2cd1c52f564e2a29fcbf6ae2fb83d2697c

After narrowing it down to the stream operator in pig-svn-trunk, I decided to run the find-dupes job again using pig-svn-trunk, but first piping the data through cat. cat shouldn't change the data at all; it's an identity operation. Here's the job:

    set job.name 'h40.cat';
    DEFINE Cmd `cat`;
    row = load '/home/danl/url_title/title_hash';
    hash = stream row through Cmd;
    grouped = group hash by $0 parallel 7;
    counted = foreach grouped generate group, COUNT(hash) as cnt;
    having = filter counted by cnt > 1;
    store having into '/home/danl/url_title/title_hash_collisions/h40.cat';

The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.cat are 7.2MB each. This 'h40.cat' job should produce the same results as the 'h40' job.
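The duplicated pair in the part-00064 sample above can also be spotted with a quick scan for repeated consecutive lines, sketched here in Python (my illustration, not part of the report; the sample data below is copied from the report):

```python
# Sketch: find lines that are identical to their immediate predecessor,
# the pattern observed in the trunk STREAM output.
def consecutive_duplicates(lines):
    """Yield (line_number, line) for each line equal to the one before it."""
    prev = object()  # sentinel unequal to any string
    for i, line in enumerate(lines, start=1):
        if line == prev:
            yield i, line
        prev = line

# The six lines from part-00064 shown above:
sample = [
    "8f3513136b1c8b87b8b73b9d39d96555095e9cdd",
    "2edb20c5a3862cc5f545ae649f1e26430a38bda4",
    "ca9c216629fce16b4c113c0d9fcf65f906ab5e04",
    "03fe80633822215a6935bcf95305bb14adf23f18",
    "03fe80633822215a6935bcf95305bb14adf23f18",
    "6d324b2cd1c52f564e2a29fcbf6ae2fb83d2697c",
]
# Line 5 repeats line 4, matching the duplicate described in the report.
assert list(consecutive_duplicates(sample)) == [(5, sample[4])]
```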
The 'h40' job had part-* files of 1.5MB each, and this job's are 7.2MB each, so piping through `cat` produced even more duplicates, when `cat` is not supposed to change the results at all.

I also ran, under pig-svn-trunk, a 'title hash.r2' job that re-created the hashes in another directory, just to make sure it wasn't a fluke run that produced duplicate hashes. The second time around, it also produced duplicates. Running the dupe-detection pig job under pig-0.1.1 against the hashes from pig-svn-trunk's second run, I again got 1.5MB output files.

For a final test, I ran, in pig-svn-trunk, the dupe-detection code on the hash data generated by pig-0.1.1:

    set job.name 'h40.trk.nh';
    hash = load '/home/danl/url_title/title_hash.011';
    grouped = group hash by $0 parallel 7;
    counted = foreach grouped generate group, COUNT(hash) as cnt;
    having = filter counted by cnt > 1;
    store having into '/home/danl/url_title/title_hash_collisions/h40.trk.nh';

The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.trk.nh are 0 bytes each. It's clear that it's the stream operation running in pig-svn-trunk that is producing the duplicates.

Here is the complete svn info of the checkout I built pig from:

    Path: .
    URL: http://svn.apache.org/repos/asf/hadoop/pig/trunk
    Repository Root: http://svn.apache.org/repos/asf
    Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
    Revision: 737873
    Node Kind: directory
    Schedule: normal
    Last Changed Author: pradeepkth
    Last Changed Rev: 737863
    Last Changed Date: 2009-01-26 13:27:16 -0800 (Mon, 26 Jan 2009)

When I built it, I also ran all the unit tests. This was all run on Hadoop 0.18.2.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.