bad data output from STREAM operator in trunk (regression from 0.1.1)
---------------------------------------------------------------------
                 Key: PIG-672
                 URL: https://issues.apache.org/jira/browse/PIG-672
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: types_branch
         Environment: Red Hat Enterprise Linux 4 & 5
                      Hadoop 0.18.2
            Reporter: Daniel Lescohier
            Priority: Critical

In the 0.1.1 release of pig, everything below works fine; the problem is in the trunk version. Here's a brief intro to the workflow (details below):

* I have 174856784 lines of input data; each line is a unique title string.
* I stream the data through `sha1.py`, which outputs the SHA-1 hash of each input line: a string of 40 hexadecimal digits.
* I group on the hash, generate a count for each group, then filter to rows having a count > 1.
* With pig 0.1.1, all of the part-* output files are 0 bytes, because all the hashes are unique.
* I've also verified entirely outside of Hadoop, using sort and uniq, that the hashes are unique.
* A pig trunk checkout with "last changed rev 737863" returns non-empty results; the 7 part-* files are 1.5MB each.
* I've tracked it down to the STREAM operation (details below).
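The expectation behind the workflow above can be sketched in a few lines of Python (my illustration, not part of the report): since the input titles are unique and SHA-1 collisions are vanishingly unlikely at this scale, the hashes must be unique too, so any duplicate hash points at the pipeline rather than the data. The sample titles here are made up for illustration.

```python
# Sketch of the workflow's invariant: unique titles in -> unique hashes out.
# hash_lines mimics what sha1.py (shown below in the report) does per line.
from hashlib import sha1

def hash_lines(lines):
    """One 40-hex-digit SHA-1 per input line, newline stripped first."""
    return [sha1(line.rstrip("\n").encode()).hexdigest() for line in lines]

# Hypothetical sample input: three unique title strings.
titles = ["alpha\n", "beta\n", "gamma\n"]
hashes = hash_lines(titles)

# Unique inputs give unique 40-character hex hashes.
assert len(set(hashes)) == len(titles)
assert all(len(h) == 40 for h in hashes)
```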
Here's the pig-svn-trunk job that produces the hashes:

    set job.name 'title hash';
    DEFINE Cmd `sha1.py` ship('sha1.py');
    row = load '/home/danl/url_title/unique_titles';
    hashes = stream row through Cmd;
    store hashes into '/home/danl/url_title/title_hash';

Here's the pig-0.1.1 job that produces the hashes:

    set job.name 'title hash 011';
    DEFINE Cmd `sha1.py` ship('sha1.py');
    row = load '/home/danl/url_title/unique_titles';
    hashes = stream row through Cmd;
    store hashes into '/home/danl/url_title/title_hash.011';

Here's sha1.py:

    #!/opt/cnet-python/default-2.5/bin/python
    from sys import stdin, stdout
    from hashlib import sha1

    for line in stdin:
        h = sha1()
        h.update(line[:-1])
        stdout.write("%s\n" % h.hexdigest())

Here's the pig-svn-trunk job for finding duplicate hashes in the hash data generated by pig-svn-trunk:

    set job.name 'h40';
    hash = load '/home/danl/url_title/title_hash';
    grouped = group hash by $0 parallel 7;
    counted = foreach grouped generate group, COUNT(hash) as cnt;
    having = filter counted by cnt > 1;
    store having into '/home/danl/url_title/title_hash_collisions/h40';

The seven part-* files in /home/danl/url_title/title_hash_collisions/h40 are 1.5MB each.

Here's the pig-0.1.1 job for finding duplicate hashes in the hash data generated by pig-0.1.1:

    set job.name 'h40.011.nh';
    hash = load '/home/danl/url_title/title_hash.011';
    grouped = group hash by $0 parallel 7;
    counted = foreach grouped generate group, COUNT(hash) as cnt;
    having = filter counted by cnt > 1;
    store having into '/home/danl/url_title/title_hash_collisions/h40.011.nh';

The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.011.nh are 0 bytes each.
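What the dupe-detection jobs above compute can be expressed as a small Python sketch (my illustration, not part of the report): group by the hash column, count each group, and keep only counts greater than 1, mirroring the GROUP / COUNT / FILTER cnt > 1 sequence.

```python
# Sketch of the dupe-finding logic: group by hash, count, filter count > 1.
from collections import Counter

def hash_collisions(hashes):
    """Return sorted (hash, count) pairs for hashes seen more than once."""
    counts = Counter(hashes)
    return sorted((h, n) for h, n in counts.items() if n > 1)

# On correct stream output every hash is unique, so the result is empty:
assert hash_collisions(["aa", "bb", "cc"]) == []
# A duplicated hash, as seen in the trunk output, shows up with its count:
assert hash_collisions(["aa", "bb", "aa"]) == [("aa", 2)]
```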
Here's the pig-0.1.1 job for finding duplicate hashes in the hash data generated by pig-svn-trunk:

    set job.name 'h40.011';
    hash = load '/home/danl/url_title/title_hash';
    grouped = group hash by $0 parallel 7;
    counted = foreach grouped generate group, COUNT(hash) as cnt;
    having = filter counted by cnt > 1;
    store having into '/home/danl/url_title/title_hash_collisions/h40.011';

The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.011 are 1.5MB each. Therefore, it's the hash data generated by pig-svn-trunk (/home/danl/url_title/title_hash) that has duplicates in it.

Here are the first six lines of /home/danl/url_title/title_hash/part-00064. You can see that lines five and six are duplicates. It looks like the stream operator read the same line twice from the Python program? The job that produces the hashes is a map-only job, with no reduces.

    8f3513136b1c8b87b8b73b9d39d96555095e9cdd
    2edb20c5a3862cc5f545ae649f1e26430a38bda4
    ca9c216629fce16b4c113c0d9fcf65f906ab5e04
    03fe80633822215a6935bcf95305bb14adf23f18
    03fe80633822215a6935bcf95305bb14adf23f18
    6d324b2cd1c52f564e2a29fcbf6ae2fb83d2697c

After narrowing it down to the stream operator in pig-svn-trunk, I decided to run the find-dupes job again using pig-svn-trunk, but first piping the data through cat. cat shouldn't change the data at all; it's an identity operation. Here's the job:

    set job.name 'h40.cat';
    DEFINE Cmd `cat`;
    row = load '/home/danl/url_title/title_hash';
    hash = stream row through Cmd;
    grouped = group hash by $0 parallel 7;
    counted = foreach grouped generate group, COUNT(hash) as cnt;
    having = filter counted by cnt > 1;
    store having into '/home/danl/url_title/title_hash_collisions/h40.cat';

The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.cat are 7.2MB each. This 'h40.cat' job should produce the same results as the 'h40' job.
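The duplicated pair in the part-00064 sample above can also be spotted with a quick scan for repeated consecutive lines, sketched here in Python (my illustration, not part of the report; the sample data below is copied from the report):

```python
# Sketch: find lines that are identical to their immediate predecessor,
# the pattern observed in the trunk STREAM output.
def consecutive_duplicates(lines):
    """Yield (line_number, line) for each line equal to the one before it."""
    prev = object()  # sentinel unequal to any string
    for i, line in enumerate(lines, start=1):
        if line == prev:
            yield i, line
        prev = line

# The six lines from part-00064 shown above:
sample = [
    "8f3513136b1c8b87b8b73b9d39d96555095e9cdd",
    "2edb20c5a3862cc5f545ae649f1e26430a38bda4",
    "ca9c216629fce16b4c113c0d9fcf65f906ab5e04",
    "03fe80633822215a6935bcf95305bb14adf23f18",
    "03fe80633822215a6935bcf95305bb14adf23f18",
    "6d324b2cd1c52f564e2a29fcbf6ae2fb83d2697c",
]
# Line 5 repeats line 4, matching the duplicate described in the report.
assert list(consecutive_duplicates(sample)) == [(5, sample[4])]
```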
The 'h40' job had part-* files of 1.5MB each, and this job's are 7.2MB each, so piping through `cat` produced even more duplicates, when `cat` is not supposed to change the results at all.

I also ran, under pig-svn-trunk, a 'title hash.r2' job that re-created the hashes in another directory, just to make sure it wasn't a fluke run that produced duplicate hashes. The second time around, it also produced duplicates. Running the dupe-detection pig job under pig-0.1.1 against the hashes from pig-svn-trunk's second run, I again got 1.5MB output files.

For a final test, I ran, in pig-svn-trunk, the dupe-detection code on the hash data generated by pig-0.1.1:

    set job.name 'h40.trk.nh';
    hash = load '/home/danl/url_title/title_hash.011';
    grouped = group hash by $0 parallel 7;
    counted = foreach grouped generate group, COUNT(hash) as cnt;
    having = filter counted by cnt > 1;
    store having into '/home/danl/url_title/title_hash_collisions/h40.trk.nh';

The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.trk.nh are 0 bytes each. It's clear that it's the stream operation running in pig-svn-trunk that is producing the duplicates.

Here is the complete svn info of the checkout I built pig from:

    Path: .
    URL: http://svn.apache.org/repos/asf/hadoop/pig/trunk
    Repository Root: http://svn.apache.org/repos/asf
    Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
    Revision: 737873
    Node Kind: directory
    Schedule: normal
    Last Changed Author: pradeepkth
    Last Changed Rev: 737863
    Last Changed Date: 2009-01-26 13:27:16 -0800 (Mon, 26 Jan 2009)

When I built it, I also ran all the unit tests. This was all run on Hadoop 0.18.2.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.