Hi all,
The script below is how we count the distinct (jid, mid) pairs:
sjv = LOAD '/path/of/the/data' USING AvroStorage();
jv = FOREACH sjv GENERATE TOTUPLE(jid, mid) AS jid_mid, time;
groupv = GROUP jv ALL;
countv = FOREACH groupv {
    unique = DISTINCT jv.jid_mid;
    GENERATE COUNT(unique);
};
dump countv;
The result is 2302351.
If I use the script below, I get a different result:
sjv = LOAD '/path/of/the/data' USING AvroStorage();
groupv = GROUP sjv ALL;
countv = FOREACH groupv {
    jid_mid = sjv.(jid, mid);
    unique = DISTINCT jid_mid;
    GENERATE COUNT(unique);
};
dump countv;
The result is 2290003.
If I concatenate the two columns with a delimiter that never occurs in jid
or mid, I get yet another result, which I think is the correct answer for
this aggregation:
sjv = LOAD '/path/of/the/data' USING AvroStorage();
jv = FOREACH sjv GENERATE CONCAT(jid, CONCAT(':', mid)) AS jid_mid, time;
groupv = GROUP jv ALL;
countv = FOREACH groupv {
    unique = DISTINCT jv.jid_mid;
    GENERATE COUNT(unique);
};
dump countv;
The result is 2386385.
I also ran a test with the script below:
sjv = LOAD '/path/of/the/data' USING AvroStorage();
groupv = GROUP sjv BY (jid, mid);
unique = FOREACH groupv GENERATE FLATTEN(group), MIN(sjv.time);
store unique ....;
The Hadoop counters showed that 2386385 records were written to HDFS.
That number is the same as the result of the third Pig script above.
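For comparison, one more variant could use a top-level DISTINCT instead of a nested one inside FOREACH. This is only a sketch under the same schema assumptions as the scripts above. The relation names pairs, uniq, grp and cnt are made up for illustration, and it has not been run against this data, so no result is listed for it.

```pig
sjv = LOAD '/path/of/the/data' USING AvroStorage();
-- keep only the two key columns
pairs = FOREACH sjv GENERATE jid, mid;
-- top-level DISTINCT, not nested inside a FOREACH block
uniq = DISTINCT pairs;
grp = GROUP uniq ALL;
-- COUNT_STAR also counts tuples whose first field is null; COUNT skips them
cnt = FOREACH grp GENERATE COUNT_STAR(uniq);
DUMP cnt;
```

If this count agreed with the GROUP BY test, that would suggest looking at how the nested DISTINCT and COUNT treat the tuple and bag projections in the first two scripts.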
Can anyone explain the difference among these three scripts? Why do they
lead to different results?
Regards,
Min