count distinct on multiple columns

Min Zhou Tue, 29 Oct 2013 11:09:59 -0700

Hi all,

Below script is how we count distinct on columns jid and mid


sjv =  LOAD '/path/of/the/data' USING AvroStorage();
jv = FOREACH sjv GENERATE TOTUPLE(jid, mid) AS jid_mid, time;
groupv = GROUP jv ALL;
countv = FOREACH groupv {
        unique = DISTINCT jv.jid_mid;
        GENERATE COUNT(unique);
        };
dump countv;
The result is 2302351.

If I use code below, got another result
sjv =  LOAD '/path/of/the/data' USING AvroStorage();
groupv = GROUP sjv ALL;
countv = FOREACH groupv {
        jid_mid = sjv.(jid, mid);
        unique = DISTINCT jid_mid;
        GENERATE COUNT(unique);
        };
dump countv;
The result is 2290003.

If I concat the two columns with a delimiter never exists in jid and mid, I
got another result which I think is the correct answer of this aggregation.
sjv =  LOAD '/path/of/the/data' USING AvroStorage();
jv = FOREACH sjv GENERATE CONCAT(jid, CONCAT(':', mid)) AS jid_mid, time;
groupv = GROUP jv ALL;
countv = FOREACH groupv {
        unique = DISTINCT jv.jid_mid;
        GENERATE COUNT(unique);
        };
dump countv;
The result is 2386385.

I did a test with below script
sjv =  LOAD '/path/of/the/data' USING AvroStorage();
groupv = GROUP jv BY (jid, mid);
unique = FOREACH groupv GENERATE FLATTEN(group), MIN(time);
store unique ....;
The hadoop counters showed that there are 2386385 records written into
HDFS.  The number is as same as the 3rd pig script I list above.

Can anyone explain the difference among those three?  Whey they lead to
different results?

Regards,
Min
-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com

count distinct on multiple columns

Reply via email to