Hi there,

I'm doing a join like this:

A = LOAD '/data/sessions' USING PigStorage(',') AS
(userid:chararray, client_type:chararray, flag:long);

A1 = GROUP A ALL;
A1 = FOREACH A1 GENERATE COUNT(A);
DUMP A1;
(543872)

B = LOAD '/data/userdb' USING PigStorage(',') AS
(uid:chararray, birth_year:int);
A = JOIN A by userid, B by uid;
A1 = GROUP A ALL;
A1 = FOREACH A1 GENERATE COUNT(A);
DUMP A1;
(1079122)

Now the dataset has more rows than before the join, which is basically the
opposite of what I'm expecting, since not every userid in A has a matching
uid in the B dataset.
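
One thing I could check, assuming (unconfirmed) that duplicate uids in B
might be multiplying the matches, is how many rows B holds per uid. A rough
sketch; the aliases B_by_uid, B_counts and B_dups are just names I made up:

B_by_uid = GROUP B BY uid;
B_counts = FOREACH B_by_uid GENERATE group AS uid, COUNT(B) AS n;
-- keep only the uids that occur more than once in B
B_dups = FILTER B_counts BY n > 1;
DUMP B_dups;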

Does anyone have a hint as to what the problem is here?

Thanks,
-Marco
