In general, mapside cogroups are not possible unless the underlying storage mechanism can guarantee that all instances of the key you are cogrouping on are in a single map instance. At this point only Zebra can guarantee that. If you're interested, I can give more details on why join works and cogroup doesn't.

You can do IN for filter without needing a full mapside cogroup. You could implement this via a UDF that loads the small bag into a hash table and probes the table for each record it is passed.
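
A minimal sketch of the hash-table probe described above, written as plain Java so it runs without Pig on the classpath. The class name InSetProbe and its methods are hypothetical, not part of Pig. In an actual UDF this logic would presumably sit inside a class extending Pig's FilterFunc, with the set built once from the small file and probed in exec(); that wiring is an assumption and is not shown here.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not a Pig class): membership test against a small in-memory set. */
final class InSetProbe {
    private final Set<String> keys;

    /** Build the hash table once from the values of the small relation (B's single column). */
    InSetProbe(Collection<String> smallRelationValues) {
        this.keys = new HashSet<>(smallRelationValues);
    }

    /** Probe the table for one record's field. Treating null as "not in the set" is an assumption. */
    boolean contains(String field) {
        return field != null && keys.contains(field);
    }

    /** Return the field if it is in the set, otherwise the given default (e.g. "0"). */
    String valueOrDefault(String field, String defaultValue) {
        return contains(field) ? field : defaultValue;
    }
}
```

Because the set is built once and each probe is a constant-time lookup, the large relation can be filtered in the map phase without a shuffle, provided the small relation fits in memory.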

Alan.

On Jun 1, 2010, at 12:45 AM, BalaSundaraRaman wrote:

Thanks Ankur. But, in my actual case, it's a COGROUP and not a join.
"replicated" can't be used with COGROUP, no?
Any workaround?

- Sundar

"That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture



----- Original Message ----
From: Ankur C. Goel <gan...@yahoo-inc.com>
To: "pig-user@hadoop.apache.org" <pig-user@hadoop.apache.org>
Sent: Tue, June 1, 2010 12:39:56 PM
Subject: Re: Pig facility analogous to SQL's IN?

If the data represented by relation B can fit in memory, then you can simply use a "replicated" join, which is inexpensive and is a map-side join.

C = JOIN A by a2, B by b1 USING "replicated";

-...@nkur


On 5/31/10 3:32 PM, "BalaSundaraRaman" <sundarbe...@yahoo.com> wrote:

Hi,

Is there any operator or UDF in Pig similar to the IN operator of SQL?
Specifically, given a large bag A and a very small single-column bag B, I want to select tuples in A with a field a1 that has its value in B.
My current method of doing it using a JOIN (below) seems very expensive.
grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',') AS (a1:chararray,a2:chararray);
grunt> B = LOAD '/tmp/b.txt' USING PigStorage(',') AS (b1:chararray);
grunt> C = JOIN A by a2, B by b1;

It'll be very useful if such an operator is available for use in FILTER and SPLIT as well.
For example, if I need to substitute '0' when a2 is NOT IN B::b1, currently, there's no easy way, I guess.


Thanks,
Sundar (a Pig n00b)
