Thanks Alan. I'm definitely interested in knowing why it won't work in cogroup 
the same way.

Will try to implement the IN UDF, though I've only written simple eval UDFs 
so far.

- Sundar

 "That language is an instrument of human reason, and not merely a medium for 
the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture



----- Original Message ----
> From: Alan Gates <ga...@yahoo-inc.com>
> To: pig-user@hadoop.apache.org
> Sent: Tue, June 1, 2010 11:02:31 PM
> Subject: Re: Pig facility analogous to SQL's IN?
> 
> In general mapside cogroups are not possible unless the underlying storage 
> mechanism can guarantee that all instances of the key you are cogrouping on 
> are in a single map instance.  At this point only Zebra can guarantee 
> that.  If you're interested I can give more details on why join works and 
> cogroup doesn't.
> 
> You can do IN for filter without needing a full mapside 
> cogroup.  You could implement this via a UDF that loads the small bag into 
> a hash table and probes the table for each record it is 
> passed.
> 
> Alan.
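
A rough sketch of the hash-table probe Alan describes, in plain Java with
only the standard library. The class name InSetProbe and the sample values
are hypothetical and not part of Pig; in a real UDF this logic would
presumably sit inside the exec method of a class extending Pig's FilterFunc,
with the small bag read once from a file or the distributed cache.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Pig: holds the small relation in memory
// and answers membership probes, mirroring SQL's IN.
class InSetProbe {
    private final Set<String> keys;

    // Load every value of the small single-column bag into a hash set once.
    InSetProbe(Iterable<String> smallBag) {
        keys = new HashSet<>();
        for (String value : smallBag) {
            keys.add(value);
        }
    }

    // Probe the set for one record's field; null never matches.
    boolean contains(String field) {
        return field != null && keys.contains(field);
    }

    public static void main(String[] args) {
        // Stand-in for the contents of the small bag B.
        InSetProbe probe = new InSetProbe(Arrays.asList("x", "y"));
        // Stand-in for the a2 field of successive records from A.
        List<String> a2Values = Arrays.asList("x", "q", "y");
        for (String a2 : a2Values) {
            // Substitute "0" when a2 is NOT IN the small bag.
            System.out.println(probe.contains(a2) ? a2 : "0");
        }
    }
}
```

Each probe is a constant-time lookup on average, so the large relation is
scanned once in the map phase and no shuffle is needed, provided the small
bag fits in memory on every map task.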

> On Jun 1, 2010, at 12:45 AM, BalaSundaraRaman wrote:
> 
>> Thanks Ankur. But, in my actual case, it's a COGROUP and not a join.
>> "replicated" can't be used with COGROUP, no?
>> Any workaround?
>> 
>> - Sundar
>> 
>> ----- Original Message ----
>>> From: Ankur C. Goel <gan...@yahoo-inc.com>
>>> To: "pig-user@hadoop.apache.org" <pig-user@hadoop.apache.org>
>>> Sent: Tue, June 1, 2010 12:39:56 PM
>>> Subject: Re: Pig facility analogous to SQL's IN?
>>> 
>>> If data represented by relation B can fit in memory then you can simply use a
>>> "replicated" join, which is inexpensive and is a map-side join.
>>> 
>>> C = JOIN A by a2, B by b1 USING "replicated";
>>> 
>>> -...@nkur
>>> 
>>> On 5/31/10 3:32 PM, "BalaSundaraRaman" <sundarbe...@yahoo.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Is there any operator or UDF in Pig similar to the IN operator of SQL?
>>>> Specifically, given a large bag A and a very small single-column bag B,
>>>> I want to select tuples in A with a field a1 that has its value in B.
>>>> My current method of doing it using a JOIN (below) seems very expensive.
>>>> 
>>>> grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',') AS (a1:chararray,a2:chararray);
>>>> grunt> B = LOAD '/tmp/b.txt' USING PigStorage(',') AS (b1:chararray);
>>>> grunt> C = JOIN A by a2, B by b1;
>>>> 
>>>> It'll be very useful if such an operator is available for use in FILTER
>>>> and SPLIT as well.
>>>> For example, if I need to substitute '0' when a2 is NOT IN B::b1,
>>>> currently, there's no easy way, I guess.
>>>> 
>>>> Thanks,
>>>> Sundar (a Pig n00b)
> 
