Thanks for the help Alan, I really appreciate it. Can you currently extend
interfaces in Python UDFs? I am not super familiar with how Jython and
Python interact in that capacity.

The internal sort in the foreach and using 'collected' (assuming I can
get it to work :) should be big wins.

2011/1/4 Alan Gates <ga...@yahoo-inc.com>

> Answers inline.
>
>
> On Jan 4, 2011, at 11:10 AM, Jonathan Coveney wrote:
>
>> I wasn't quite sure what to title this, but hopefully it'll make sense. I
>> have
>> a couple of questions relating to a query that ultimately seeks to do this
>>
>> You have
>>
>> 1 10
>> 1 12
>> 1 15
>> 1 16
>> 2 1
>> 2 2
>> 2 3
>> 2 6
>>
>> You want your output to be the difference between the successive numbers
>> in
>> the second column, ie
>>
>> 1 (10,0)
>> 1 (12,2)
>> 1 (15,3)
>> 1 (16,1)
>> 2 (1,0)
>> 2 (2,1)
>> 2 (3,1)
>> 2 (6,3)
>>
>> Obviously, I need to write a UDF to do this, but I have a couple of
>> questions:
>>
>> 1) if we know for a fact that the rows for a given first column will
>> ALWAYS
>> be on the same node, do we need to do anything to take advantage of that?
>> My
>> assumption would be that the group operation would be smart enough to take
>> care of this, but I am not sure how it avoids checking to make sure that
>> other nodes don't have additional info (even if I can say for a fact that
>> they don't). Then again, given replication of data I guess if you do an
>> operation on the grouped data it might still try and distribute that over
>> the filesystem?
>>
>
> First, whether they are located on the same node does not matter.  What
> matters is whether they will all be in the same split when the maps are
> started.  If they are stored in an HDFS file this usually means that they
> are all in the same block.
>
> Group by cannot know a priori that all values of the key will be located in
> the same split.  As of Pig 0.7 you can tell Pig this by saying "using
> 'collected'" after the group by statement.  See
> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#GROUP for exact
> syntax and restrictions. This tells Pig to do the grouping in the map phase
> since it does not need to do a shuffle and reduce to collect all the keys
> together.
>
>
>
>> 2) The number of values in the second column can potentially be large, and
>> I
>> want this process to be quick, so what's the best way to implement it?
>> Naively I would say to group everything, then pass that bag to a UDF which
>> sorts, does the calculation, and then returns a new bag with the tuples.
>> This doesn't seem like it is taking advantage of a distributed
>> framework... would splitting it up into two UDFs, one which sorts the bag,
>> and
>> then another which returns the tuples (and now that it's sorted, you could
>> distribute it better), be better?
>>
>
> B = group A by firstfield;
> C = foreach B {
>        C1 = order A by secondfield;
>        generate group, yourudf(C1);
> }
>
> The order inside the foreach will order each collection by the second
> field, so there's no need to write a UDF for that.  In fact Pig will take
> advantage of the secondary sort in MR so that there isn't even a separate
> sorting pass over the data.  yourudf should then implement the Accumulator
> interface so that it receives the records in sorted batches rather than
> having to materialize the whole bag in memory at once.
>
> Alan.
>
>
>
>> I'm trying to avoid writing my own MR (as I never have before), but am not
>> averse to it if necessary. I am just not sure of how to get pig to do it
>> as
>> efficiently as (I think) it can be done.
>>
>> I appreciate your help!
>> Jon
>>
>
>
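For reference, here is a minimal Python sketch of the logic Alan's plan implies: the per-group sort happens in Pig (the ORDER inside the FOREACH), and the UDF just walks the already-sorted values emitting (value, delta) pairs. The names (successive_diffs, SuccessiveDiffs) are hypothetical, and the Accumulator class is only a rough Python rendering of what is actually a Java interface in Pig; a real Jython UDF would also need an @outputSchema decorator and a REGISTER statement.

```python
# Hypothetical sketch of the successive-difference UDF discussed in this
# thread. In a real Pig deployment this would live in a Jython script and
# be registered with something like:
#   register 'diffs.py' using jython as diffs;

def successive_diffs(sorted_values):
    """Given one group's values, already sorted ascending (the ORDER
    inside the FOREACH guarantees this), return a list of
    (value, delta_from_previous) tuples; the first delta is 0."""
    out = []
    prev = None
    for v in sorted_values:
        out.append((v, 0 if prev is None else v - prev))
        prev = v
    return out

class SuccessiveDiffs(object):
    """Rough Python analogue of Pig's (Java) Accumulator contract:
    batches of sorted records arrive via accumulate(), and only the
    running state is kept between batches, not the whole bag."""
    def __init__(self):
        self.prev = None
        self.out = []

    def accumulate(self, batch):
        # Each batch arrives already sorted thanks to the secondary sort.
        for v in batch:
            self.out.append((v, 0 if self.prev is None else v - self.prev))
            self.prev = v

    def get_value(self):
        return self.out

# The sample data from the thread, grouped by the first column:
groups = {1: [10, 12, 15, 16], 2: [1, 2, 3, 6]}
for key in sorted(groups):
    for pair in successive_diffs(sorted(groups[key])):
        print(key, pair)
```

On the sample data this reproduces the expected output above: (10,0), (12,2), (15,3), (16,1) for key 1, and (1,0), (2,1), (3,1), (6,3) for key 2.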
