Thanks for the help Alan, I really appreciate it. Can you currently implement interfaces in Python UDFs? I am not super familiar with how Jython and Pig interact in that capacity.
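For what it's worth, here is a sketch of what the differencing logic might look like as a script-style Python (Jython) UDF, per Pig 0.8's scripting UDF support. This doesn't settle the interface question above; it only shows a plain function-style UDF. The `outputSchema` decorator is supplied by Pig's Jython wrapper at runtime (stubbed below so the script also runs standalone), and the function name `successive_diffs` is just made up for illustration:

```python
# Illustrative sketch only (not a tested Pig UDF). Assumes Pig 0.8's
# Jython scripting support; @outputSchema is provided by Pig when the
# script is registered, so we stub it for standalone runs.
try:
    outputSchema  # provided by Pig's Jython wrapper at registration time
except NameError:
    def outputSchema(schema):          # stub so plain Python can run this
        def deco(fn):
            return fn
        return deco

@outputSchema("diffs:bag{t:(value:long,diff:long)}")
def successive_diffs(bag):
    """Given a bag of 1-field tuples already sorted on that field,
    return (value, difference-from-previous) pairs; first diff is 0."""
    out = []
    prev = None
    for (value,) in bag:
        out.append((value, 0 if prev is None else value - prev))
        prev = value
    return out
```

If I understand the 0.8 docs, this would be registered with something like `register 'diffs.py' using jython as myfuncs;` and called on the sorted inner bag.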
The internal sort in the foreach and the using 'collected' (assuming I can get it to work :) should be big wins.

2011/1/4 Alan Gates <ga...@yahoo-inc.com>

> Answers inline.
>
> On Jan 4, 2011, at 11:10 AM, Jonathan Coveney wrote:
>
>> I wasn't quite sure what to title this, but hopefully it'll make sense.
>> I have a couple of questions relating to a query that ultimately seeks
>> to do this. You have:
>>
>> 1 10
>> 1 12
>> 1 15
>> 1 16
>> 2 1
>> 2 2
>> 2 3
>> 2 6
>>
>> You want your output to be the difference between the successive
>> numbers in the second column, i.e.
>>
>> 1 (10,0)
>> 1 (12,2)
>> 1 (15,3)
>> 1 (16,1)
>> 2 (1,0)
>> 2 (2,1)
>> 2 (3,1)
>> 2 (6,3)
>>
>> Obviously, I need to write a UDF to do this, but I have a couple of
>> questions:
>>
>> 1) If we know for a fact that the rows for a given first column will
>> ALWAYS be on the same node, do we need to do anything to take advantage
>> of that? My assumption would be that the group operation would be smart
>> enough to take care of this, but I am not sure how it avoids checking to
>> make sure that other nodes don't have additional info (even if I can say
>> for a fact that they don't). Then again, given replication of data, I
>> guess if you do an operation on the grouped data it might still try to
>> distribute that over the filesystem?
>
> First, whether they are located on the same node does not matter. What
> matters is whether they will all be in the same split when the maps are
> started. If they are stored in an HDFS file, this usually means that
> they are all in the same block.
>
> Group by cannot know a priori that all values of the key will be located
> in the same split. As of Pig 0.7 you can tell Pig this by saying "using
> 'collected'" after the group by statement. See
> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#GROUP for exact
> syntax and restrictions.
> This tells Pig to do the grouping in the map phase, since it does not
> need to do a shuffle and reduce to collect all the keys together.
>
>> 2) The number of values in the second column can potentially be large,
>> and I want this process to be quick, so what's the best way to implement
>> it? Naively, I would say to group everything, then pass that bag to a
>> UDF which sorts, does the calculation, and then returns a new bag with
>> the tuples. This doesn't seem like it is taking advantage of a
>> distributed framework... would splitting it up into two UDFs, one which
>> sorts the bag, and then another which returns the tuples (now that it's
>> sorted, you could distribute it better), be better?
>
> B = group A by firstfield;
> C = foreach B {
>     C1 = order A by secondfield;
>     generate group, yourudf(C1);
> }
>
> The order inside the foreach will order each collection by the second
> field, so there's no need to write a UDF for that. In fact, Pig will
> take advantage of the secondary sort in MR so that there isn't even a
> separate sorting pass over the data. yourudf should then implement the
> Accumulator interface so that it will receive the sorted records in
> batches.
>
> Alan.
>
>> I'm trying to avoid writing my own MR (as I never have before), but am
>> not averse to it if necessary. I am just not sure how to get Pig to do
>> it as efficiently as (I think) it can be done.
>>
>> I appreciate your help!
>> Jon
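To make sure I follow the Accumulator suggestion: Pig's real interface is Java (org.apache.pig.Accumulator, with accumulate()/getValue()/cleanup(), if I'm reading the javadoc right), but here is a standalone Python analogue I put together to convince myself of the shape; the class name and everything in it are mine, not Pig's, and it just mirrors how sorted records would arrive in batches rather than as one whole bag:

```python
# Standalone analogue of the Accumulator pattern (NOT Pig's Java API).
# Mirrors accumulate()/getValue()/cleanup(): input arrives in sorted
# batches per group instead of as one materialized bag.
class SuccessiveDiffs:
    def __init__(self):
        self._prev = None   # last value seen across batch boundaries
        self._out = []      # (value, diff) pairs built up so far

    def accumulate(self, batch):
        """Called once per batch of (value,) tuples, in sorted order."""
        for (value,) in batch:
            diff = 0 if self._prev is None else value - self._prev
            self._out.append((value, diff))
            self._prev = value

    def get_value(self):
        """Called after the last batch for a group; returns the result."""
        return self._out

    def cleanup(self):
        """Reset state so the instance can be reused for the next group."""
        self._prev = None
        self._out = []
```

The point being that `self._prev` carries the running state across batch boundaries, which is what lets Pig avoid materializing the whole sorted bag before calling the UDF.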