Chan,

Sorry, I meant

ordered = ORDER inputData BY date;

not

ordered = ORDER inputData BY key;

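
With that change, the nested block from the script quoted below would read roughly as follows. This is a sketch, not a re-tested script: it assumes the same aliases as the quoted script (inputData, withCounts, Count) and that the date strings are in yyyy-MM-dd form, so that sorting the chararray dates in ascending order puts the latest tuple last.

```pig
-- Sketch only: aliases and the LIMIT expression are taken from the quoted script.
trimmed = FOREACH withCounts {
    -- Sort each group's bag by date so that the "last" tuple is well defined.
    ordered = ORDER inputData BY date;
    -- Keep all but the final (latest) tuple of the sorted bag.
    limited = LIMIT ordered (withCounts.Count - 1);
    GENERATE group, limited;
}
```

For the sample data the result should not change, since every tuple in the group shares the same key and the dates already appear in ascending order; sorting by date only makes the choice of which tuple gets dropped explicit.
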
On Wed, Mar 13, 2013 at 7:06 AM, Ruslan Al-Fakikh <[email protected]> wrote:

> Hi Chan,
>
> Your tasks seems to be not trivial in Pig. Basically bags are not ordered,
> so you have to either sort before or to decide what tuple you want to
> remove exactly. Some ways to solve the problem:
> 1) You can use the TOP builtin UDF which basically does the opposite and I
> am not sure whether it will suit you from the performance point of view
> 2) You can try something like this:
>
> inputData = LOAD 'input' AS (key: chararray, date: chararray, letter: chararray);
> grouped = GROUP inputData BY key;
> DESCRIBE grouped;
> DUMP grouped;
> withCounts = FOREACH grouped GENERATE *, COUNT(inputData) AS Count;
> DESCRIBE withCounts;
> DUMP withCounts;
> trimmed = FOREACH withCounts {
>     ordered = ORDER inputData BY key;
>     limited = LIMIT ordered (withCounts.Count - 1);
>     GENERATE
>         group,
>         limited;
> }
> DESCRIBE trimmed;
> DUMP trimmed;
>
> This is what I got when run on Pig 0.10:
>
> grouped: {group: chararray,inputData: {(key: chararray,date: chararray,letter: chararray)}}
>
> (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)})
>
> withCounts: {group: chararray,inputData: {(key: chararray,date: chararray,letter: chararray)},Count: long}
>
> (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)},3)
>
> trimmed: {group: chararray,limited: {(key: chararray,date: chararray,letter: chararray)}}
> (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a)})
>
> I am not sure whether it will perform well. Let me know if it helps.
>
> Best Regards,
> Ruslan Al-Fakikh
>
>
> On Wed, Mar 13, 2013 at 4:40 AM, Johnny Zhang <[email protected]> wrote:
>
>> Hi, Chan:
>> That's fine. How did you generate the bag with different size in runtime.
>> It will be easier for me to come out a solution by this information.
>> Thanks.
>>
>> Johnny
>>
>>
>> On Tue, Mar 12, 2013 at 5:28 PM, Chan, Tim <[email protected]> wrote:
>>
>> > Hi Johnny,
>> >
>> > I forgot to mention the bag will be varying sizes, so I can not use the
>> > method you described.
>> >
>> >
>> > On Tue, Mar 12, 2013 at 4:50 PM, Johnny Zhang <[email protected]> wrote:
>> >
>> > > Hi, Chan:
>> > > I guess you might generate the bag like this
>> > > A = load 'test.txt' as (f1:chararray,f2:chararray,f3:chararray);
>> > > B = group A by f1;
>> > > C = foreach B generate *;
>> > > describe C;
>> > > C: {group: chararray,{(f1: chararray)},{(f2: chararray)},{(f3: chararray)}}
>> > >
>> > > if this is the case, you can do:
>> > > A = load 'test.txt' as (f1:chararray,f2:chararray,f3:chararray);
>> > > B = group A by f1;
>> > > C = foreach B generate group, A.f1, A.f2;
>> > > describe C;
>> > > C: {group: chararray,{(f1: chararray)},{(f2: chararray)}}
>> > >
>> > > does this make sense? otherwise can you share your script which generates
>> > > the bag?
>> > >
>> > > Johnny Zhang
>> > >
>> > >
>> > > On Tue, Mar 12, 2013 at 4:33 PM, Chan, Tim <[email protected]> wrote:
>> > >
>> > > > How do I remove the last item in a bag.
>> > > >
>> > > > For example:
>> > > >
>> > > > (group_1,{(2012-12-15,a),(2012-12-17,a),(2012-12-23,c)})
>> > > >
>> > > >
>> > > > I would like to remove the last item so that the following is the result:
>> > > >
>> > > >
>> > > > (group_1,{(2012-12-15,a),(2012-12-17,a)})
>> > > >
