Oh, sorry. It seems that my script worked only when there is exactly one
group. Basically, with
withCounts.Count
I wanted to access the Count field of the row being processed, which should be
a single value per row. Instead, withCounts.Count seems to access the outer
context, i.e. the whole withCounts relation, and sees many rows there.
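
To make the difference concrete, here is a rough model of the two readings in
Python (not Pig). It only mimics the behaviour described in this thread and is
not how Pig implements it. The names with_counts, trim_last and scalar, and the
sample rows, are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample data: (key, date, letter)
rows = [
    ("group_1", "2012-12-23", "c"),
    ("group_1", "2012-12-15", "a"),
    ("group_1", "2012-12-17", "a"),
    ("group_2", "2013-01-02", "b"),
    ("group_2", "2013-01-01", "b"),
]


def with_counts(rows):
    """Group rows by key and attach the bag size, like GROUP + COUNT."""
    out = []
    by_key = sorted(rows, key=itemgetter(0))
    for key, grp in groupby(by_key, key=itemgetter(0)):
        bag = list(grp)
        out.append({"group": key, "bag": bag, "count": len(bag)})
    return out


def scalar(relation, field):
    """Model of reading a whole relation as a scalar: only valid for one row."""
    if len(relation) != 1:
        raise ValueError("Scalar has more than one row in the output")
    return relation[0][field]


def trim_last(counted):
    """Intended behaviour: use each row's own count to drop its last item."""
    result = []
    for row in counted:
        ordered = sorted(row["bag"], key=itemgetter(1))  # order by date
        keep = max(row["count"] - 1, 0)
        result.append((row["group"], ordered[:keep]))
    return result


counted = with_counts(rows)
trimmed = trim_last(counted)  # per-row count: works for any number of groups
```

With a single group, scalar(counted, "count") would return that group's count,
which is why my test looked fine. With two or more groups it raises, which
matches the "Scalar has more than one row" error quoted below.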
Maybe someone else has an idea?


On Thu, Mar 14, 2013 at 12:58 AM, Tim Chan <tc...@edmunds.com> wrote:

> Hi Ruslan,
>
> I'm using the trunk version of Pig.
>
> For the following script:
>
> test = LOAD '$test' USING PigStorage('\t') AS
>     ( visitor:chararray,
>       submodelid:long,
>       record_datetime:chararray );
>
> test_grp = group test by visitor;
>
> -- add counts of each bag
> test_grp_cnt = foreach test_grp
>     generate
>         *,
>         COUNT(test) as submodel_count;
>
>
> smp = filter test_grp_cnt by submodel_count < 2;
> dump smp;
>
>
> -- remove the last item in the bag after sorting
> test_last_removed = FOREACH test_grp_cnt {
>     ordered = ORDER test BY record_datetime ASC;
>     last_removed = LIMIT ordered (test_grp_cnt.submodel_count - 1);
>     --last_removed = LIMIT ordered 3;
>
>     GENERATE
>         group as visitor,
>         last_removed;
> }
>
>
> I get the following error:
>
> ERROR 1066: Unable to open iterator for alias test_last_removed_smp.
> Backend error : Scalar has more than one row in the output. 1st :
> (uc3:3,{(uc3:3,200410586,2013-02-06 09:18:22),(uc3:3,200437662,2013-02-06
> 08:58:25),(uc3:3,200414442,2013-02-06 09:04:24)},3), 2nd
> :(S:382290531917004,{(S:382290531917004,200442423,2013-02-01
> 21:15:58),(S:382290531917004,200409672,2013-02-01
> 21:29:45),(S:382290531917004,200443484,2013-02-01 21:24:19)},3)
>
> The error is not present when I comment out the "last_removed..." line and
> uncomment the one below it.
>
>
>
>
> On Tue, Mar 12, 2013 at 8:06 PM, Ruslan Al-Fakikh <metarus...@gmail.com> wrote:
>
> > Hi Chan,
> >
> > Your task seems to be non-trivial in Pig. Basically, bags are not ordered,
> > so you have to either sort first or decide exactly which tuple you want to
> > remove. Some ways to solve the problem:
> > 1) You can use the TOP builtin UDF, which basically does the opposite; I
> > am not sure whether it will suit you from a performance point of view.
> > 2) You can try something like this:
> > inputData = LOAD 'input' AS (key: chararray, date: chararray, letter:
> > chararray);
> > grouped = GROUP inputData BY key;
> > DESCRIBE grouped;
> > DUMP grouped;
> > withCounts = FOREACH grouped GENERATE *, COUNT(inputData) AS Count;
> > DESCRIBE withCounts;
> > DUMP withCounts;
> > trimmed = FOREACH withCounts {
> >         ordered = ORDER inputData BY key;
> >         limited = LIMIT ordered (withCounts.Count - 1);
> >         GENERATE
> >                 group,
> >                 limited;
> > }
> > DESCRIBE trimmed;
> > DUMP trimmed;
> >
> > This is what I got when I ran it on Pig 0.10:
> >
> > grouped: {group: chararray,inputData: {(key: chararray,date: chararray,letter: chararray)}}
> > (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)})
> >
> > withCounts: {group: chararray,inputData: {(key: chararray,date: chararray,letter: chararray)},Count: long}
> > (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)},3)
> >
> > trimmed: {group: chararray,limited: {(key: chararray,date: chararray,letter: chararray)}}
> > (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a)})
> >
> > I am not sure whether it will perform well. Let me know if it helps.
> >
> > Best Regards,
> > Ruslan Al-Fakikh
> >
> >
> > On Wed, Mar 13, 2013 at 4:40 AM, Johnny Zhang <xiao...@cloudera.com>
> > wrote:
> >
> > > Hi, Chan:
> > > That's fine. How did you generate the bags with different sizes at
> > > runtime? It will be easier for me to come up with a solution with this
> > > information.
> > > Thanks.
> > >
> > > Johnny
> > >
> > >
> > > On Tue, Mar 12, 2013 at 5:28 PM, Chan, Tim <tc...@edmunds.com> wrote:
> > >
> > > > Hi Johnny,
> > > >
> > > > I forgot to mention the bag will be of varying sizes, so I cannot use
> > > > the method you described.
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Mar 12, 2013 at 4:50 PM, Johnny Zhang <xiao...@cloudera.com>
> > > > wrote:
> > > >
> > > > > Hi, Chan:
> > > > > I guess you might generate the bag like this
> > > > > A = load 'test.txt' as (f1:chararray,f2:chararray,f3:chararray);
> > > > > B = group A by f1;
> > > > > C = foreach B generate *;
> > > > > describe C;
> > > > > C: {group: chararray,{(f1: chararray)},{(f2: chararray)},{(f3: chararray)}}
> > > > >
> > > > > if this is the case, you can do:
> > > > > A = load 'test.txt' as (f1:chararray,f2:chararray,f3:chararray);
> > > > > B = group A by f1;
> > > > > C = foreach B generate group, A.f1, A.f2;
> > > > > describe C;
> > > > > C: {group: chararray,{(f1: chararray)},{(f2: chararray)}}
> > > > >
> > > > > Does this make sense? Otherwise, can you share the script that
> > > > > generates the bag?
> > > > >
> > > > > Johnny Zhang
> > > > >
> > > > >
> > > > > On Tue, Mar 12, 2013 at 4:33 PM, Chan, Tim <tc...@edmunds.com> wrote:
> > > > >
> > > > > > How do I remove the last item in a bag?
> > > > > >
> > > > > > For example:
> > > > > >
> > > > > > (group_1,{(2012-12-15,a),(2012-12-17,a),(2012-12-23,c)})
> > > > > >
> > > > > >
> > > > > > I would like to remove the last item so that the following is the
> > > > > > result:
> > > > > >
> > > > > >
> > > > > > (group_1,{(2012-12-15,a),(2012-12-17,a)})
> > > > > >
> > > > >
> > > >
> > >
> >
>
