@Mehmet... great hack! I like it :-P

On Tue, May 27, 2014 at 5:08 PM, Mehmet Tepedelenlioglu <mehmets...@yahoo.com> wrote:

> If you know how many items you want from each inner bag exactly, you can
> hack it like this:
>
> x = foreach x {
>     y = foreach x generate RANDOM() as rnd, *;
>     y = order y by rnd;
>     y = limit y $SAMPLE_NUM;
>     y = foreach y generate $1 ..;
>     generate group, y;
> }
>
> Basically: randomize the inner bag, sort it by the random number, and limit
> it to the sample size you want. No reducers needed.
> If the inner bags are huge, the ordering will obviously be expensive. If you
> don’t like this, you might have to write your own UDF.
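
For the tfidf1 relation from William's original message (quoted further down), the
same trick should look roughly like this -- untested sketch, with $SAMPLE_NUM
standing in for the number of pairs to keep per bag. Since pairs is already an
inner bag of tfidf1, no GROUP is needed (Mehmet's generic version assumes a grouped
relation, hence the "generate group, y"). Nested FOREACH/ORDER/LIMIT need a
reasonably recent Pig (0.10 or later, if I remember right):

tfidf_sampled = foreach tfidf1 {
    -- tag each (word, value) pair with a random sort key
    shuffled = foreach pairs generate RANDOM() as rnd, *;
    -- shuffle the bag by ordering on that key
    shuffled = order shuffled by rnd;
    -- keep the first $SAMPLE_NUM pairs
    sampled = limit shuffled $SAMPLE_NUM;
    -- drop the random key again
    sampled = foreach sampled generate $1 ..;
    generate id, att, sampled;
};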
>
> Mehmet
>
> On May 27, 2014, at 10:03 AM, <william.dowl...@thomsonreuters.com> wrote:
>
> > Hi Pig users,
> >
> > Is there an easy/efficient way to sample an inner bag? For example, with
> > input in a relation like
> >
> > (id1,att1,{(a,0.01),(b,0.02),(x,0.999749968742)})
> > (id1,att2,{(a,0.03),(b,0.04),(x,0.998749217772)})
> > (id2,att1,{(b,0.05),(c,0.06),(x,0.996945334509)})
> >
> > I’d like to sample 1/3 of the elements of the bags, and get something like
> > (ignoring the non-determinism):
> > (id1,att1,{(x,0.999749968742)})
> > (id1,att2,{(b,0.04)})
> > (id2,att1,{(b,0.05)})
> >
> > I have a circumlocution that seems to work using FLATTEN + GROUP, but it
> > looks ugly to me:
> >
> > tfidf1 = load '$tfidf' as (id: chararray,
> >                            att: chararray,
> >                            pairs: {pair: (word: chararray, value: double)});
> >
> > flat_tfidf = foreach tfidf1 generate id, att, FLATTEN(pairs);
> > sample_flat_tfidf = sample flat_tfidf 0.33;
> > tfidf2 = group sample_flat_tfidf by (id, att);
> >
> > tfidf = foreach tfidf2 {
> >   pairs = foreach sample_flat_tfidf generate pairs::word, pairs::value;
> >   generate group.id, group.att, pairs;
> > };
> >
> > Can someone suggest a better way to do this?  Many thanks!
> >
> > William F Dowling
> > Senior Technologist
> >
> > Thomson Reuters
> >
> >
> >
>
>
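
And since William actually asked for a fraction (roughly 1/3 of each bag) rather
than a fixed count, another option that avoids both the sort and the flatten/regroup
round trip might be a nested FILTER with RANDOM(). Also untested, and worth noting
that each pair is kept independently with probability 0.33, so the per-bag sample
sizes will vary from run to run (and a bag may occasionally come back empty):

tfidf_sampled = foreach tfidf1 {
    -- keep each (word, value) pair with probability ~0.33
    kept = filter pairs by RANDOM() <= 0.33;
    generate id, att, kept;
};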
