@Mehmet... great hack! I like it :-P
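
For anyone skimming the thread, here is a minimal Python sketch of the idea behind the hack quoted below: tag each element of the inner bag with a random key, sort on that key, keep the first N, then drop the key. This is not Pig code and not a UDF. The helper name `sample_bag` and the sample size of 1 are my own assumptions for illustration, and the rows are copied from William's example.

```python
import random


def sample_bag(bag, k, rng=random):
    """Return up to k items picked uniformly at random from bag.

    Hypothetical helper mirroring the steps of the Pig snippet:
    generate RANDOM() as rnd, order by rnd, limit, then drop rnd.
    """
    # Attach a random key to every item (RANDOM() as rnd, *).
    keyed = [(rng.random(), item) for item in bag]
    # Sort on the random key only (order y by rnd).
    keyed.sort(key=lambda pair: pair[0])
    # Keep the first k and strip the key (limit, then generate $1 ..).
    return [item for _, item in keyed[:k]]


# Rows shaped like the relation in the original question.
rows = [
    ("id1", "att1", [("a", 0.01), ("b", 0.02), ("x", 0.999749968742)]),
    ("id1", "att2", [("a", 0.03), ("b", 0.04), ("x", 0.998749217772)]),
    ("id2", "att1", [("b", 0.05), ("c", 0.06), ("x", 0.996945334509)]),
]

# One element per bag, i.e. roughly 1/3 of each three-element bag.
sampled = [(id_, att, sample_bag(pairs, 1)) for id_, att, pairs in rows]
```

The sort makes this O(n log n) per bag, which is the cost Mehmet warns about for huge bags. A single-pass scheme such as reservoir sampling is the sort of thing a custom UDF could do instead.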
On Tue, May 27, 2014 at 5:08 PM, Mehmet Tepedelenlioglu <mehmets...@yahoo.com> wrote:

> If you know how many items you want from each inner bag exactly, you can
> hack it like this:
>
> x = foreach x {
>     y = foreach x generate RANDOM() as rnd, *;
>     y = order y by rnd;
>     y = limit y $SAMPLE_NUM;
>     y = foreach y generate $1 ..;
>     generate group, y;
> }
>
> Basically randomize the inner bag, sort it wrt the random number and limit
> it to the sample size you want. No reducers needed.
> If the inner bags are huge, ordering will obviously be expensive. If you
> don’t like this, you might have to write your own udf.
>
> Mehmet
>
> On May 27, 2014, at 10:03 AM, <william.dowl...@thomsonreuters.com>
> <william.dowl...@thomsonreuters.com> wrote:
>
> > Hi Pig users,
> >
> > Is there an easy/efficient way to sample an inner bag? For example, with
> > input in a relation like
> >
> > (id1,att1,{(a,0.01),(b,0.02),(x,0.999749968742)})
> > (id1,att2,{(a,0.03),(b,0.04),(x,0.998749217772)})
> > (id2,att1,{(b,0.05),(c,0.06),(x,0.996945334509)})
> >
> > I’d like to sample 1/3 the elements of the bags, and get something like
> > (ignoring the non-determinism)
> >
> > (id1,att1,{(x,0.999749968742)})
> > (id1,att2,{(b,0.04)})
> > (id2,att1,{(b,0.05)})
> >
> > I have a circumlocution that seems to work using flatten + group but that
> > looks ugly to me:
> >
> > tfidf1 = load '$tfidf' as (id: chararray,
> >                            att: chararray,
> >                            pairs: {pair: (word: chararray, value: double)});
> >
> > flat_tfidf = foreach tfidf1 generate id, att, FLATTEN(pairs);
> > sample_flat_tfidf = sample flat_tfidf 0.33;
> > tfidf2 = group sample_flat_tfidf by (id, att);
> >
> > tfidf = foreach tfidf2 {
> >     pairs = foreach sample_flat_tfidf generate pairs::word, pairs::value;
> >     generate group.id, group.att, pairs;
> > };
> >
> > Can someone suggest a better way to do this? Many thanks!
> >
> > William F Dowling
> > Senior Technologist
> >
> > Thomson Reuters