Right, that's a good point; it is a fundamentally non-parallelizable
process. I should probably just pipe it through a script, since even an
entire century of data would be under a million hourly rows and wouldn't
really need to take advantage of the cluster. I seem to recall there's
some pretty good functionality for that, so I just need to look it up in
the documentation again.
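
In case it helps whoever finds this in the archives, here's roughly what
I have in mind, assuming the functionality I'm half-remembering is the
STREAM operator; this is untested, and the script name and field order
are just placeholders:

    #!/usr/bin/env python
    # running_total.py: read tab-separated (hour, collected) lines on
    # stdin, write (hour, running total) lines on stdout.
    import sys

    total = 0
    for line in sys.stdin:
        hour, collected = line.rstrip('\n').split('\t')
        total += int(collected)
        sys.stdout.write('%s\t%d\n' % (hour, total))

and then in the Pig script:

    DEFINE running `python running_total.py` SHIP ('running_total.py');
    -- PARALLEL 1 forces a single reducer, so one copy of the script
    -- sees the whole relation in hour order
    ordered = ORDER hourlyTotals BY hour PARALLEL 1;
    cumulativeTotals = STREAM ordered THROUGH running
                           AS (hour:int, collected:long);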

Thanks,
Kris

On Fri, Dec 17, 2010 at 03:22:53PM -0800, Dmitriy Ryaboy wrote:
> What you are suggesting seems to be a fundamentally single-threaded process
> (well, it can be parallelized, but it's not pretty and involves multiple
> passes), so it's not a good fit for the map-reduce paradigm (how would you
> do cumulative totals for 25 billion entries?). Pig tends to avoid
> implementing operations that restrict scalability like this. Your
> idea of streaming through a script would work; you could also write a
> cumulative-sum UDF and use it on the result of doing a GROUP ALL on your
> relation.
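> 
> Very roughly, and untested, that might look like the sketch below (the
> UDF name and schema are invented, and I'm writing it as a Pig 0.8
> Python/Jython UDF for brevity; a Java EvalFunc would work the same way):
> 
>     # udfs.py
>     @outputSchema("totals:bag{t:(hour:int, running:long)}")
>     def running_total(bag):
>         # the bag arrives as a list of (hour, collected) tuples;
>         # sort by hour, then emit a running sum
>         out = []
>         total = 0
>         for hour, collected in sorted(bag):
>             total += collected
>             out.append((hour, total))
>         return out
> 
>     -- in the Pig script: GROUP ALL funnels the whole relation into
>     -- one bag (i.e. one reducer), which is the single-threaded part
>     REGISTER 'udfs.py' USING jython AS myudfs;
>     grouped = GROUP hourlyTotals ALL;
>     cumulativeTotals = FOREACH grouped GENERATE
>                            FLATTEN(myudfs.running_total(hourlyTotals));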
> 
> -Dmitriy
> 
> On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <k...@melon.org> wrote:
> 
> > Hello,
> >
> > Is there some sort of mechanism by which I could cause a value to
> > accumulate within a relation? What I'd like to do is something along the
> > lines of having a long called accumulator, and an outer bag called
> > hourlyTotals with a schema of (hour:int, collected:int)
> >
> > accumulator = 0L; -- I know this line doesn't work
> > hourlyTotals = ORDER hourlyTotals BY hour;
> > cumulativeTotals = FOREACH hourlyTotals {
> >                        accumulator += collected;
> >                        GENERATE hour, accumulator AS collected;
> >                        }
> >
> > Could something like this be made to work? Is there something similar that
> > I can do instead? Do I just need to pipe the relation through an
> > external script to get what I want?
> >
> > Thanks,
> > Kris
> >
> > --
> > Kris Coward                                     http://unripe.melon.org/
> > GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3
> >
