I believe what you're trying to do is this: you have some sort of data, and a 
timestamp.


What you want to figure out is how many times each possible value of "data" 
appears in a certain time period (say, hourly).


Let's say the data can take three possible string values: {'a', 'b', 'c'}.


For convenience's sake, your timestamp is a Unix UTC timestamp or an 
ISO-formatted date (I would strongly recommend using one of these, since 
Piggybank already has functions to slice and dice them).


To count all the times the value 'a' appeared in a given hour, you would do 
something like this:


-- register piggybank.jar for the ISO date functions
REGISTER ./piggybank.jar;
DEFINE UnixToISO org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO();
DEFINE ISOToHour org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToHour();
allData = LOAD ... AS (string:chararray, ts:long);
-- convert ts to an ISO date, truncated to the hour
allDataISODates = FOREACH allData GENERATE string, ISOToHour(UnixToISO(ts)) AS isoHour;
-- group by string and hour
groupedByStringAndHour = GROUP allDataISODates BY (string, isoHour);
-- count the rows in each group
stringHourCounts = FOREACH groupedByStringAndHour GENERATE group.string AS string,
    group.isoHour AS isoHour, COUNT(allDataISODates) AS count;


You will now have a relation that looks like:
{'a', '2010-12-13T12:00:00', 2334}
{'b', '2010-12-13T12:00:00', 123}
{'c', '2010-12-13T12:00:00', 3}
{'a', '2010-12-13T13:00:00', 34231}
{'b', '2010-12-13T13:00:00', 34}
{'c', '2010-12-13T13:00:00', 134}
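(If it helps to see the same logic outside Pig, here's a rough Python equivalent 
of the script above; the sample records and timestamps are made up for 
illustration.)

```python
from collections import Counter
from datetime import datetime, timezone

# (data, unix_ts) records, as in the example above (sample values made up)
records = [("a", 1292241600), ("a", 1292241660), ("b", 1292245200)]

counts = Counter()
for value, ts in records:
    # truncate the timestamp to the hour, like ISOToHour(UnixToISO(ts))
    hour = datetime.fromtimestamp(ts, tz=timezone.utc).replace(
        minute=0, second=0, microsecond=0)
    counts[(value, hour.isoformat())] += 1

for (value, hour), n in sorted(counts.items()):
    print(value, hour, n)
```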


Is that the sort of thing you're looking to do?

-Zach


On Friday, December 17, 2010 at 6:22 PM, Dmitriy Ryaboy wrote:

> What you are suggesting seems to be a fundamentally single-threaded process
> (well, it can be parallelized, but it's not pretty and involves multiple
> passes), so it's not a good fit for the map-reduce paradigm (how would you
> do accumulative totals for 25 billion entries?). Pig tends to avoid
> implementing methods that restrict scaling computations in this way. Your
> idea of streaming through a script would work; you could also write an
> accumulative UDF and use it on the result of doing a GROUP ALL on your
> relation.
> 
> -Dmitriy
> 
> On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <k...@melon.org> wrote:
> 
> 
> >  Hello,
> > 
> >  Is there some sort of mechanism by which I could cause a value to
> >  accumulate within a relation? What I'd like to do is something along the
> >  lines of having a long called accumulator, and an outer bag called
> >  hourlyTotals with a schema of (hour:int, collected:int)
> > 
> >  accumulator = 0L; -- I know this line doesn't work
> >  ORDER hourlyTotals BY collected;
> >  cumulativeTotals = FOREACH hourlyTotals {
> >  accumulator += collected;
> >  GENERATE day, accumulator AS collected;
> >  }
> > 
> >  Could something like this be made to work? Is there something similar that
> >  I can do instead? Do I just need to pipe the relation through an
> >  external script to get what I want?
> > 
> >  Thanks,
> >  Kris
> > 
> >  --
> >  Kris Coward http://unripe.melon.org/
> >  GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
> > 
> > 
> > 
> 
> 
> 
> 
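(For what it's worth, the external-script route mentioned above is simple in 
practice. Here's an illustrative Python sketch of a running total over hourly 
records sorted by hour; the tab-separated (hour, collected) layout is an 
assumption, matching Pig's default STORE output, not something from the thread.)

```python
# Minimal streaming cumulative-total script: read (hour, collected) pairs
# sorted by hour on stdin, emit (hour, running_total) on stdout.
import sys

def cumulative(lines):
    total = 0
    for line in lines:
        hour, collected = line.rstrip("\n").split("\t")
        total += int(collected)  # accumulate across all rows seen so far
        yield f"{hour}\t{total}"

if __name__ == "__main__":
    for out in cumulative(sys.stdin):
        print(out)
```

Because it streams one sorted pass, it stays single-threaded, which is exactly 
the scaling concern raised above.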


