Well, for the step you're describing (which I need to do as a preliminary
step to accumulating the hours), I just do something in the vein of

NewRel = GROUP OldRel BY timestamp / 3600;
HourlyRel = FOREACH NewRel GENERATE group AS hour, OldRel.something AS
something, ...;

(Note that timestamp is stored as a long, so the division is integer
division and the GROUP buckets by hour as intended.)
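The integer-division trick is easy to sanity-check outside Pig. Here's a minimal Python sketch of the same bucketing logic (the sample timestamps are made up for illustration):

```python
# Bucket epoch-second timestamps into hours by integer division,
# mirroring the GROUP ... BY timestamp / 3600 step above.
from collections import defaultdict

def bucket_by_hour(records):
    """records: iterable of (timestamp_seconds, value) pairs."""
    buckets = defaultdict(list)
    for ts, value in records:
        buckets[ts // 3600].append(value)  # integer division, like Pig longs
    return dict(buckets)

# Two timestamps in the same hour, one in the next hour (made-up data).
sample = [(7200, "a"), (7260, "b"), (10800, "c")]
print(bucket_by_hour(sample))  # {2: ['a', 'b'], 3: ['c']}
```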

Dmitriy was right both about what I was trying to do, and that it's an
inherently serial operation.
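The cumulative step itself boils down to a single ordered pass over the hourly totals. A rough Python sketch of what the streaming script (or an accumulator-style UDF) would compute, using invented sample data and ordering by hour:

```python
def cumulative_totals(hourly_totals):
    """hourly_totals: list of (hour, collected) pairs.
    Returns (hour, running_total) pairs -- inherently a serial scan,
    since each output depends on the sum of everything before it."""
    running = 0
    out = []
    for hour, collected in sorted(hourly_totals):  # order by hour first
        running += collected
        out.append((hour, running))
    return out

# Made-up hourly totals, deliberately out of order.
print(cumulative_totals([(1, 10), (0, 5), (2, 7)]))
# [(0, 5), (1, 15), (2, 22)]
```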

Thanks,
Kris

On Fri, Dec 17, 2010 at 06:32:38PM -0500, Zach Bailey wrote:
> 
>  I believe what you're trying to do is this. You have some sort of data, and 
> a timestamp:
> 
> 
> What you want to figure out is how many times each possible value of "data" 
> appears in a certain time period (say, hourly).
> 
> 
> Let's say data can have three possible string values: {'a', 'b', 'c'}
> 
> 
> For convenience's sake, your timestamp is a Unix UTC timestamp or an
> ISO-formatted date (I would strongly recommend using one of these, since
> there are already piggybank functions to slice and dice them).
> 
> 
> To accumulate all the times that the data 'a' appeared in an hour you would 
> do something like this:
> 
> 
> -- register piggybank.jar for ISO date functions
> REGISTER ./piggybank.jar;
> allData = LOAD ... AS (string:chararray, ts:long);
> -- convert ts to an ISO date, truncated to the hour
> allDataISODates = FOREACH allData GENERATE string,
>     org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToHour(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts))
>     AS isoHour;
> -- group by string and hour
> groupedByStringAndHour = GROUP allDataISODates BY (string, isoHour);
> -- count occurrences of each string in each hour
> stringHourCounts = FOREACH groupedByStringAndHour GENERATE group.string AS
>     string, group.isoHour AS isoHour, COUNT(allDataISODates.string) AS count;
> 
> 
> You will now have a relation that looks like:
> {'a', '2010-12-13T12:00:00', 2334}
> {'b', '2010-12-13T12:00:00', 123}
> {'c', '2010-12-13T12:00:00', 3}
> {'a', '2010-12-13T13:00:00', 34231}
> {'b', '2010-12-13T13:00:00', 34}
> {'c', '2010-12-13T13:00:00', 134}
> 
> 
> Is that the sort of thing you're looking to do?
> 
> -Zach
> 
> 
> On Friday, December 17, 2010 at 6:22 PM, Dmitriy Ryaboy wrote:
> 
> > What you are suggesting seems to be a fundamentally single-threaded process
> > (well, it can be parallelized, but it's not pretty and involves multiple
> > passes), so it's not a good fit for the map-reduce paradigm (how would you
> > do accumulative totals for 25 billion entries?). Pig tends to avoid
> > implementing methods that restrict scaling computations in this way. Your
> > idea of streaming through a script would work; you could also write an
> > accumulative UDF and use it on the result of doing a GROUP ALL on your
> > relation.
> > 
> > -Dmitriy
> > 
> > On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <k...@melon.org> wrote:
> > 
> > 
> > >  Hello,
> > > 
> > >  Is there some sort of mechanism by which I could cause a value to
> > >  accumulate within a relation? What I'd like to do is something along the
> > >  lines of having a long called accumulator, and an outer bag called
> > >  hourlyTotals with a schema of (hour:int, collected:int)
> > > 
> > >  accumulator = 0L; -- I know this line doesn't work
> > >  ORDER hourlyTotals BY collected;
> > >  cumulativeTotals = FOREACH hourlyTotals {
> > >  accumulator += collected;
> > >  GENERATE day, accumulator AS collected;
> > >  }
> > > 
> > >  Could something like this be made to work? Is there something similar 
> > > that
> > >  I can do instead? Do I just need to pipe the relation through an
> > >  external script to get what I want?
> > > 
> > >  Thanks,
> > >  Kris
> > > 
> > >  --
> > >  Kris Coward http://unripe.melon.org/
> > >  GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
> > > 
> > > 
> > > 
> > 
> > 
> > 
> > 
> 
> 
> 

-- 
Kris Coward                                     http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3
