Do you want to split on the chukwa payload fields or the fields in the record body?
I have scripts that do similar things with the body, using FILTER and a custom TOKENIZE UDF I wrote to tokenize the body content. I'm using the latest ChukwaLoader for Pig 0.7.0, but the previous one should work the same way:

    define chukwaLoader org.apache.hadoop.chukwa.pig.ChukwaLoader();
    define tokenize my.udfs.TOKENIZE();

    raw = LOAD '/your/path' USING chukwaLoader AS (ts: long, fields);
    bodies = FOREACH raw GENERATE tokenize((chararray)fields#'body') AS tokens,
             timePeriod(ts) AS time;
    bodies_this_period = FILTER bodies BY ((chararray)time == '[some timestamp]');
    STORE bodies_this_period INTO '/some/output/path';

From bodies_this_period you can access the different tokens using $0.token0, $0.token1, etc. I wrote TOKENIZE to return an ordered tuple of the values found, since Pig's built-in TOKENIZE returns an unordered bag, which isn't that useful in this case.

HTH,
Bill

On Mon, Oct 4, 2010 at 2:35 PM, Jerome Boulon <[email protected]> wrote:
> Hi Matt,
> When I designed this, the schema was NOT available in Pig. I'm not sure if
> this has changed or not.
> So I'm using the constructor as a way to get around the lack of schema
> definition, but if you can get it now from the query & the storage handler
> then it should be a pretty easy thing to do.
> So do you know if the SQL schema is now available in Pig?
>
> /Jerome.
>
> On 10/4/10 2:28 PM, "Matt Davies" <[email protected]> wrote:
>
> Hey all -
>
> I'm trying to do some operations utilizing Chukwa and Pig. I would like to
> basically:
>
> 1. Read in the data from HDFS
> 2. Do some SPLIT operations
> 3. Write the various files out with all the fields as seen during the
>    loading phase.
>
> So, my question is this - is there a way to utilize the
> org.apache.hadoop.chukwa.ChukwaStorage() engine to load in and then store
> out all the various fields without having to individually define fields in
> the ChukwaStorage constructor?
>
> Thanks,
> Matt
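[Editor's note: the custom TOKENIZE UDF referenced above is not included in the thread. Its core idea, returning tokens in input order rather than as Pig's unordered bag so callers can address them positionally ($0.token0, $0.token1, ...), might look roughly like this plain-Java sketch. The class and method names are illustrative, and the Pig EvalFunc/Tuple wrapper that a real UDF would need is omitted.]

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch of the ordered-tokenizing core that a custom
    // Pig TOKENIZE UDF could wrap. Names here are hypothetical, not
    // taken from the thread.
    public class OrderedTokenizer {

        // Split a record body on whitespace, preserving token order.
        public static List<String> tokenize(String body) {
            List<String> tokens = new ArrayList<>();
            if (body == null) {
                return tokens; // a real UDF might instead emit an empty tuple
            }
            for (String tok : body.trim().split("\\s+")) {
                if (!tok.isEmpty()) {
                    tokens.add(tok);
                }
            }
            return tokens;
        }
    }

In an actual UDF, exec(Tuple) would pass the body field through a method like this and pack the resulting list into a Tuple, so each token keeps a stable position in the output schema.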
