I believe my problem stems primarily from my inability to access
fields within a bag_of_tokenTuples. Here's an example:
> grunt> cc = load 'cloud-computing' using TextLoader() as line:chararray;
> grunt> frm = filter cc by ($0 matches '^From .*');
> grunt> frm2 = limit frm 2;
> grunt> frm2words = foreach frm2 generate TOKENIZE(line);
> grunt> dump frm2words
> ({(From),(grbounce-nptejauaaacwimcqbpj4db4q5z5lpobj=marco=escape....@googlegroups.com),(Thu),(Apr),(23),(10:28:54),(2009)})
> ({(From),(grbounce-nptejauaaacwimcqbpj4db4q5z5lpobj=marco=escape....@googlegroups.com),(Thu),(Apr),(23),(10:29:54),(2009)})
> grunt> frm2date = foreach frm2words generate $0.$3, $0.$4, $0.$6;
> 2009-06-11 08:46:55,977 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR
> 1000: Error during parsing. Out of bound access. Trying to access
> non-existent column: 3. Schema {tuple_of_tokens: (token: chararray)} has 1
> column(s).
> Details at logfile: /home/marco/pig_1244709828265.log
I don't quite know why there's only one column. I guess that column
could be the bag of tokens, but I've tried dereferencing into that
and gotten nowhere. Am I missing something fundamental?
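
To make the shape of the data concrete, here is a minimal sketch in
Python (not Pig) that models one frm2words record using only the
schema printed in the error message: a single column holding a bag,
where each tuple in the bag has exactly one field (the token). The
names record and project, and the "<sender>" placeholder, are
hypothetical. Reading $0.$3 as "take column 3 of every tuple in the
bag" is an assumption about how Pig projects into a bag, not
something the error message confirms.

```python
# Hypothetical model of one frm2words record: one column ($0) that holds a
# bag, and every tuple inside the bag has a single field (the token).
# "<sender>" stands in for the elided address in the dump above.
record = (
    [("From",), ("<sender>",), ("Thu",), ("Apr",), ("23",), ("10:28:54",), ("2009",)],
)


def project(bag, col):
    """Take field `col` from every tuple in the bag (assumed meaning of bag.$col)."""
    return [tup[col] for tup in bag]


bag = record[0]             # the only column in the record
tokens = project(bag, 0)    # field 0 of each tuple: the tokens themselves

print(len(record))                       # the record has one column
print(tokens[3], tokens[4], tokens[6])   # prints "Apr 23 2009"

try:
    project(bag, 3)         # each tuple has one field, so field 3 does not exist
except IndexError as exc:
    print("out of bound access:", exc)
```

Under this reading, the date pieces sit at positions 3, 4 and 6
*within the bag*, not at columns 3, 4 and 6 of any tuple, which would
match the "has 1 column(s)" complaint.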
Eventually I gave up fussing with bags of single-token tuples and
turned to PigStorage (way less efficient to split all records before
filtering!), which yielded a totally different problem.
Is it possible to get PigStorage to use anything other than a single
character as a field separator? Using PigStorage(' '), the two
strings "Jan 23" and "Jan  9" (the single-digit day is padded with an
extra space) are interpreted as two and three fields respectively.
Here's proof:
> grunt> cc = load 'cloud-computing' using PigStorage(' ');
> grunt> frm = filter cc by ($0 == 'From');
> grunt> flds = group frm by (ARITY(*));
> grunt> frmarity = foreach flds generate $0, COUNT($1);
> grunt> dump frmarity
> (8,531L)
> (9,314L)
Each line really has the same number of fields; it's just that some
have an extra space, which throws PigStorage off.
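
The effect can be reproduced outside Pig. The sketch below is plain
Python, not PigStorage itself; it assumes only that a single-character
separator treats every space as a field boundary, which is the
behavior described above. The function names are hypothetical.

```python
def split_on_single_space(text):
    """Split on every single space, so two adjacent spaces yield an empty field."""
    return text.split(" ")


def split_on_whitespace_runs(text):
    """Split on runs of whitespace, the way awk's default field splitting does."""
    return text.split()


for date in ("Jan 23", "Jan  9"):   # the second string has two spaces
    single = split_on_single_space(date)
    runs = split_on_whitespace_runs(date)
    print(repr(date), len(single), single, len(runs), runs)
```

With a single-space separator the padded date gains an empty field
("Jan", "", "9"), while splitting on whitespace runs gives two fields
for both strings.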
If you've read this far, you might as well see how I finally "solved"
this before, unsatisfied, I decided to write this e-mail:
> grunt> cc = load 'cloud-computing' using TextLoader() as line:chararray;
> grunt> frm = filter cc by ($0 matches '^From .*');
> grunt> frm2 = limit frm 2;
> grunt> frmdates = stream frm2 through `awk '{print $4,$5,$7}'`;
> grunt> dump frmdates
> (Apr 23 2009)
> (Apr 23 2009)
Terrible!
_______________________________________________________________________
Marco E. Nicosia | http://www.escape.org/~marco/