Santhosh Srinivasan ([email protected]) wrote:
> Hi Marco,
>
> 1. I opened a JIRA that addresses the request for multi-byte delimiters
> in PigStorage (https://issues.apache.org/jira/browse/PIG-842). Other
> users have made a similar request.

Thanks! I'll watch it.

> In order to access the elements of the tuple, you should flatten the
> bag. However, this does not suit your use case.

I think it's really interesting that _nobody_ever_ accesses TOKENIZE'd
data without flatten()?! Is there a language glitch here?

> Was it the performance of streaming that left you unsatisfied?
> Or
> Was it the fact that you had to use streaming and go out of the
> language?

Both, I guess. I definitely didn't like dropping out of the language for
something as seemingly basic as selecting fields from unstructured data.
Of course, dropping out of the language into streaming probably incurs
some heavy copy in/out penalty which is acceptable, but not very cool.

> We would like to hear your feedback.
>
> Thanks,
> Santhosh
>
> -----Original Message-----
> From: Marco Nicosia [mailto:[email protected]]
> Sent: Thursday, June 11, 2009 2:09 AM
> To: [email protected]
> Subject: Selecting fields from records with varying spaces?
>
> I believe that my question/problem primarily extends from my inability
> to access fields within a bag_of_tokenTuples. Here's an example:
>
> > grunt> cc = load 'cloud-computing' using TextLoader() as line:chararray;
> > grunt> frm = filter cc by ($0 matches '^From .*');
> > grunt> frm2 = limit frm 2;
> > grunt> frm2words = foreach frm2 generate TOKENIZE(line);
> > grunt> dump frm2words
> > ({(From),(grbounce-nptejauaaacwimcqbpj4db4q5z5lpobj=marco=escape....@googlegroups.com),(Thu),(Apr),(23),(10:28:54),(2009)})
> > ({(From),(grbounce-nptejauaaacwimcqbpj4db4q5z5lpobj=marco=escape....@googlegroups.com),(Thu),(Apr),(23),(10:29:54),(2009)})
> > grunt> frm2date = foreach frm2words generate $0.$3, $0.$4, $0.$6;
> > 2009-06-11 08:46:55,977 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Out of bound access. Trying to access non-existent column: 3. Schema {tuple_of_tokens: (token: chararray)} has 1 column(s).
> > Details at logfile: /home/marco/pig_1244709828265.log
>
> I don't know quite why there's only one column. I guess that column
> could be the bag of tokens, but I've tried dereferencing into that,
> and gotten nowhere. I must be missing something fundamental?
>
> Eventually I gave up fussing with bags of tokens which contain
> tuples, and turned to PigStorage (way less efficient to split all
> records before filtering!), which yielded a totally different problem.
>
> Is it possible to get PigStorage to use anything other than a single
> character as a field separator? Using PigStorage(' '), the two
> strings, "Jan 23" and "Jan  9" are interpreted as two and three
> fields respectively.
>
> Here's a proof:
>
> > grunt> cc = load 'cloud-computing' using PigStorage(' ');
> > grunt> frm = filter cc by ($0 == 'From');
> > grunt> flds = group frm by (ARITY(*));
> > grunt> frmarity = foreach flds generate $0, COUNT($1);
> > grunt> dump frmarity
> > (8,531L)
> > (9,314L)
>
> Each line really is the same number of fields, it's just that some
> have an extra space, which is messing PigStorage up.
>
> If you've read this far, you might as well see how I finally "solved"
> this, and unsatisfied, decided to write this e-mail:
>
> > grunt> cc = load 'cloud-computing' using TextLoader() as line:chararray;
> > grunt> frm = filter cc by ($0 matches '^From .*');
> > grunt> frm2 = limit frm 2;
> > grunt> frmdates = stream frm2 through `awk '{print $4,$5,$7}'`;
> > grunt> dump frmdates
> > (Apr 23 2009)
> > (Apr 23 2009)
>
> Terrible!
>
> _______________________________________________________________________
> Marco E. Nicosia | http://www.escape.org/~marco/ | [email protected]

_______________________________________________________________________
Marco E. Nicosia | http://www.escape.org/~marco/ | [email protected]
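
The field-count discrepancy discussed in this thread can be reproduced outside Pig. The following is a minimal Python sketch, not Pig code and not anything posted by the participants. The sample lines and the helper names (`split_single_space`, `split_whitespace_runs`, `select_date`) are made up for illustration and only mimic the shape of the "From " lines quoted above. It contrasts splitting on every single space, which is how a one-character delimiter behaves, with splitting on runs of whitespace, which is what awk's default field splitting does.

```python
# Hypothetical sample lines shaped like the mbox "From " lines in this thread.
# The second line pads a single-digit day with an extra space.
lines = [
    "From sender@example.com Thu Apr 23 10:28:54 2009",
    "From sender@example.com Thu Apr  9 10:28:54 2009",
]


def split_single_space(line):
    """Split on every single space, so a doubled space yields an empty field."""
    return line.split(" ")


def split_whitespace_runs(line):
    """Split on runs of whitespace, as awk's default field splitting does."""
    return line.split()


def select_date(line):
    """Pick month, day and year: fields 4, 5 and 7 in awk's 1-based numbering."""
    fields = split_whitespace_runs(line)
    return (fields[3], fields[4], fields[6])


# Field counts per line under each splitting rule.
single_counts = [len(split_single_space(line)) for line in lines]
run_counts = [len(split_whitespace_runs(line)) for line in lines]
dates = [select_date(line) for line in lines]

print(single_counts)
print(run_counts)
print(dates)
```

With a single-space delimiter the two sample lines produce different field counts, because the doubled space creates an empty field. Splitting on whitespace runs gives both lines the same count, so the date fields sit at stable positions.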
