Hi Marco,

1. I opened a JIRA that addresses the request for multi-byte delimiters in PigStorage (https://issues.apache.org/jira/browse/PIG-842). Other users have made a similar request.

2. TOKENIZE produces a bag of tuples; each tuple contains a single string. Bags are unordered collections of tuples, and the language does not support accessing the tuples in a bag by position. As a result, you will not be able to do things like:

    frm2date = foreach frm2words generate $0.$3, $0.$4, $0.$6;

To access the elements of the tuples, you would have to flatten the bag. However, that does not suit your use case.

3. I am glad to see that streaming helped you solve the problem. Was it the performance of streaming that left you unsatisfied? Or was it the fact that you had to use streaming and step outside the language? We would like to hear your feedback.

Thanks,
Santhosh

-----Original Message-----
From: Marco Nicosia [mailto:[email protected]]
Sent: Thursday, June 11, 2009 2:09 AM
To: [email protected]
Subject: Selecting fields from records with varying spaces?

I believe that my question/problem primarily stems from my inability to access fields within a bag_of_tokenTuples. Here's an example:

> grunt> cc = load 'cloud-computing' using TextLoader() as line:chararray;
> grunt> frm = filter cc by ($0 matches '^From .*');
> grunt> frm2 = limit frm 2;
> grunt> frm2words = foreach frm2 generate TOKENIZE(line);
> grunt> dump frm2words
> ({(From),(grbounce-nptejauaaacwimcqbpj4db4q5z5lpobj=marco=escape....@googlegroups.com),(Thu),(Apr),(23),(10:28:54),(2009)})
> ({(From),(grbounce-nptejauaaacwimcqbpj4db4q5z5lpobj=marco=escape....@googlegroups.com),(Thu),(Apr),(23),(10:29:54),(2009)})
> grunt> frm2date = foreach frm2words generate $0.$3, $0.$4, $0.$6;
> 2009-06-11 08:46:55,977 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Out of bound access. Trying to access non-existent column: 3. Schema {tuple_of_tokens: (token: chararray)} has 1 column(s).
> Details at logfile: /home/marco/pig_1244709828265.log

I don't know quite why there's only one column. I guess that column could be the bag of tokens, but I've tried dereferencing into that, and gotten nowhere.
I must be missing something fundamental?

Eventually I gave up fussing with bags of tokens which contain tuples, and turned to PigStorage (way less efficient to split all records before filtering!), which yielded a totally different problem.

Is it possible to get PigStorage to use anything other than a single character as a field separator? Using PigStorage(' '), the two strings, "Jan 23" and "Jan  9", are interpreted as two and three fields respectively. Here's a proof:

> grunt> cc = load 'cloud-computing' using PigStorage(' ');
> grunt> frm = filter cc by ($0 == 'From');
> grunt> flds = group frm by (ARITY(*));
> grunt> frmarity = foreach flds generate $0, COUNT($1);
> grunt> dump frmarity
> (8,531L)
> (9,314L)

Each line really has the same number of fields; it's just that some have an extra space, which is messing PigStorage up.

If you've read this far, you might as well see how I finally "solved" this, and, unsatisfied, decided to write this e-mail:

> grunt> cc = load 'cloud-computing' using TextLoader() as line:chararray;
> grunt> frm = filter cc by ($0 matches '^From .*');
> grunt> frm2 = limit frm 2;
> grunt> frmdates = stream frm2 through `awk '{print $4,$5,$7}'`;
> grunt> dump frmdates
> (Apr 23 2009)
> (Apr 23 2009)

Terrible!

_______________________________________________________________________
Marco E. Nicosia | http://www.escape.org/~marco/ | [email protected]
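
As a side note on the field counts discussed in this thread: the difference between splitting on a single space (the PigStorage(' ') behaviour described above) and splitting on runs of whitespace (what awk does by default) can be reproduced outside Pig. The following is only a rough Python sketch, not Pig code. The sample lines, the sender address, and the function names are all hypothetical, so the counts it produces (7 and 8) are not the 8 and 9 reported in the message.

```python
# Hypothetical sample lines modeled on the "From " lines in this thread.
# The address is a placeholder; the second line pads a one-digit day with
# an extra space, which is the situation the message describes.
TWO_DIGIT_DAY = "From sender@example.com Thu Apr 23 10:28:54 2009"
ONE_DIGIT_DAY = "From sender@example.com Thu Apr  9 10:28:54 2009"


def split_single_space(line):
    """Split on every single space, so two adjacent spaces yield an empty field."""
    return line.split(" ")


def split_whitespace_runs(line):
    """Split on runs of whitespace, as awk's default field splitting does."""
    return line.split()


def select_date(line):
    """Return the fields awk calls $4, $5 and $7 (month, day, year)."""
    fields = split_whitespace_runs(line)
    return fields[3], fields[4], fields[6]


if __name__ == "__main__":
    for sample in (TWO_DIGIT_DAY, ONE_DIGIT_DAY):
        print(len(split_single_space(sample)),
              len(split_whitespace_runs(sample)),
              select_date(sample))
```

With a single-space delimiter the two sample lines produce different field counts, while whitespace-run splitting gives the same count for both, which is why the awk pipeline selects the date fields consistently.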
