[ https://issues.apache.org/jira/browse/PIG-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olga Natkovich updated PIG-947:
-------------------------------

    Fix Version/s:     (was: 0.8.0)

I don't think anybody is signed up for this issue. Please relink it to the release if you are interested in working on it, and assign it to yourself.

> Parsing Bags by PigStorage is not handled correctly if whitespace before start of tuple.
> ----------------------------------------------------------------------------------------
>
>                 Key: PIG-947
>                 URL: https://issues.apache.org/jira/browse/PIG-947
>             Project: Pig
>          Issue Type: Bug
>          Components: data
>         Environment: Pig on Hadoop 18
>            Reporter: Gandul Azul
>
> The PigStorage parser for bags does not work correctly when a tuple in a bag is preceded by a space. For example, the following is parsed correctly:
> {(-5.243084,3.142401,0.000138,2.071200,0),(-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}
> while this is not: (Note the space before the second tuple)
> {(-5.243084,3.142401,0.000138,2.071200,0), (-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}
> It seems that when the parser encounters the space, it treats the rest of the line as a String. With a schema, this results in a typecast of string to databag, which results in an exception.
> |WARN builtin.PigStorage: Unable to interpret value [...@2c9b42e6 in field being converted to type bag, caught ParseException <Encountered " <STRING> " "" at |line 1, column 43.
> |Was expecting:
> |    "(" ...
> | > field discarded
> Below is the parser debug output for the parsing of the above error sequence: "2.071200,0), (" from above...
> ****** FOUND A <DOUBLENUMBER> MATCH (2.071200) ******
> Call: AtomDatum
> Consumed token: <<DOUBLENUMBER>: "2.071200" at line 1 column 31>
> Return: AtomDatum
> Return: Datum
> Matched the empty string as <STRING> token.
> Current character : , (44) at line 1 column 39
> No more string literal token matches are possible.
> Currently matched the first 1 characters as a "," token.
> ****** FOUND A "," MATCH (,) ******
> Consumed token: <"," at line 1 column 39>
> Call: Datum
> Matched the empty string as <STRING> token.
> Current character : 0 (48) at line 1 column 40
> No string literal matches possible.
> Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
> Current character : 0 (48) at line 1 column 40
> Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
> Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER>, <LONGINTEGER>, <FLOATNUMBER> }
> Current character : ) (41) at line 1 column 41
> Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
> Putting back 1 characters into the input stream.
> ****** FOUND A <SIGNEDINTEGER> MATCH (0) ******
> Call: AtomDatum
> Consumed token: <<SIGNEDINTEGER>: "0" at line 1 column 40>
> Return: AtomDatum
> Return: Datum
> Matched the empty string as <STRING> token.
> Current character : ) (41) at line 1 column 41
> No more string literal token matches are possible.
> Currently matched the first 1 characters as a ")" token.
> ****** FOUND A ")" MATCH ()) ******
> Return: Tuple
> Consumed token: <")" at line 1 column 41>
> Matched the empty string as <STRING> token.
> Current character : , (44) at line 1 column 42
> No more string literal token matches are possible.
> Currently matched the first 1 characters as a "," token.
> ****** FOUND A "," MATCH (,) ******
> Consumed token: <"," at line 1 column 42>
> Matched the empty string as <STRING> token.
> Current character :   (32) at line 1 column 43
> No string literal matches possible.
> Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
> Current character :   (32) at line 1 column 43
> Currently matched the first 1 characters as a <STRING> token.
> Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
> Current character : ( (40) at line 1 column 44
> Currently matched the first 1 characters as a <STRING> token.
> Putting back 1 characters into the input stream.
> ****** FOUND A <STRING> MATCH ( ) ******
> Return: Bag
> Return: Datum
> Return: Parse

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
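
Until the parser itself tolerates whitespace, one possible workaround for the problem described in the issue is to strip the offending whitespace from the data before Pig loads it. The following is only a minimal sketch under stated assumptions, not part of Pig: the function name normalize_bag is hypothetical, and the regex assumes that no chararray field legitimately contains a "," or "{" followed by whitespace and "(", since such a field would be altered too.

```python
import re

# Whitespace that follows "{" or "," and directly precedes "(".
# Assumption: this pattern only ever occurs between tuples, never inside a field value.
_GAP_BEFORE_TUPLE = re.compile(r"([{,])\s+(?=\()")


def normalize_bag(text: str) -> str:
    """Drop whitespace in front of each tuple in a bag literal (hypothetical helper)."""
    return _GAP_BEFORE_TUPLE.sub(r"\1", text)


if __name__ == "__main__":
    # The failing input from the issue, shortened to two tuples.
    bad = "{(-5.243084,3.142401,0.000138,2.071200,0), (-6.021349,0.992683,0.000044,0.992683,0)}"
    print(normalize_bag(bad))
```

Running this as a preprocessing step over the input file leaves well-formed bags unchanged and only rewrites the ", (" and "{ (" sequences that trip the <STRING> token match shown in the debug output.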