Parsing Bags by PigStorage is not handled correctly if whitespace before start of tuple. ----------------------------------------------------------------------------------------
Key: PIG-947 URL: https://issues.apache.org/jira/browse/PIG-947 Project: Pig Issue Type: Bug Components: data Environment: Pig on Hadoop 18 Reporter: Gandul Azul PigStorage parser for bags is not working correctly when a tuple in a bag is proceeded by a space. For example, the following is parsed correctly: {(-5.243084,3.142401,0.000138,2.071200,0),(-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)} while this is not: (Note the space before the second tuple) {(-5.243084,3.142401,0.000138,2.071200,0), (-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)} It seems that the parser when it encounters the space, treats the rest of the line as a String. With a schema, this results in a typecast of string to databag which results in exception. Accordingly, because of this, when using pigstorage to output a bag, it cannot be loaded using pigstorage because of this inconsistency. |WARN builtin.PigStorage: Unable to interpret value [...@2c9b42e6 in field being converted to type bag, caught ParseException <Encountered " <STRING> " "" at |line 1, column 43. |Was expecting: | "(" ... | > field discarded Below is the parser debug output for the parsing of the above error sequence: "2.071200,0), (" from above... ****** FOUND A <DOUBLENUMBER> MATCH (2.071200) ****** Call: AtomDatum Consumed token: <<DOUBLENUMBER>: "2.071200" at line 1 column 31> Return: AtomDatum Return: Datum Matched the empty string as <STRING> token. Current character : , (44) at line 1 column 39 No more string literal token matches are possible. Currently matched the first 1 characters as a "," token. ****** FOUND A "," MATCH (,) ****** Consumed token: <"," at line 1 column 39> Call: Datum Matched the empty string as <STRING> token. Current character : 0 (48) at line 1 column 40 No string literal matches possible. Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> } Current character : 0 (48) at line 1 column 40 Currently matched the first 1 characters as a <SIGNEDINTEGER> token. Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER>, <LONGINTEGER>, <FLOATNUMBER> } Current character : ) (41) at line 1 column 41 Currently matched the first 1 characters as a <SIGNEDINTEGER> token. Putting back 1 characters into the input stream. ****** FOUND A <SIGNEDINTEGER> MATCH (0) ****** Call: AtomDatum Consumed token: <<SIGNEDINTEGER>: "0" at line 1 column 40> Return: AtomDatum Return: Datum Matched the empty string as <STRING> token. Current character : ) (41) at line 1 column 41 No more string literal token matches are possible. Currently matched the first 1 characters as a ")" token. ****** FOUND A ")" MATCH ()) ****** Return: Tuple Consumed token: <")" at line 1 column 41> Matched the empty string as <STRING> token. Current character : , (44) at line 1 column 42 No more string literal token matches are possible. Currently matched the first 1 characters as a "," token. ****** FOUND A "," MATCH (,) ****** Consumed token: <"," at line 1 column 42> Matched the empty string as <STRING> token. Current character : (32) at line 1 column 43 No string literal matches possible. Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> } Current character : (32) at line 1 column 43 Currently matched the first 1 characters as a <STRING> token. Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> } Current character : ( (40) at line 1 column 44 Currently matched the first 1 characters as a <STRING> token. Putting back 1 characters into the input stream. ****** FOUND A <STRING> MATCH ( ) ****** Return: Bag Return: Datum Return: Parse -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.