Hi all,
I'm having some issues working with records containing newlines after
deserialization. Some background:

1. The serialized records do not contain newlines (they are base64-encoded
   protobuf messages).
2. The deserialized records DO contain newlines (our SerDe base64-decodes
   them and extracts fields from the reconstructed protobuf object).

The issue we are seeing is that the records in our output are being
incorrectly delimited by the newlines in the deserialized strings:

    hive> SELECT json_data FROM foo_apa LIMIT 10;
    ...
    OK
    { "fluff": "some fluff to give us a newline",
      "user": "user187144073220" }
    { "fluff": "some fluff to give us a newline",
      "user": "user187144199985" }
    { "fluff": "some fluff to give us a newline",

Note that there are 10 lines instead of 10 records (each JSON dictionary is
one record here). When selecting more than one field, this leads to even
more garbled results, with random NULLs in the other fields (which I know
are not null).

This sort of issue can be reproduced even without any newlines in the data
itself. Consider this table:

    CREATE TABLE blah (foo int)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat';

If you do:
    hive> SELECT "jfjfjf \n jasdfsadf" FROM blah LIMIT 10;
you get:

    ...
    OK
    jfjfjf
     jasdfsadf
    jfjfjf
     jasdfsadf
    jfjfjf
     jasdfsadf
    jfjfjf
     jasdfsadf
    jfjfjf
     jasdfsadf

Note again that there are 10 lines instead of 10 records. Obviously, there's
some point in the query execution that treats the newlines as record
separators. Based on the EXPLAIN for the above query, I suspect it's the
"File Output Operator", which uses org.apache.hadoop.mapred.TextInputFormat
as its input format, though I'm not too sure at this point.

Has anyone run into this, or have any suggestions?

Thanks for the help!
Andrew
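P.S. In case it helps anyone reproduce the symptom outside Hive, here is a
minimal Python sketch of what I believe is happening (this is my own
illustration, not Hive code): a line-oriented text format joins records with
'\n', so embedded newlines in field values become indistinguishable from
record delimiters when the output is read back line by line.

```python
# Sketch (not Hive code): simulate a text output format that delimits
# records with '\n', fed records whose values contain embedded newlines.
records = [
    '{ "fluff": "some fluff to give us a newline",\n  "user": "user187144073220" }',
    '{ "fluff": "some fluff to give us a newline",\n  "user": "user187144199985" }',
]

# The writer joins records with the record delimiter '\n'...
serialized = "\n".join(records)

# ...so a line-oriented reader sees one "record" per line, not one per
# original record: the embedded newlines became extra delimiters.
lines = serialized.split("\n")

print(len(records))  # 2 records in
print(len(lines))    # 4 "records" out
```

The same record count vs. line count mismatch is exactly what the LIMIT 10
queries above show: 10 lines of output, fewer than 10 actual records.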