Hi all,

I'm having some issues working with records that contain newlines after
deserialization. Some information:

1. The serialized records do not contain newlines (they are base64-encoded
protobuf messages).
2. The deserialized records DO contain newlines (our SerDe base64-decodes
them and extracts fields from the reconstructed protobuf object).
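To illustrate the round trip (a minimal sketch, not our actual SerDe — the protobuf step is elided and a plain string stands in for the message):

```python
import base64

# A field value with embedded newlines, standing in for a deserialized
# protobuf field (the real records are protobuf messages).
record = '{\n  "fluff": "some fluff to give us a newline"\n}'

# Serialized form: base64 output uses only [A-Za-z0-9+/=], so the
# on-disk records contain no newlines at all.
encoded = base64.b64encode(record.encode("utf-8")).decode("ascii")
assert "\n" not in encoded

# Deserialized form: the newlines come back, so anything downstream
# that splits records on '\n' sees three lines instead of one record.
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded.count("\n") == 2
```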

The issue we are seeing is this: the records in our output are being
incorrectly delimited by the newlines in the deserialized strings.

hive> SELECT json_data FROM foo_apa LIMIT 10;
...
OK
{
  "fluff": "some fluff to give us a newline",
  "user": "user187144073220"
}
{
  "fluff": "some fluff to give us a newline",
  "user": "user187144199985"
}
{
  "fluff": "some fluff to give us a newline",

Note that there are 10 lines instead of 10 records (each JSON dictionary
is a record here).

When selecting more than one field, the results are even more garbled,
with spurious NULLs in the other fields (which I know are not null).

This sort of issue can be reproduced even without any newlines in the
data itself. Consider this table:

CREATE  TABLE blah (foo int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat';

If you do:

hive> SELECT "jfjfjf \n jasdfsadf" FROM blah LIMIT 10;
...
OK
jfjfjf
 jasdfsadf
jfjfjf
 jasdfsadf
jfjfjf
 jasdfsadf
jfjfjf
 jasdfsadf
jfjfjf
 jasdfsadf

Note again that there are 10 lines, instead of 10 records.

Obviously, there's some point in the query execution that treats the
newlines as record separators. Based on the EXPLAIN for the above query,
I suspect it's the "File Output Operator", which uses
org.apache.hadoop.mapred.TextInputFormat as its input format, though I'm
not too sure at this point.
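One workaround I'm considering is escaping embedded newlines into the two-character sequence \n before the row reaches the text writer, and unescaping on read. A rough sketch in Python rather than our Java SerDe (the function names are mine, not from any Hive API):

```python
def escape_newlines(field: str) -> str:
    # Escape the escape character first, then newlines, so the
    # transformation is unambiguous and reversible.
    return field.replace("\\", "\\\\").replace("\n", "\\n")

def unescape_newlines(field: str) -> str:
    # Walk the string so "\\n" (escaped backslash + n) is not
    # accidentally turned back into a newline.
    out = []
    i = 0
    while i < len(field):
        if field[i] == "\\" and i + 1 < len(field):
            out.append("\n" if field[i + 1] == "n" else field[i + 1])
            i += 2
        else:
            out.append(field[i])
            i += 1
    return "".join(out)

value = "jfjfjf \n jasdfsadf"
escaped = escape_newlines(value)
assert "\n" not in escaped          # safe for one-record-per-line output
assert unescape_newlines(escaped) == value
```

This only helps if every consumer of the output agrees to unescape, of course; it doesn't fix whichever operator is doing the splitting.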

Has anyone run into this/have any suggestions?

Thanks for the help!

Andrew
