Thanks Nick - I don't think there should be any newline chars; the columns that are turning null are mostly Decimal, and none of my strings are longer than one word (no spaces).
And again: the same row is correct if I export a small set of data, and incorrect if I export a large set - so I think that file/data size has something to do with this.

That said, I think you have the right idea in general. I just tested, and if I take the ORC table, copy it to a SequenceFile table, and then copy that to a CSV "file", everything looks good. Is that going to be my best workaround for this? (A rough sketch of what I did is below the quoted message.)

-----Original Message-----
From: Nicholas Hakobian [mailto:[email protected]]
Sent: Thursday, January 28, 2016 3:15 PM
To: [email protected]
Subject: Re: "Create external table" nulling data from source table

Do you have any fields with embedded newline characters? If so, certain Hive output formats will parse the newline character as the end of a row, and when importing, chances are the missing fields (now part of the next row) will be padded with nulls. This happens in Hive as well if you are using a TextFile intermediate format (SequenceFile does not have this problem).

-Nick

Nicholas Szandor Hakobian
Data Scientist
Rally Health
[email protected]

On Thu, Jan 28, 2016 at 12:05 PM, Riesland, Zack <[email protected]> wrote:
> First time posting to this list. Please forgive me if I break
> etiquette. I'm looking for some help with getting data from Hive to HBase.
>
> I'm using HDP 2.2.8.
>
> I have a compressed (zlib), ORC-based Hive table with 12 columns and
> billions of rows.
>
> In order to get the data into HBase, I have to create a copy of the
> table as an "external" table, backed by CSV files (unless someone
> knows a better way).
>
> Then, I use the CsvBulkLoad MapReduce job to create HFiles from the
> CSV files backing the external table.
>
> I've been doing this for almost a year, and MOST of the data ends up
> correct, but if I export a large amount of data, I end up with nulls
> where I shouldn't.
>
> If I run the exact same query on the source table (compressed ORC) and
> the destination table (external text), I get null values in the results
> of the latter, but not the former.
>
> However, if I only copy a small subset of the data to the text-based
> table, all the data is correct.
>
> I also noticed that if I use an uncompressed source table and then
> copy to an external text-based table, it happens much more often.
>
> So my (not-very-educated) guess is that this has to do with ORC files.
>
> I know that there are alternatives to ORC, but Hortonworks strongly
> encourages us to use ORC for everything. I'm not even sure whether
> Parquet works with HDP.
>
> Anyway, is this a known bug?
>
> Any ideas on how I can get around it without chopping up my data into
> multiple tables?
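
For reference, here is a rough sketch of the SequenceFile workaround I tested. The table names, columns, and HDFS location are made up for illustration (my real table has 12 columns), so treat this as the shape of the process rather than the exact DDL:

    -- 1. Copy the ORC source table into a SequenceFile-backed intermediate table
    CREATE TABLE my_table_seq
    STORED AS SEQUENCEFILE
    AS SELECT * FROM my_table_orc;

    -- 2. Create the CSV-backed external table (same schema as the source)
    CREATE EXTERNAL TABLE my_table_csv (
      id       BIGINT,
      reading  DECIMAL(18,6),
      tag      STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/apps/hive/warehouse/my_table_csv';

    -- 3. Copy from the SequenceFile intermediate into the CSV external table
    INSERT OVERWRITE TABLE my_table_csv
    SELECT * FROM my_table_seq;

If I follow what you said about TextFile vs. SequenceFile, the intermediate SequenceFile table would be what dodges the end-of-row issue you described, since SequenceFile doesn't treat the data the way a text intermediate format does.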
