I had a very large Hive table that I needed in HBase.

After asking around, I came to the conclusion that my best bet was to:

1 - Export the Hive table to a CSV 'file'/folder on HDFS
2 - Use the org.apache.phoenix.mapreduce.CsvBulkLoadTool to import the data
    (a rough sketch of both steps is below)
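For reference, this is roughly what I ran for both steps. The table, column,
and path names here are just placeholders, and the exact Hive export syntax
may vary with your Hive version:

  -- Step 1 (Hive): dump the table as comma-delimited text files on HDFS
  INSERT OVERWRITE DIRECTORY '/tmp/my_table_csv'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    SELECT * FROM my_table;

  # Step 2: bulk-load those files into the Phoenix/HBase table
  hadoop jar phoenix-<version>-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table MY_TABLE \
    --input /tmp/my_table_csv \
    --zookeeper zk-host1,zk-host2,zk-host3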

I found that if I tried to pass the entire folder (~ 1/2 TB of data) to the 
CsvBulkLoadTool, my job would eventually fail.

Empirically, it seems that on our particular cluster, 20-30GB of data is the 
most that the CsvBulkLoadTool can handle at one time without so many map jobs 
timing out that the entire operation fails.

So I passed one sub-file at a time and eventually got all the data into HBase.
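In case it is relevant, the per-file loads were driven by a shell loop along
these lines (again, the paths, table name, and ZooKeeper quorum are
placeholders):

  # Feed the exported CSV files to the bulk loader one at a time
  for f in $(hdfs dfs -ls /tmp/my_table_csv | awk '/^-/ {print $NF}'); do
    hadoop jar phoenix-<version>-client.jar \
      org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      --table MY_TABLE \
      --input "$f" \
      --zookeeper zk-host1,zk-host2,zk-host3
  done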

I tried doing a SELECT COUNT(*) on the Phoenix table to check whether all of 
the rows were transferred, but that query also eventually fails.
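Concretely, the count I tried (table name is a placeholder) was just:

  -- From sqlline.py, connected to Phoenix
  SELECT COUNT(*) FROM MY_TABLE;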

Today, I believe I found a set of data that is in Hive but NOT in HBase.

So, I have 2 questions:

1) Are there any known issues with the CsvBulkLoadTool that could cause it to 
silently skip some data without surfacing any kind of error?

2) Is there a straightforward way to count the rows in my Phoenix table so that 
I can compare the Hive table with the HBase table?

Thanks in advance!
