I had a very large Hive table that I needed in HBase. After asking around, I came to the conclusion that my best bet was to:
1 - Export the Hive table to a CSV 'file'/folder on HDFS.
2 - Use org.apache.phoenix.mapreduce.CsvBulkLoadTool to import the data.

(Rough sketches of the commands I ran are at the end of this post.)

I found that if I tried to pass the entire folder (~ 1/2 TB of data) to the CsvBulkLoadTool, the job would eventually fail. Empirically, it seems that on our particular cluster, 20-30 GB of data is the most the CsvBulkLoadTool can handle at one time without so many map tasks timing out that the entire operation fails. So I passed one sub-file at a time and eventually got all the data into HBase.

I tried doing a SELECT COUNT(*) on the table to see whether all of the rows were transferred, but that query eventually fails too. Today, I believe I found a set of data that is in Hive but NOT in HBase.

So, I have 2 questions:

1) Are there any known issues with the CsvBulkLoadTool that could cause it to skip some data silently, without surfacing any kind of error?
2) Is there a straightforward way to count the rows in my Phoenix table so that I can compare the Hive table with the HBase table?

Thanks in advance!
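For context, step 1 looked roughly like the following. This is a sketch, not my exact command: the table name and output path are placeholders, and ROW FORMAT DELIMITED on INSERT OVERWRITE DIRECTORY needs Hive 0.11 or later.

    # Hypothetical export sketch -- my_table and the HDFS path are placeholders.
    # Dumps the Hive table as comma-delimited text files under one HDFS folder.
    hive -e "
      INSERT OVERWRITE DIRECTORY '/user/me/my_table_csv'
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      SELECT * FROM my_table;
    "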
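Step 2 was then run once per sub-file rather than on the whole folder. Again a sketch: the jar path, table name, ZooKeeper quorum, input path, and the exact classpath setup are placeholders for whatever your cluster uses.

    # Hypothetical bulk-load sketch for ONE sub-file of the export.
    # --table, --input, and --zookeeper are documented CsvBulkLoadTool options;
    # everything else here is a placeholder.
    HADOOP_CLASSPATH=$(hbase classpath) \
    hadoop jar /path/to/phoenix-<version>-client.jar \
      org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      --table MY_TABLE \
      --input /user/me/my_table_csv/part-00000 \
      --zookeeper zk1,zk2,zk3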
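And for question 2, the count that fails is just the obvious Phoenix query, run through sqlline (table name and ZooKeeper host are placeholders):

    # The count attempt that eventually fails on our cluster.
    echo 'SELECT COUNT(*) FROM MY_TABLE;' > count.sql
    sqlline.py zk1:2181 count.sql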