Hi, Zack, Can you tell how many reducer were used? I feel that is also a key factor. I do not know Phoenix's csvbulkupload utility, is it better than HBase's native importtsv utility?
From what I used 'importtsv', the reducer number is very important, the utility first check the target hbase table about how many regions there. And start same number of reducer, if the target hbase table is not pre-created or pre-split, it will start only one reducer, which is very slow. A good reducer number is the total number of disks. 16*13 in your cases , so maybe you should presplit your hbase table with that many regions. And you should also use compression in your target hbase table, which will reduce the total writing bytes a lot and much better performance. And as I understand, with 13 nodes, you spent 3 hours to load 2T hive data (6,000 300~400m src files) into hbase, is this correct? Or 3 hours just finish the mapper's job, not including reducer? If you just ask how to speed up the mapper speed, sorry that my answer make no sense, and I don't have any better idea. As for hdfs file size, I agree with Jorn franke that larger file is better than a lot of small files. The mapper can still run as many as possible by reading different data blocks, but the name node will be very happy. But I feel it has very little improvement from bulkloading point of view. 4G, 40G or 400M should not have visible difference in loading speed. Time spend with NameNode in the bulk loading process is very tiny part. Not sure, better you can test it. As a reference, I have a 8 data nodes cluster, and 5 SSD disks each node. With importtsv, the best loading speed I can get up to now is about 350G/hour, so it will take me about 6 hours to finish your job. 350G is the raw size of the original csv file, not total size in HDFS which multiplied by 3 (the replica). If I am right, your loading speed is already double of mine, so I cannot give you any advices and want to ask for your help :-) I am struggled to achieve a better bulk loading speed as well. As a developer of Apache Trafodion, I am also using its bulkloader, and its bulkloader will have about 3x speed up of loading same data from hive to similar hbase target table than the importtsv. But the target format is encoded by Trafodion, one can only access those data via Trafodion's SQL interface, JDBC/ODBC etc. So maybe not an option for you. I will try Phoenix's bulkloader asap. I also heard a Spark bulkloader utility, but not able to try it yet. Hope to keep touch with you. Thanks, Ming -----邮件原件----- 发件人: Riesland, Zack [mailto:zack.riesl...@sensus.com] 发送时间: 2016年2月13日 23:07 收件人: user@hive.apache.org 主题: RE: Optimizing external table structure Thanks. We have 16 disks per node, to answer your question. ________________________________________ From: Jörn Franke [jornfra...@gmail.com] Sent: Saturday, February 13, 2016 9:46 AM To: user@hive.apache.org Subject: Re: Optimizing external table structure How many disk drives do you have / node? Generally one node should have 12 drives not configured as raid and not configured as lvm. Files could be a little bit larger (4 or better 40 gb - your namenode will thank you) or use Hadoop Archive (HAR). I am not sure about the latest status of Phoenix but maybe you can can make hbase tables directly available as external tables in Hive - you would save a lot of time by not converting to csv. You could also explore using sqoop (import from hive to jdbc / Phoenix or to hbase). On 13 Feb 2016, at 13:41, Riesland, Zack <zack.riesl...@sensus.com<mailto:zack.riesl...@sensus.com>> wrote: On a daily basis, we move large amounts of data from hive to hbase, via phoenix. In order to do this, we create an external hive table with the data we need to move (all a subset of 1 compressed ORC table), and then use the Phoenix CsvBulkUpload utility. From everything I've read, this is the best approach. My question is: how can I optimize my external table to make the bulk upload as efficient as possible? For example, today, my external table is backed by 6,020 files in HDFS, each about 300-400mb. This results in a mapreduce operation with 12,209 mappers that takes about 3 hours (we don't have a huge cluster – 13 data nodes currently). Would it be better to have more, smaller files? Fewer, larger files?