Hi, Zack,

Can you tell me how many reducers were used? I feel that is also a key factor.
I'm not familiar with Phoenix's CsvBulkUpload utility; is it better than HBase's
native importtsv utility?

In my experience with importtsv, the reducer count is very important. The utility
first checks how many regions the target HBase table has and starts the same
number of reducers, so if the target table is not pre-created or pre-split, it
starts only one reducer, which is very slow. A good reducer count is the total
number of disks, 16*13 = 208 in your case, so you may want to pre-split your
HBase table into that many regions.
You should also enable compression on the target HBase table, which reduces the
total bytes written a lot and gives much better performance. The sketch below
shows both.
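
To make that concrete, here is a minimal sketch using the HBase 1.x client API
that creates a pre-split table with compression enabled. The table name
"my_table", the column family "cf", SNAPPY, and HexStringSplit are only
assumptions for illustration; use whatever matches your schema and row-key
distribution (the same can be done from the HBase shell):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.io.compress.Compression;
    import org.apache.hadoop.hbase.util.RegionSplitter;

    public class PreSplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          // "my_table" and "cf" are placeholders for the real table and column family.
          HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("my_table"));
          HColumnDescriptor family = new HColumnDescriptor("cf");
          // Compression reduces the bytes written during the bulk load.
          family.setCompressionType(Compression.Algorithm.SNAPPY);
          desc.addFamily(family);
          // 208 regions = 16 disks * 13 nodes. HexStringSplit assumes hex-like row
          // keys; pick a split algorithm that matches your key distribution.
          byte[][] splits = new RegionSplitter.HexStringSplit().split(208);
          admin.createTable(desc, splits);
        }
      }
    }

Since the bulk-load job creates one reducer per region, pre-splitting this way is
also what gets you 208 reducers instead of one.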

If I understand correctly, with 13 nodes you spent 3 hours loading about 2 TB of
Hive data (roughly 6,000 source files of 300-400 MB each) into HBase, is that
right? Or did the 3 hours only cover the mappers, not the reducers? If you are
only asking how to speed up the mappers, then I'm sorry my answer doesn't apply,
and I don't have any better idea there.

As for HDFS file size, I agree with Jörn Franke that larger files are better than
a lot of small files. The job can still run just as many mappers, since each
mapper reads a different data block, but the NameNode will be much happier.
However, I feel it brings very little improvement from a bulk-loading point of
view: 400 MB, 4 GB, or 40 GB files should show no visible difference in loading
speed, because the time spent talking to the NameNode is a tiny part of the
bulk-load process. I'm not sure, though; it's better if you can test it.

As a reference, I have an 8-data-node cluster with 5 SSDs per node. With
importtsv, the best loading speed I have reached so far is about 350 GB/hour, so
your job would take me about 6 hours. That 350 GB is the raw size of the original
CSV files, not the total size in HDFS (which is multiplied by 3 for replication).
If I am right, your loading speed is already double mine, so I cannot give you
any advice and would rather ask for your help :-)
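
For what it's worth, those numbers come from the usual two-step flow: ImportTsv
run with -Dimporttsv.bulk.output=<dir> to write HFiles, then completebulkload to
hand the files to the region servers. A rough sketch of driving that second step
from Java, assuming an HBase 1.1+ client; the HFile directory and table name are
placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class CompleteBulkLoad {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Placeholder: the ImportTsv output directory (-Dimporttsv.bulk.output=...).
        Path hfileDir = new Path("/tmp/hfiles");
        TableName name = TableName.valueOf("my_table");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin();
             Table table = conn.getTable(name);
             RegionLocator locator = conn.getRegionLocator(name)) {
          // Moves the generated HFiles into the table's regions.
          new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, admin, table, locator);
        }
      }
    }

This second step mostly just moves HFiles into place, so nearly all of the time
goes to the MapReduce job itself.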

I am struggling to achieve a better bulk-loading speed as well. As a developer of
Apache Trafodion, I also use its bulk loader, which loads the same data from Hive
into a similar HBase target table about 3x faster than importtsv. However, the
target format is encoded by Trafodion, so the data can only be accessed through
Trafodion's SQL interface (JDBC/ODBC, etc.), which may not be an option for you.

I will try Phoenix's bulk loader as soon as possible. I have also heard of a
Spark bulk-load utility, but I have not been able to try it yet. I hope to keep
in touch with you.

Thanks,
Ming


-----Original Message-----
From: Riesland, Zack [mailto:zack.riesl...@sensus.com]
Sent: February 13, 2016 23:07
To: user@hive.apache.org
Subject: RE: Optimizing external table structure

Thanks.

We have 16 disks per node, to answer your question.
________________________________________
From: Jörn Franke [jornfra...@gmail.com]
Sent: Saturday, February 13, 2016 9:46 AM
To: user@hive.apache.org
Subject: Re: Optimizing external table structure

How many disk drives do you have per node?
Generally one node should have 12 drives, not configured as RAID and not
configured as LVM.

Files could be a little bit larger (4 GB or, better, 40 GB - your NameNode will
thank you), or use Hadoop Archive (HAR).

I am not sure about the latest status of Phoenix, but maybe you can make HBase
tables directly available as external tables in Hive - you would save a lot of
time by not converting to CSV.
You could also explore using Sqoop (import from Hive to JDBC / Phoenix or to
HBase).

On 13 Feb 2016, at 13:41, Riesland, Zack <zack.riesl...@sensus.com> wrote:

On a daily basis, we move large amounts of data from Hive to HBase, via Phoenix.

In order to do this, we create an external Hive table with the data we need to
move (all a subset of one compressed ORC table), and then use the Phoenix
CsvBulkUpload utility. From everything I've read, this is the best approach.

My question is: how can I optimize my external table to make the bulk upload as 
efficient as possible?

For example, today, my external table is backed by 6,020 files in HDFS, each 
about 300-400mb.

This results in a mapreduce operation with 12,209 mappers that takes about 3 
hours (we don't have a huge cluster – 13 data nodes currently).

Would it be better to have more, smaller files? Fewer, larger files?
