RE: Pattern for Bulk Loading to Remote HBase Cluster

ashish singhi Wed, 08 Mar 2017 21:14:08 -0800

Hi,

Did you try pointing the importtsv output path at the remote cluster's HDFS?

Regards,
Ashish

-----Original Message-----
From: Ben Roling [mailto:ben.rol...@gmail.com] 
Sent: 09 March 2017 03:22
To: user@hbase.apache.org
Subject: Pattern for Bulk Loading to Remote HBase Cluster

My organization is looking at making some changes that would introduce HBase 
bulk loads that write into a remote cluster.  Today our bulk loads write to a 
local HBase.  By local, I mean the home directory of the user preparing and 
executing the bulk load is on the same HDFS filesystem as the HBase cluster.  
In the remote cluster case, the HBase being loaded to will be on a different 
HDFS filesystem.

The thing I am wondering about is what the best pattern is for determining the 
location to write HFiles to from the job preparing the bulk load.
Typical examples write the HFiles somewhere in the user's home directory.
When HBase is local, that works perfectly well.  With remote HBase, it can 
work, but the files end up being written twice: once by the preparation job 
and again by the RegionServer, which reacts to the bulk load by copying the 
HFiles onto the filesystem it is running on.

Ideally the preparation job would have some mechanism to know where to write 
the files such that they are initially written on the same filesystem as HBase 
itself.  This way the bulk load can simply move them into the HBase storage 
directory, as happens when bulk loading to a local cluster.

I've considered a pattern where the bulk load preparation job reads the 
hbase.rootdir property and takes the filesystem from it.  It then writes 
its output to some directory (e.g. /tmp) on that same filesystem.
I'm inclined to think that hbase.rootdir should only be considered a 
server-side property and as such I shouldn't expect it to be present in client 
configuration.  Under that assumption, this isn't really a workable strategy.
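To make the hbase.rootdir idea concrete, here is a minimal sketch of the 
derivation step only.  It assumes the value of hbase.rootdir has somehow been 
made available to the client, which is exactly the part in doubt.  The 
function name, the example host name, and the /tmp default are illustrative 
assumptions, not anything HBase provides.

```python
from urllib.parse import urlsplit


def staging_dir_on_hbase_fs(hbase_rootdir: str, staging_path: str = "/tmp") -> str:
    """Return staging_path on the same filesystem as hbase_rootdir.

    The "filesystem" is taken to be the scheme plus authority of the
    hbase.rootdir URI, e.g. hdfs://namenode:8020 (hypothetical helper).
    """
    parts = urlsplit(hbase_rootdir)
    if not parts.scheme:
        raise ValueError(f"hbase.rootdir has no filesystem scheme: {hbase_rootdir!r}")
    # Normalize so the staging path is always absolute on the target filesystem.
    path = "/" + staging_path.lstrip("/")
    return f"{parts.scheme}://{parts.netloc}{path}"


if __name__ == "__main__":
    # Example value only; a real job would read it from its configuration.
    print(staging_dir_on_hbase_fs("hdfs://remote-nn.example.com:8020/hbase"))
```

The job would then write its HFiles under the returned location instead of 
under the user's home directory.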

It feels like HBase should have a mechanism for sharing a staging directory 
with clients doing bulk loads.  Doing some searching, I ran across 
"hbase.bulkload.staging.dir", but my impression is that its intent does not 
exactly align with mine.  I've read about it here [1].  It seems the idea is 
that users prepare HFiles in their own directory, then SecureBulkLoad moves 
them to "hbase.bulkload.staging.dir".  A move like that isn't really a move 
when dealing with a remote HBase cluster.  Instead it is a copy.  That raises 
a question: why doesn't the job just write the files to 
"hbase.bulkload.staging.dir" in the first place and skip the extra step of 
moving them?

I've been inclined to invent my own application-specific Hadoop property to 
communicate an HBase-local staging directory to my bulk load preparation 
jobs.  I don't feel entirely good about that idea, though.  I'm curious to hear 
experiences or opinions from others.  Should I have my bulk load prep jobs look 
at "hbase.rootdir" or "hbase.bulkload.staging.dir" and make sure those get 
propagated to client configuration?  Is there some other mechanism that already 
exists for clients to discover an HBase-local directory to write the files?
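For illustration, the lookup I have in mind could be sketched as below.  The 
job configuration is modeled as a plain dictionary, and 
"myapp.bulkload.staging.dir" is a made-up, application-specific property name.  
Whether "hbase.bulkload.staging.dir" is even appropriate as a fallback is part 
of what I am asking, so treat the candidate order as an assumption.

```python
from typing import Mapping, Optional, Sequence

# Candidate properties, checked in order of preference.
CANDIDATE_KEYS = (
    "myapp.bulkload.staging.dir",  # hypothetical application-specific property
    "hbase.bulkload.staging.dir",  # only useful if propagated to the client
)


def resolve_staging_dir(
    conf: Mapping[str, str], keys: Sequence[str] = CANDIDATE_KEYS
) -> Optional[str]:
    """Return the first non-empty staging directory found in conf, else None."""
    for key in keys:
        value = conf.get(key, "").strip()
        if value:
            return value
    return None


if __name__ == "__main__":
    # Example configuration; the host name and path are placeholders.
    conf = {"myapp.bulkload.staging.dir": "hdfs://remote-nn.example.com:8020/staging"}
    print(resolve_staging_dir(conf))
```

A None result would mean the job has nowhere HBase-local to write and must 
fall back to the user's home directory, accepting the extra copy.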

[1] http://hbase.apache.org/book.html#hbase.secure.bulkload
