My organization is considering changes that would introduce HBase bulk
loads writing into a remote cluster.  Today our bulk loads write to a local
HBase.  By "local" I mean that the home directory of the user preparing and
executing the bulk load is on the same HDFS filesystem as the HBase
cluster.  In the remote case, the HBase being loaded into will be on a
different HDFS filesystem.

What I'm wondering about is the best pattern for the job preparing the
bulk load to determine where to write its HFiles.
Typical examples write the HFiles somewhere in the user's home directory.
When HBase is local, that works perfectly well.  With remote HBase, it can
work, but it results in the files being written twice: once by the
preparation job, and a second time by the RegionServer when it reacts to
the bulk load by copying the HFiles into the filesystem it is running on.

Ideally the preparation job would have some mechanism to know where to
write the files such that they are initially written on the same filesystem
as HBase itself.  This way the bulk load can simply move them into the
HBase storage directory, as happens when bulk loading into a local cluster.

I've considered a pattern where the bulk load preparation job reads the
hbase.rootdir property and derives the filesystem from it, then writes the
output to some directory (e.g. /tmp) on that same filesystem.  However, I'm
inclined to think hbase.rootdir should be treated as a server-side
property, so I shouldn't expect it to be present in client configuration.
Under that assumption, this isn't really a workable strategy.
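
For illustration, here is a minimal sketch of that approach, assuming
hbase.rootdir happened to be present in the client configuration (which,
as I said, I don't think can be relied on); the /tmp staging path is
arbitrary:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class StagingFromRootDir {
  // Derive a staging directory on the same filesystem as hbase.rootdir.
  public static Path stagingDir(Configuration conf) throws IOException {
    // hbase.rootdir is normally a server-side setting; this assumes the
    // client configuration happens to carry it.
    Path rootDir = new Path(conf.get("hbase.rootdir"));
    FileSystem hbaseFs = rootDir.getFileSystem(conf);
    // Stage the HFiles on HBase's own filesystem so the bulk load is a
    // same-filesystem rename rather than a cross-filesystem copy.
    return hbaseFs.makeQualified(new Path("/tmp/bulkload-staging"));
  }

  public static void main(String[] args) throws IOException {
    System.out.println(stagingDir(HBaseConfiguration.create()));
  }
}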

It feels like HBase should have a mechanism for sharing a staging directory
with clients doing bulk loads.  Doing some searching, I ran across
"hbase.bulkload.staging.dir", but my impression is that its intent does not
exactly align with mine.  I've read about it here [1].  It seems the idea
is that users prepare HFiles in their own directory, then SecureBulkLoad
moves them to "hbase.bulkload.staging.dir".  With a remote HBase cluster,
a move like that isn't really a move; it's a copy.  An obvious question,
then, is why the job doesn't just write the files to
"hbase.bulkload.staging.dir" in the first place and skip the extra step of
moving them.
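
Here's a rough sketch of what I mean, assuming that
hbase.bulkload.staging.dir were propagated to client configuration (which
I don't believe it is by default); the job name and per-job subdirectory
are just illustrative:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrep {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "bulk-load-prep");
    // Assumes the staging dir made it into the client-side config.
    String staging = conf.get("hbase.bulkload.staging.dir");
    // Per-job subdirectory under the shared staging dir.
    Path out = new Path(staging, "prep-" + System.currentTimeMillis());
    FileOutputFormat.setOutputPath(job, out);
    // ... configure the HFile-writing output format and reducers as
    // usual, then submit the job.
  }
}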

I've been inclined to invent my own application-specific Hadoop property
to communicate an HBase-local staging directory to my bulk load
preparation jobs (sketched below).  I don't feel entirely good about that
idea, though, and I'm curious to hear experiences or opinions from others.
Should I have my bulk load prep jobs look at "hbase.rootdir" or
"hbase.bulkload.staging.dir", and make sure those get propagated to client
configuration?  Is there some existing mechanism for clients to discover
an HBase-local directory to write the files to?
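
For completeness, the application-specific property idea would look
something like this; the property name myapp.hbase.staging.dir is purely
hypothetical, and our own deployment tooling would have to populate it
with an HBase-local path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class AppStagingDir {
  public static Path stagingDir() {
    Configuration conf = HBaseConfiguration.create();
    // Invented, application-specific property naming an HBase-local
    // staging directory; nothing in HBase itself sets or reads this.
    String staging = conf.get("myapp.hbase.staging.dir");
    if (staging == null) {
      throw new IllegalStateException("myapp.hbase.staging.dir not set");
    }
    return new Path(staging);
  }
}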

[1] http://hbase.apache.org/book.html#hbase.secure.bulkload
