I'm not sure you understand my question.  Or perhaps I just don't quite
understand yours?

I'm not using importtsv.  If I were, and I were using the form that
prepares StoreFiles for completebulkload, then my question would be: how
do I (generically, as an application acting as an HBase client and using
importtsv to load data) choose the path to which I write the StoreFiles?

The following is an example of importtsv from the documentation:

bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=a,b,c \
  -Dimporttsv.bulk.output=hdfs://storefile-outputdir \
  <tablename> <hdfs-data-inputdir>

How do I choose hdfs://storefile-outputdir in a way that avoids an extra
copy operation when completebulkload is invoked, without assuming
knowledge of HBase server implementation details?

In essence, how does my client application know that it should write to
hdfs://cluster2 even though the application is running in a context where
fs.defaultFS is hdfs://cluster1?
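
To make the gap concrete, here is a minimal sketch of the two resolution
behaviors (cluster names as above; the output directories are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereDoIWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // An unqualified path resolves against fs.defaultFS,
    // i.e. hdfs://cluster1 in this scenario.
    FileSystem defaultFs = FileSystem.get(conf);
    System.out.println(defaultFs.makeQualified(new Path("/tmp/storefiles")));
    // A fully qualified path targets the remote cluster explicitly,
    // but the client has to learn "hdfs://cluster2" from somewhere.
    Path remote = new Path("hdfs://cluster2/tmp/storefiles");
    System.out.println(remote.getFileSystem(conf).makeQualified(remote));
  }
}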

How does the HBase installation share this information with client
applications?

I know I can just go dig into the hbase-site.xml on a RegionServer and
figure this out (such as by looking at "hbase.rootdir" there), but my
question is how to do this from the perspective of a generic HBase client
application.
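
For illustration, digging that value out and deriving the filesystem would
look something like the sketch below.  It only works if "hbase.rootdir" is
visible to the client, which is exactly the assumption I'd rather not make
(the staging subdirectory is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HBaseLocalStaging {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // conf.get returns null when hbase.rootdir is absent from the
    // client configuration, and new Path(null) throws.
    Path rootDir = new Path(conf.get("hbase.rootdir"));
    FileSystem hbaseFs = rootDir.getFileSystem(conf);
    // Staging StoreFiles on the filesystem HBase itself runs on lets the
    // bulk load move them with a rename rather than a copy.
    Path staging = hbaseFs.makeQualified(new Path("/tmp/bulkload-staging"));
    System.out.println("Write StoreFiles under: " + staging);
  }
}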

On Wed, Mar 8, 2017 at 11:13 PM ashish singhi <ashish.sin...@huawei.com>
wrote:

> Hi,
>
> Did you try pointing the importtsv output path at the remote HDFS?
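>
> For example, pass a fully qualified URI on the remote cluster's HDFS as
> the importtsv.bulk.output value.  In a custom preparation job, the
> equivalent would be something like this sketch (cluster name and
> directory hypothetical):
>
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
> class RemoteBulkOutput {
>   // Point the prep job's output at the remote HBase cluster's HDFS so
>   // that the bulk load can move the StoreFiles instead of copying them.
>   static void setRemoteOutput(Job job) {
>     FileOutputFormat.setOutputPath(job,
>         new Path("hdfs://cluster2/tmp/storefiles"));
>   }
> }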
>
> Regards,
> Ashish
>
> -----Original Message-----
> From: Ben Roling [mailto:ben.rol...@gmail.com]
> Sent: 09 March 2017 03:22
> To: user@hbase.apache.org
> Subject: Pattern for Bulk Loading to Remote HBase Cluster
>
> My organization is looking at making some changes that would introduce
> HBase bulk loads that write into a remote cluster.  Today our bulk loads
> write to a local HBase.  By local, I mean the home directory of the user
> preparing and executing the bulk load is on the same HDFS filesystem as the
> HBase cluster.  In the remote cluster case, the HBase being loaded to will
> be on a different HDFS filesystem.
>
> The thing I am wondering about is the best pattern for the job preparing
> the bulk load to determine where to write its HFiles.
> Typical examples write the HFiles somewhere in the user's home directory.
> When HBase is local, that works perfectly well.  With remote HBase, it can
> work, but results in writing the files twice: once from the preparation job
> and a second time by the RegionServer when it reacts to the bulk load by
> copying the HFiles into the filesystem it is running on.
>
> Ideally the preparation job would have some mechanism to know where to
> write the files such that they are initially written on the same filesystem
> as HBase itself.  This way the bulk load can simply move them into the
> HBase storage directory, as happens when bulk loading to a local cluster.
>
> I've considered a pattern where the bulk load preparation job reads the
> hbase.rootdir property and pulls the filesystem off of that.  Then, it
> sticks the output in some directory (e.g. /tmp) on that same filesystem.
> I'm inclined to think that hbase.rootdir should only be considered a
> server-side property and as such I shouldn't expect it to be present in
> client configuration.  Under that assumption, this isn't really a workable
> strategy.
>
> It feels like HBase should have a mechanism for sharing a staging
> directory with clients doing bulk loads.  Doing some searching, I ran
> across "hbase.bulkload.staging.dir", but my impression is that its intent
> does not exactly align with mine.  I've read about it here [1].  It seems
> the idea is that users prepare HFiles in their own directory, then
> SecureBulkLoad moves them to "hbase.bulkload.staging.dir".  A move like
> that isn't really a move when dealing with a remote HBase cluster.  Instead
> it is a copy.  A question would be why doesn't the job just write the files
> to "hbase.bulkload.staging.dir" initially and skip the extra step of moving
> them?
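>
> For the sake of argument, writing there directly would be simple enough,
> assuming the property were even visible client-side (the subdirectory
> name is hypothetical):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hbase.HBaseConfiguration;
>
> class StagingDirOutput {
>   static Path stagingOutput() {
>     Configuration conf = HBaseConfiguration.create();
>     // Only meaningful if hbase.bulkload.staging.dir is propagated to
>     // the client configuration, which is not a given.
>     String staging = conf.get("hbase.bulkload.staging.dir");
>     return new Path(staging, "myjob-" + System.currentTimeMillis());
>   }
> }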
>
> I've been inclined to invent my own application-specific Hadoop property
> to communicate an HBase-local staging directory to my bulk load
> preparation jobs.  I don't feel perfectly good about that idea, though.  I'm
> curious to hear experiences or opinions from others.  Should I have my bulk
> load prep jobs look at "hbase.rootdir" or "hbase.bulkload.staging.dir" and
> make sure those get propagated to client configuration?  Is there some
> other mechanism that already exists for clients to discover an HBase-local
> directory to write the files?
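>
> Concretely, that workaround might look like the following sketch (the
> property name "myapp.hbase.staging.dir" is invented for illustration):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hbase.HBaseConfiguration;
>
> class AppStagingDir {
>   static Path outputDir(String jobId) {
>     Configuration conf = HBaseConfiguration.create();
>     // Ops set this invented key, in the client-side configuration we
>     // deploy, to a directory on the HBase cluster's filesystem.
>     String dir = conf.get("myapp.hbase.staging.dir");
>     return new Path(dir, "bulkload-" + jobId);
>   }
> }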
>
> [1] http://hbase.apache.org/book.html#hbase.secure.bulkload
>
