Re: Parallel Scanner

Richard Startin Mon, 20 Feb 2017 03:16:26 -0800

For a client-only solution, have you looked at the RegionLocator interface? It 
gives you the start and stop keys of each region as byte[] values. 
You can easily run a ForkJoinPool recursive task or a Java 8 parallel stream over 
those key pairs. I implemented a Spark RDD to do that and wrote about it, with code 
samples, here:


https://richardstartin.com/2016/11/07/co-locating-spark-partitions-with-hbase-regions/

Forget about the Spark details in the post (and forget that Hortonworks has a 
library to do the same thing :)). The idea of creating one scan per region and 
setting the scan start and stop rows from the region locator would give you a 
parallel scan. Note that you can also group the scans by region server.
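
As a rough illustration of that idea, here is a JDK-only sketch. It is not the code from the blog post and it does not call HBase: a TreeMap stands in for the table, String keys stand in for byte[] row keys, and the region list is hard-coded where real code would build it from the RegionLocator. The names ParallelRegionScanSketch, RegionRange, scanRange, parallelScan and groupByServer are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.stream.Collectors;

// Hypothetical sketch: one "scan" per region, fanned out with a parallel stream.
class ParallelRegionScanSketch {

    // One region: hosting server, start key (inclusive), stop key (exclusive).
    // An empty key means "unbounded", mirroring HBase's empty byte[] convention.
    static final class RegionRange {
        final String server;
        final String start;
        final String stop;

        RegionRange(String server, String start, String stop) {
            this.server = server;
            this.start = start;
            this.stop = stop;
        }
    }

    // Stand-in for opening a scanner limited to the region's start and stop rows.
    static List<String> scanRange(NavigableMap<String, String> table, RegionRange r) {
        NavigableMap<String, String> view = table;
        if (!r.start.isEmpty()) {
            view = view.tailMap(r.start, true);
        }
        if (!r.stop.isEmpty()) {
            view = view.headMap(r.stop, false);
        }
        return new ArrayList<>(view.values());
    }

    // One task per region; the table is only read, never modified, while scanning.
    static List<String> parallelScan(NavigableMap<String, String> table,
                                     List<RegionRange> regions) {
        return regions.parallelStream()
                .map(r -> scanRange(table, r))
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    // Optional: group the per-region scans by the server that hosts each region.
    static Map<String, List<RegionRange>> groupByServer(List<RegionRange> regions) {
        return regions.stream().collect(Collectors.groupingBy(r -> r.server));
    }
}
```

Grouping by server would let you cap the number of concurrent scans per region server, which is one reason to do it.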

Cheers,
Richard

On 20 Feb 2017, at 07:33, Anil <anilk...@gmail.com> wrote:

Thanks Ram. I will look into EndPoints.

On 20 February 2017 at 12:29, ramkrishna vasudevan
<ramkrishna.s.vasude...@gmail.com> wrote:

Yes, there is a way.

Have you seen Endpoints? Endpoints are trigger-like hooks that let your
client invoke them in parallel on one or more regions, using the start and
end key of each region. They execute in parallel, and then you may have to
sort out the results as per your need.
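
To make the calling pattern concrete, here is a JDK-only sketch and not real coprocessor code: each "region" computes a small partial result in parallel and the client combines the partials. The class EndpointStyleAggregation and the method sumOverRegions are hypothetical, and the regionWork function only stands in for whatever an actual endpoint would compute on the region server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.ToLongFunction;

// Hypothetical sketch of the pattern: per-region work in parallel, combine on the client.
class EndpointStyleAggregation {

    // regionWork stands in for the server-side endpoint (for example, a row count).
    // It returns a small value instead of shipping rows back to the client.
    static <R> long sumOverRegions(List<R> regions, ToLongFunction<R> regionWork) {
        List<CompletableFuture<Long>> partials = new ArrayList<>();
        for (R region : regions) {
            // Each region's work is submitted asynchronously, so regions run in parallel.
            partials.add(CompletableFuture.supplyAsync(() -> regionWork.applyAsLong(region)));
        }
        long total = 0;
        for (CompletableFuture<Long> partial : partials) {
            total += partial.join(); // client-side step: combine the per-region results
        }
        return total;
    }
}
```

A sum needs no ordering; for results that must stay ordered, the combining step would have to merge or sort the partial results instead.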

But these endpoints have to be running on your region servers, so it is not a
client-only solution.
https://blogs.apache.org/hbase/entry/coprocessor_introduction

Be careful when you use them. Since these endpoints run on the server, ensure
that they are not heavy and do not consume much memory, which can have
adverse effects on the server.


Regards
Ram

On Mon, Feb 20, 2017 at 12:18 PM, Anil <anilk...@gmail.com> wrote:

Thanks Ram.

So you mean that there is no harm in using HTable#getRegionsInRange in
the application code?

HTable#getRegionsInRange returned a single entry for all my region start
and end keys. I need to explore this further.

"If you know the table region's start and end keys you could create
parallel scans in your application code." - Is there any way to scan a
region in the application code other than the one I put in the original
email?

"One thing to watch out for is that if there is a split in the region then
this start and end row may change, so in that case it is better to get
the regions every time before you issue a scan"
- Agreed. I am dynamically determining the region start and end keys
before initiating scan operations for every initial load.

Thanks.




On 20 February 2017 at 10:59, ramkrishna vasudevan
<ramkrishna.s.vasude...@gmail.com> wrote:

Hi Anil,

HBase does not directly provide parallel scans. If you know the table
region's start and end keys you could create parallel scans in your
application code.

In the above code snippet, the intent is right - you get the required
regions and can issue parallel scans from your app.

One thing to watch out for is that if there is a split in the region then
this start and end row may change, so in that case it is better to get
the regions every time before you issue a scan. Does that make sense to
you?
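
A minimal JDK-only sketch of that advice: the scan plan is rebuilt from a fresh region lookup on every load instead of from a cached list. The names FreshBoundariesSketch and planLoad are hypothetical, String keys stand in for byte[] row keys, and the Supplier stands in for a live lookup against the cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: re-read the region boundaries before every load.
class FreshBoundariesSketch {

    // regionLookup returns the current {startKey, endKey} pairs each time it is called.
    static List<String> planLoad(Supplier<List<String[]>> regionLookup) {
        List<String> plannedScans = new ArrayList<>();
        // The lookup happens here, on every call, so a split since the last load is seen.
        for (String[] region : regionLookup.get()) {
            plannedScans.add("[" + region[0] + ", " + region[1] + ")");
        }
        return plannedScans;
    }
}
```

If a region splits between two loads, the second call plans one scan per daughter region rather than reusing the stale parent range.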

Regards
Ram

On Sat, Feb 18, 2017 at 1:44 PM, Anil <anilk...@gmail.com> wrote:

Hi,

I am building a use case where I have to load HBase data into an
in-memory database (IMDB). I am scanning each region and loading the data
into the IMDB.

I am looking at the parallel scanner
( https://issues.apache.org/jira/browse/HBASE-8504, HBASE-1935 ) to reduce
the load time, but HTable#getRegionsInRange(byte[] startKey, byte[] endKey,
boolean reload) is deprecated and HBASE-1935 is still open.

I see that the Connection from ConnectionFactory is an
HConnectionImplementation by default and creates an HTable instance.

Do you see any issues in using HTable from the Table instance?
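           // Look up the regions that overlap the scan, reloading the cached locations.
           List<HRegionLocation> regions =
               hTable.getRegionsInRange(scans.getStartRow(), scans.getStopRow(), true);

           int i = 0;
           for (HRegionLocation region : regions) {
               // First region: start at the scan's start row, not the region's start key.
               byte[] startRow = i == 0 ? scans.getStartRow() : region.getRegionInfo().getStartKey();
               i++;
               // Last region: stop at the scan's stop row, not the region's end key.
               byte[] endRow = i == regions.size() ? scans.getStopRow() : region.getRegionInfo().getEndKey();
               // ... issue one scan over [startRow, endRow) for this region ...
           }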
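
The clipping logic in the loop above can be tried without a cluster. The following is a JDK-only sketch under stated assumptions: RegionRangeClipper and clip are hypothetical names, and the input list is assumed to hold only the regions that overlap the scan, in key order, as {startKey, endKey} pairs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: per-region [startRow, stopRow) pairs clipped to the scan's bounds.
class RegionRangeClipper {

    // regionBounds: one {startKey, endKey} pair per overlapping region, in key order.
    // Returns one {startRow, stopRow} pair per region.
    static List<byte[][]> clip(List<byte[][]> regionBounds, byte[] scanStart, byte[] scanStop) {
        List<byte[][]> ranges = new ArrayList<>();
        for (int i = 0; i < regionBounds.size(); i++) {
            byte[][] region = regionBounds.get(i);
            // First region: begin at the scan's start row rather than the region's start key.
            byte[] start = (i == 0) ? scanStart : region[0];
            // Last region: end at the scan's stop row rather than the region's end key.
            byte[] stop = (i == regionBounds.size() - 1) ? scanStop : region[1];
            ranges.add(new byte[][] {start, stop});
        }
        return ranges;
    }
}
```

Each returned pair can then be handed to its own scan task. When only one region overlaps the scan, the single pair is simply the scan's own start and stop rows.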

Are there any alternatives to achieve a parallel scan? Thanks.

Thanks
