Re: Parallel Scanner

Ted Yu Mon, 20 Feb 2017 07:54:59 -0800

Among the 5 columns, do you know roughly the data distribution ?

You should put the columns whose data distribution is relatively even
first. Of course, there may be business requirements that you need to take
into consideration w.r.t. the composite key.


If you cannot change the schema, do you have control over the region size ?
Smaller regions may lower the variance in data distribution per region.
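
For the client-side split discussed further down the thread (cutting one
start/stop key range into disjoint lexicographic buckets), here is a minimal
sketch in plain Java. KeyRangeSplitter and its method names are hypothetical,
not part of HBase. It assumes both keys are non-empty and that stopKey sorts
after startKey, and it interpolates evenly over the key space, so the buckets
will only be balanced if the data is roughly uniform over that space.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not an HBase class): cuts [startKey, stopKey) into buckets. */
final class KeyRangeSplitter {

    /**
     * Returns boundary keys in ascending order. Bucket i is
     * [boundaries[i], boundaries[i + 1]). At most n buckets are produced;
     * fewer if the key range is too narrow to cut n ways.
     */
    static List<byte[]> split(byte[] startKey, byte[] stopKey, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        // Right-pad both keys with 0x00 to a common width so they can be
        // treated as unsigned integers of the same size.
        int width = Math.max(startKey.length, stopKey.length);
        BigInteger lo = new BigInteger(1, Arrays.copyOf(startKey, width));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(stopKey, width));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("stopKey must sort after startKey");
        }
        BigInteger span = hi.subtract(lo);

        List<byte[]> boundaries = new ArrayList<>();
        boundaries.add(startKey.clone());
        BigInteger previous = lo;
        for (int i = 1; i < n; i++) {
            // Evenly spaced cut point: lo + span * i / n.
            BigInteger cut = lo.add(
                    span.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n)));
            // Skip cut points that collapse onto the previous boundary.
            if (cut.compareTo(previous) > 0) {
                boundaries.add(toFixedWidth(cut, width));
                previous = cut;
            }
        }
        boundaries.add(stopKey.clone());
        return boundaries;
    }

    /** Converts a non-negative value back into exactly {@code width} bytes. */
    private static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray(); // may carry a leading sign byte or be shorter
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }
}
```

Each adjacent pair of boundaries would then become the start and stop row of
one scan. With a skewed composite key the slowest bucket still dominates the
total time, which is why the distribution question matters.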

On Mon, Feb 20, 2017 at 7:47 AM, Anil <anilk...@gmail.com> wrote:

> Hi Ted,
>
> Current region size is 10 GB.
>
> The HBase row key is designed like a Phoenix primary key; it is effectively
> a 5-column composite key. Rows for a common set of data share the same
> first prefix. I am not sure how to convey the data distribution.
>
> Thanks.
>
> On 20 February 2017 at 20:48, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > Anil:
> > What's the current region size you use ?
> >
> > Given a region, do you have some idea how the data is distributed within
> > the region ?
> >
> > Cheers
> >
> > On Mon, Feb 20, 2017 at 7:14 AM, Anil <anilk...@gmail.com> wrote:
> >
> > > > I understand my original post now :)  Sorry about that.
> > > >
> > > > Now the challenge is to split a start key and end key on the client
> > > > side to allow parallel scans on a table with no buckets or pre-salting.
> > >
> > > Thanks.
> > >
> > > On 20 February 2017 at 20:21, ramkrishna vasudevan <
> > > ramkrishna.s.vasude...@gmail.com> wrote:
> > >
> > > > > You are trying to scan one region itself in parallel; then even I got
> > > > > you wrong. Richard's suggestion is the right choice for a client-only
> > > > > solution.
> > > >
> > > > On Mon, Feb 20, 2017 at 7:40 PM, Anil <anilk...@gmail.com> wrote:
> > > >
> > > > > Thanks Richard :)
> > > > >
> > > > > > On 20 February 2017 at 18:56, Richard Startin
> > > > > > <richardstar...@outlook.com> wrote:
> > > > >
> > > > > > > RegionLocator is not deprecated, hence the suggestion to use it if
> > > > > > > it's available in place of whatever is still available on HTable for
> > > > > > > your version of HBase - it will make upgrades easier. For instance
> > > > > > > HTable::getRegionsInRange no longer exists on the current master
> > > > > > > branch.
> > > > > >
> > > > > >
> > > > > > "I am trying to scan a region in parallel :)"
> > > > > >
> > > > > >
> > > > > > > I thought you were asking about scanning many regions at the same
> > > > > > > time, not scanning a single region in parallel? HBASE-1935 is about
> > > > > > > parallelising scans over regions, not within regions.
> > > > > >
> > > > > >
> > > > > > > If you want to parallelise within a region, you could write a little
> > > > > > > method to split the range between the first and last key of the
> > > > > > > region into several disjoint lexicographic buckets and create a scan
> > > > > > > for each bucket, then execute those scans in parallel. Your data
> > > > > > > probably doesn't distribute uniformly over lexicographic buckets
> > > > > > > though, so the scans are unlikely to execute at a constant rate and
> > > > > > > you'll get results in time proportional to the lexicographic bucket
> > > > > > > with the highest cardinality in the region. I'd be interested to know
> > > > > > > if anyone on the list has ever tried this and what the results were?
> > > > > >
> > > > > >
> > > > > > > Using the much simpler approach of parallelising over regions by
> > > > > > > creating multiple disjoint scans client side, as suggested, your
> > > > > > > performance now depends on your regions, which you have some control
> > > > > > > over. You can achieve the same effect by pre-splitting your table
> > > > > > > such that you empirically optimise read performance for the dataset
> > > > > > > you store.
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Richard
> > > > > >
> > > > > >
> > > > > > ________________________________
> > > > > > From: Anil <anilk...@gmail.com>
> > > > > > Sent: 20 February 2017 12:35
> > > > > > To: user@hbase.apache.org
> > > > > > Subject: Re: Parallel Scanner
> > > > > >
> > > > > > Thanks Richard.
> > > > > >
> > > > > > > I am able to get the regions for data to be loaded from table. I am
> > > > > > > trying to scan a region in parallel :)
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > > On 20 February 2017 at 16:44, Richard Startin
> > > > > > > <richardstar...@outlook.com> wrote:
> > > > > >
> > > > > > > > For a client only solution, have you looked at the RegionLocator
> > > > > > > > interface? It gives you a list of pairs of byte[] (the start and
> > > > > > > > stop keys for each region). You can easily use a ForkJoinPool
> > > > > > > > recursive task or Java 8 parallel stream over that list. I
> > > > > > > > implemented a Spark RDD to do that and wrote about it with code
> > > > > > > > samples here:
> > > > > > > >
> > > > > > > > https://richardstartin.com/2016/11/07/co-locating-spark-partitions-with-hbase-regions/
> > > > > > > >
> > > > > > > > Forget about the Spark details in the post (and forget that
> > > > > > > > Hortonworks have a library to do the same thing :)) the idea of
> > > > > > > > creating one scan per region and setting scan starts and stops from
> > > > > > > > the region locator would give you a parallel scan. Note you can
> > > > > > > > also group the scans by region server.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Richard
> > > > > > > > On 20 Feb 2017, at 07:33, Anil <anilk...@gmail.com> wrote:
> > > > > > >
> > > > > > > Thanks Ram. I will look into EndPoints.
> > > > > > >
> > > > > > > > On 20 February 2017 at 12:29, ramkrishna vasudevan
> > > > > > > > <ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Yes. There is a way.
> > > > > > > >
> > > > > > > > Have you seen Endpoints? Endpoints are trigger-like points that
> > > > > > > > your client can invoke in parallel on one or more regions using the
> > > > > > > > start and end key of the region. They execute in parallel and then
> > > > > > > > you may have to sort out the results as per your need.
> > > > > > > >
> > > > > > > > But these endpoints have to be running on your region servers, so
> > > > > > > > it is not a client-only solution.
> > > > > > > > https://blogs.apache.org/hbase/entry/coprocessor_introduction
> > > > > > >
> > > > > > > > Be careful when you use them. Since these endpoints run on the
> > > > > > > > server, ensure that they are not heavy and do not consume much
> > > > > > > > memory, which can have adverse effects on the server.
> > > > > > >
> > > > > > >
> > > > > > > Regards
> > > > > > > Ram
> > > > > > >
> > > > > > > > On Mon, Feb 20, 2017 at 12:18 PM, Anil <anilk...@gmail.com> wrote:
> > > > > > >
> > > > > > > Thanks Ram.
> > > > > > >
> > > > > > > > So, you mean that there is no harm in using
> > > > > > > > HTable#getRegionsInRange in the application code.
> > > > > > > >
> > > > > > > > HTable#getRegionsInRange returned a single entry for all my region
> > > > > > > > start and end keys. I need to explore this more.
> > > > > > > >
> > > > > > > > "If you know the table region's start and end keys you could create
> > > > > > > > parallel scans in your application code."  - is there any way to
> > > > > > > > scan a region in the application code other than the one I put in
> > > > > > > > the original email ?
> > > > > > > >
> > > > > > > > "One thing to watch out is that if there is a split in the region
> > > > > > > > then this start and end row may change so in that case it is better
> > > > > > > > you try to get the regions every time before you issue a scan"
> > > > > > > > - Agree. I am dynamically determining the region start key and end
> > > > > > > > key before initiating scan operations for every initial load.
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On 20 February 2017 at 10:59, ramkrishna vasudevan
> > > > > > > > <ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi Anil,
> > > > > > >
> > > > > > > > HBase does not directly provide parallel scans. If you know the
> > > > > > > > table region's start and end keys you could create parallel scans
> > > > > > > > in your application code.
> > > > > > > >
> > > > > > > > In the above code snippet, the intent is right - you get the
> > > > > > > > required regions and can issue parallel scans from your app.
> > > > > > > >
> > > > > > > > One thing to watch out is that if there is a split in the region
> > > > > > > > then this start and end row may change so in that case it is better
> > > > > > > > you try to get the regions every time before you issue a scan. Does
> > > > > > > > that make sense to you?
> > > > > > >
> > > > > > > Regards
> > > > > > > Ram
> > > > > > >
> > > > > > > > On Sat, Feb 18, 2017 at 1:44 PM, Anil <anilk...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi ,
> > > > > > >
> > > > > > > > I am building a use case where I have to load HBase data into an
> > > > > > > > in-memory database (IMDB). I am scanning each region and loading
> > > > > > > > the data into the IMDB.
> > > > > > > >
> > > > > > > > I am looking at the parallel scanner
> > > > > > > > ( https://issues.apache.org/jira/browse/HBASE-8504, HBASE-1935 )
> > > > > > > > to reduce the load time.
> > > > > > > > HTable#getRegionsInRange(byte[] startKey, byte[] endKey, boolean reload)
> > > > > > > > is deprecated, and HBASE-1935 is still open.
> > > > > > >
> > > > > > > > I see Connection from ConnectionFactory is HConnectionImplementation
> > > > > > > > by default and creates HTable instance.
> > > > > > >
> > > > > > > > Do you see any issues in using HTable from Table instance ?
> > > > > > > >
> > > > > > > >     for each region {
> > > > > > > >         // Regions overlapping the scan range (reload = true).
> > > > > > > >         List<HRegionLocation> regions = hTable.getRegionsInRange(
> > > > > > > >                 scans.getStartRow(), scans.getStopRow(), true);
> > > > > > > >         int i = 0;
> > > > > > > >         for (HRegionLocation region : regions) {
> > > > > > > >             // The first region is clipped to the scan's start row.
> > > > > > > >             byte[] startRow = i == 0
> > > > > > > >                     ? scans.getStartRow()
> > > > > > > >                     : region.getRegionInfo().getStartKey();
> > > > > > > >             i++;
> > > > > > > >             // The last region is clipped to the scan's stop row.
> > > > > > > >             byte[] endRow = i == regions.size()
> > > > > > > >                     ? scans.getStopRow()
> > > > > > > >                     : region.getRegionInfo().getEndKey();
> > > > > > > >         }
> > > > > > > >     }
> > > > > > >
> > > > > > > are there any alternatives to achieve parallel scan? Thanks.
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
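
The per-region range computation in the original post quoted above can also be
written without any HBase types, which makes it easy to test on its own. A
minimal sketch follows, assuming the region boundaries have already been
fetched (for example from RegionLocator) and are passed in table order.
RegionRangeClipper, clip and compare are hypothetical names, and an empty
byte[] is taken to mean "open-ended", following the HBase convention for the
first region's start key and the last region's end key.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an HBase class): clips a scan range to region boundaries. */
final class RegionRangeClipper {

    /** Unsigned lexicographic comparison of two row keys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /**
     * @param regionBounds one {startKey, endKey} pair per region, in table order
     * @return one {start, stop} pair per region overlapping [scanStart, scanStop)
     */
    static List<byte[][]> clip(List<byte[][]> regionBounds, byte[] scanStart, byte[] scanStop) {
        List<byte[][]> ranges = new ArrayList<>();
        for (byte[][] region : regionBounds) {
            byte[] regionStart = region[0];
            byte[] regionEnd = region[1];
            // Region ends at or before the scan starts: no overlap.
            if (regionEnd.length > 0 && compare(regionEnd, scanStart) <= 0) {
                continue;
            }
            // Region starts at or after the scan stops: no overlap.
            if (scanStop.length > 0 && compare(regionStart, scanStop) >= 0) {
                continue;
            }
            // Start is the later of the two start keys (empty sorts first).
            byte[] start = compare(regionStart, scanStart) > 0 ? regionStart : scanStart;
            // Stop is the earlier of the two end keys, treating empty as "no limit".
            byte[] stop;
            if (regionEnd.length == 0) {
                stop = scanStop;
            } else if (scanStop.length == 0) {
                stop = regionEnd;
            } else {
                stop = compare(regionEnd, scanStop) < 0 ? regionEnd : scanStop;
            }
            ranges.add(new byte[][] {start, stop});
        }
        return ranges;
    }
}
```

Each returned pair would become the start and stop row of one scan, and the
scans can be submitted to an executor or a parallel stream. As noted in the
thread, the boundaries should be fetched again before each load because a
region split changes them.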
