Re: Parallel Scanner

2017-02-20 Thread Anil
> [quoted text trimmed]

Re: Parallel Scanner

2017-02-20 Thread Ted Yu
> [quoted text trimmed]

Re: Parallel Scanner

2017-02-20 Thread Anil
Hi Ted,

It's very difficult to predict the data distribution. We store parent-to-child
relationships in the table. (Note: a parent record is a child of itself.)

We set the max HRegion file size to 10 GB. I don't think we have any control
over the region size :(

Thanks


On 20 February 2017 at 21:24, Ted Yu <yuzhih...@gmail.com> wrote:

> [quoted text trimmed]

Re: Parallel Scanner

2017-02-20 Thread Ted Yu
Among the 5 columns, do you know roughly the data distribution?

You should put the columns whose data distribution is relatively even first.
Of course, there may be business requirements that you take into
consideration w.r.t. the composite key.

If you cannot change the schema, do you have control over the region size?
Smaller regions may lower the variance in data distribution per region.
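
[Editorial note: the pre-splitting idea above can be sketched in plain Java. The snippet below computes evenly spaced split keys over a fixed-width keyspace, the same idea as the UniformSplit algorithm in HBase's RegionSplitter; such keys could then be passed as the `splitKeys` argument of `Admin.createTable`. This is an illustrative sketch with no HBase dependency, and the fixed key width is an assumption for the example.]

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class PreSplit {
    /**
     * Returns numRegions - 1 split keys that divide the full keyspace of
     * keyWidth-byte keys into numRegions regions of equal keyspace width.
     */
    static List<byte[]> uniformSplits(int numRegions, int keyWidth) {
        BigInteger range = BigInteger.valueOf(256).pow(keyWidth); // keyspace size
        List<byte[]> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            BigInteger point = range.multiply(BigInteger.valueOf(i))
                                    .divide(BigInteger.valueOf(numRegions));
            splits.add(toFixedWidth(point, keyWidth));
        }
        return splits;
    }

    // Left-pad the big-endian encoding to the fixed key width.
    static byte[] toFixedWidth(BigInteger v, int width) {
        byte[] raw = v.toByteArray();
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    public static void main(String[] args) {
        // 4 regions over 2-byte keys -> split points at 0x4000, 0x8000, 0xc000
        for (byte[] split : uniformSplits(4, 2)) {
            StringBuilder sb = new StringBuilder();
            for (byte b : split) sb.append(String.format("%02x", b));
            System.out.println(sb);
        }
    }
}
```

Note this only spreads the *keyspace* evenly; as discussed in this thread, if the row keys themselves are skewed, the regions will still hold uneven amounts of data.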

On Mon, Feb 20, 2017 at 7:47 AM, Anil <anilk...@gmail.com> wrote:

> [quoted text trimmed]

Re: Parallel Scanner

2017-02-20 Thread Anil
Hi Ted,

The current region size is 10 GB.

The HBase row key is designed like a Phoenix primary key. You can think of it
as a 5-column composite key, so a common set of data shares the same first
prefix. I am not sure how to convey the data distribution.

Thanks.

On 20 February 2017 at 20:48, Ted Yu <yuzhih...@gmail.com> wrote:

> [quoted text trimmed]

Re: Parallel Scanner

2017-02-20 Thread Ted Yu
Anil:
What's the current region size you use?

Given a region, do you have some idea of how the data is distributed within
the region?

Cheers

On Mon, Feb 20, 2017 at 7:14 AM, Anil <anilk...@gmail.com> wrote:

> [quoted text trimmed]

Re: Parallel Scanner

2017-02-20 Thread Anil
I understand my original post now :) Sorry about that.

Now the challenge is to split a start key and an end key on the client side
to allow parallel scans on a table with no buckets or pre-salting.

Thanks.
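
[Editorial note: the client-side range split described above can be sketched as follows. HBase ships a `Bytes.split(start, stop, num)` utility that does this; the plain-Java version below mirrors the idea by interpolating between the keys viewed as unsigned big-endian integers, so it runs without a cluster. The example keys are illustrative assumptions.]

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class RangeSplitter {
    /**
     * Splits the key range [start, stop) of a single region into numBuckets
     * disjoint, contiguous lexicographic sub-ranges, one per parallel Scan.
     */
    static List<byte[][]> split(byte[] start, byte[] stop, int numBuckets) {
        int width = Math.max(start.length, stop.length);
        BigInteger lo = new BigInteger(1, pad(start, width));
        BigInteger hi = new BigInteger(1, pad(stop, width));
        List<byte[][]> buckets = new ArrayList<>();
        byte[] prev = start;
        for (int i = 1; i <= numBuckets; i++) {
            BigInteger point = lo.add(hi.subtract(lo)
                    .multiply(BigInteger.valueOf(i))
                    .divide(BigInteger.valueOf(numBuckets)));
            byte[] next = (i == numBuckets) ? stop : toBytes(point, width);
            buckets.add(new byte[][]{prev, next}); // [startRow, stopRow) pair
            prev = next;
        }
        return buckets;
    }

    // Right-pad with 0x00 so both keys share a common width (lexicographic order preserved).
    static byte[] pad(byte[] key, int width) {
        byte[] out = new byte[width];
        System.arraycopy(key, 0, out, 0, key.length);
        return out;
    }

    // Left-pad the big-endian encoding back to the fixed width.
    static byte[] toBytes(BigInteger v, int width) {
        byte[] raw = v.toByteArray();
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    public static void main(String[] args) {
        // Split [0x00, 0x80) into 4 buckets: [00,20), [20,40), [40,60), [60,80)
        for (byte[][] b : split(new byte[]{0x00}, new byte[]{(byte) 0x80}, 4)) {
            System.out.printf("[%02x, %02x)%n", b[0][0] & 0xFF, b[1][0] & 0xFF);
        }
    }
}
```

As Richard notes later in the thread, equal-width buckets only translate into equal work if the row keys are roughly uniform over the range.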

On 20 February 2017 at 20:21, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com> wrote:

> [quoted text trimmed]

Re: Parallel Scanner

2017-02-20 Thread ramkrishna vasudevan
If you are trying to scan one region itself in parallel, then I got you wrong
too. Richard's suggestion is the right choice for a client-only solution.

On Mon, Feb 20, 2017 at 7:40 PM, Anil <anilk...@gmail.com> wrote:

> [quoted text trimmed]

Re: Parallel Scanner

2017-02-20 Thread Anil
Thanks Richard :)

On 20 February 2017 at 18:56, Richard Startin <richardstar...@outlook.com>
wrote:

> RegionLocator is not deprecated, hence the suggestion to use it if it's
> available in place of whatever is still available on HTable for your
> version of HBase - it will make upgrades easier. For instance
> HTable::getRegionsInRange no longer exists on the current master branch.
>
>
> "I am trying to scan a region in parallel :)"
>
>
> I thought you were asking about scanning many regions at the same time,
> not scanning a single region in parallel? HBASE-1935 is about parallelising
> scans over regions, not within regions.
>
>
> If you want to parallelise within a region, you could write a little
> method to split the first and last key of the region into several disjoint
> lexicographic buckets and create a scan for each bucket, then execute those
> scans in parallel. Your data probably doesn't distribute uniformly over
> lexicographic buckets though so the scans are unlikely to execute at a
> constant rate and you'll get results in time proportional to the
> lexicographic bucket with the highest cardinality in the region. I'd be
> interested to know if anyone on the list has ever tried this and what the
> results were?
>
>
> Using the much simpler approach of parallelising over regions by creating
> multiple disjoint scans client side, as suggested, your performance now
> depends on your regions which you have some control over. You can achieve
> the same effect by pre-splitting your table such that you empirically
> optimise read performance for the dataset you store.
>
>
> Thanks,
>
> Richard
>
>
> ____________
> From: Anil <anilk...@gmail.com>
> Sent: 20 February 2017 12:35
> To: user@hbase.apache.org
> Subject: Re: Parallel Scanner
>
> Thanks Richard.
>
> I am able to get the regions for data to be loaded from table. I am trying
> to scan a region in parallel :)
>
> Thanks
>
> On 20 February 2017 at 16:44, Richard Startin <richardstar...@outlook.com>
> wrote:
>
> > For a client only solution, have you looked at the RegionLocator
> > interface? It gives you a list of pairs of byte[] (the start and stop
> > keys for each region). You can easily use a ForkJoinPool recursive task
> > or a Java 8 parallel stream over that list. I implemented a Spark RDD to
> > do that and wrote about it with code samples here:
> >
> > https://richardstartin.com/2016/11/07/co-locating-spark-partitions-with-hbase-regions/
> >
> > Forget about the Spark details in the post (and forget that Hortonworks
> > have a library to do the same thing :)). The idea of creating one scan
> > per region and setting scan starts and stops from the region locator
> > would give you a parallel scan. Note you can also group the scans by
> > region server.
> >
> > Cheers,
> > Richard
> > On 20 Feb 2017, at 07:33, Anil <anilk...@gmail.com<mailto:ani
> > lk...@gmail.com>> wrote:
> >
> > Thanks Ram. I will look into EndPoints.
> >
> > On 20 February 2017 at 12:29, ramkrishna vasudevan <
> > ramkrishna.s.vasude...@gmail.com<mailto:ramkrishna.s.vasude...@gmail.com
> >>
> > wrote:
> >
> > Yes. There is way.
> >
> > Have you seen Endpoints? Endpoints are triggers like points that allows
> > your client to trigger them parallely in one ore more regions using the
> > start and end key of the region. This executes parallely and then you may
> > have to sort out the results as per your need.
> >
> > But these endpoints have to running on your region servers and it is not
> a
> > client only soln.
> > https://blogs.apache.org/hbase/entry/coprocessor_introduction.
> [https://blogs.apache.org/hbase/mediaresource/60b135e5-
> 04c6-4197-b262-e7cd08de784b]<https://blogs.apache.org/hbase/
> entry/coprocessor_introduction>
>
> Coprocessor Introduction : Apache HBase<https://blogs.apache.
> org/hbase/entry/coprocessor_introduction>
> blogs.apache.org
> Coprocessor Introduction. Authors: Trend Micro Hadoop Group: Mingjie Lai,
> Eugene Koontz, Andrew Purtell (The original version of the blog was posted
> at http ...
>
>
>
> >
> > Be careful when you use them. Since these endpoints run on server ensure
> > that these are not heavy or things that consume more memory which can
> have
> > adverse effects on the server.
> >
> >
> > Regards
> > Ram
> >
> > On Mon, Feb 20, 2017 at 12:18 PM, Anil <anilk...@gmail.com<mailto:ani
> > lk...@gmail.com>> wrote:
> >
> > Th

Re: Parallel Scanner

2017-02-20 Thread Anil
Thanks Richard.

I am able to get the regions for the data to be loaded from the table. I am
trying to scan a region in parallel :)

Thanks

On 20 February 2017 at 16:44, Richard Startin 
wrote:



Re: Parallel Scanner

2017-02-20 Thread Richard Startin
For a client-only solution, have you looked at the RegionLocator interface? It 
gives you a list of pairs of byte[] (the start and stop keys for each region). 
You can easily use a ForkJoinPool recursive task or a Java 8 parallel stream over 
that list. I implemented a Spark RDD to do that and wrote about it with code 
samples here:

https://richardstartin.com/2016/11/07/co-locating-spark-partitions-with-hbase-regions/

Forget about the Spark details in the post (and forget that Hortonworks have a 
library to do the same thing :)). The idea of creating one scan per region and 
setting scan starts and stops from the region locator would give you a parallel 
scan. Note you can also group the scans by region server.

Cheers,
Richard
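[Editor's note] The pattern Richard describes can be sketched without a running cluster. In this illustration, `fetchRegionBoundaries` and `scanRange` are hypothetical stand-ins for `RegionLocator#getStartEndKeys` and for draining a bounded Scan via `Table#getScanner`; the region boundaries and sample rows are invented, and the empty string plays the role of HBase's empty start/stop key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ParallelRegionScan {
    // Hypothetical stand-in for RegionLocator#getStartEndKeys: one
    // [startKey, stopKey) pair per region; "" means unbounded, as in HBase.
    static List<String[]> fetchRegionBoundaries() {
        List<String[]> regions = new ArrayList<>();
        regions.add(new String[] { "", "g" });
        regions.add(new String[] { "g", "p" });
        regions.add(new String[] { "p", "" });
        return regions;
    }

    // Hypothetical stand-in for opening a Scan bounded by [start, stop)
    // and draining its ResultScanner.
    static List<String> scanRange(String start, String stop) {
        List<String> rows = new ArrayList<>();
        for (String row : new String[] { "apple", "kiwi", "zebra" }) {
            boolean afterStart = start.isEmpty() || row.compareTo(start) >= 0;
            boolean beforeStop = stop.isEmpty() || row.compareTo(stop) < 0;
            if (afterStart && beforeStop) rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Re-fetch boundaries before every load, in case a region has split.
        ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();
        fetchRegionBoundaries().parallelStream()
                .forEach(r -> results.addAll(scanRange(r[0], r[1])));
        // Because the ranges are disjoint, each row lands in exactly one scan.
        System.out.println(results.size());
    }
}
```

In real HBase client code the stream body would build one `Scan` per pair, set `withStartRow`/`withStopRow`, and push the results into the in-memory database.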
On 20 Feb 2017, at 07:33, Anil wrote:

Thanks Ram. I will look into EndPoints.


Re: Parallel Scanner

2017-02-19 Thread ramkrishna vasudevan
Yes. There is a way.

Have you seen Endpoints? Endpoints are trigger-like coprocessor hooks that allow
your client to invoke them in parallel on one or more regions using each
region's start and end keys. They execute in parallel, and then you may have to
sort the results as your use case requires.

But these endpoints have to be running on your region servers, so this is not a
client-only solution.
https://blogs.apache.org/hbase/entry/coprocessor_introduction

Be careful when you use them. Since these endpoints run on the server, ensure
they are not heavy and do not consume too much memory, which can have adverse
effects on the server.


Regards
Ram
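[Editor's note] To make the fan-out-and-merge pattern above concrete without a cluster, here is a self-contained sketch. `callEndpointOnRegion` is a hypothetical stand-in for the coprocessor endpoint RPC (in real HBase client code you would go through `Table#coprocessorService`); the per-region row sets and class name are invented for illustration. It shows the point Ram makes: each region returns its partial answer, and the client still has to merge and sort.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class EndpointFanOut {
    // Hypothetical stand-in for a coprocessor endpoint RPC: each region
    // computes its partial answer server-side and returns it sorted.
    static List<Integer> callEndpointOnRegion(List<Integer> regionRows) {
        List<Integer> sorted = new ArrayList<>(regionRows);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) throws Exception {
        // One invented row set per region; the client fans the call out in parallel.
        List<List<Integer>> regions = List.of(List.of(9, 1), List.of(4, 7), List.of(2, 8));
        ExecutorService pool = Executors.newFixedThreadPool(regions.size());
        List<Future<List<Integer>>> futures = new ArrayList<>();
        for (List<Integer> region : regions) {
            futures.add(pool.submit(() -> callEndpointOnRegion(region)));
        }
        // The client must merge/sort the per-region results itself.
        List<Integer> merged = new ArrayList<>();
        for (Future<List<Integer>> f : futures) merged.addAll(f.get());
        Collections.sort(merged);
        pool.shutdown();
        System.out.println(merged);
    }
}
```

The heavy lifting (filter, aggregate) happens server-side in a real endpoint, which is exactly why Ram warns about memory pressure on the region servers.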

On Mon, Feb 20, 2017 at 12:18 PM, Anil wrote:


Re: Parallel Scanner

2017-02-19 Thread Anil
Thanks Ram.

So, you mean there is no harm in using HTable#getRegionsInRange in the
application code.

HTable#getRegionsInRange returned a single entry for all my region start and
end keys. I need to explore this more.

"If you know the table region's start and end keys you could create
parallel scans in your application code."  - Is there any way to scan a
region in the application code other than the one I put in the original
email?

"One thing to watch out for is that if there is a split in the region then
the start and end rows may change, so it is better to fetch the regions
again every time before you issue a scan"
 - Agree. I am dynamically determining the region start and end keys
before initiating scan operations for every initial load.

Thanks.




On 20 February 2017 at 10:59, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com> wrote:



Re: Parallel Scanner

2017-02-19 Thread ramkrishna vasudevan
Hi Anil,

HBase directly does not provide parallel scans. If you know the table
region's start and end keys you could create parallel scans in your
application code.

In the above code snippet, the intent is right - you get the required
regions and can issue parallel scans from your app.

One thing to watch out for is that if there is a split in the region then the
start and end rows may change, so it is better to fetch the regions again every
time before you issue a scan. Does that make sense to you?

Regards
Ram

On Sat, Feb 18, 2017 at 1:44 PM, Anil  wrote:

> Hi,
>
> I am building a use case where I have to load the HBase data into an
> in-memory database (IMDB). I am scanning each region and loading the data
> into IMDB.
>
> I am looking at the parallel scanner
> ( https://issues.apache.org/jira/browse/HBASE-8504, HBASE-1935 ) to reduce
> the load time; HTable#getRegionsInRange(byte[] startKey, byte[] endKey,
> boolean reload) is deprecated, and HBASE-1935 is still open.
>
> I see the Connection from ConnectionFactory is HConnectionImplementation
> by default and creates HTable instances.
>
> Do you see any issues in using the HTable from the Table instance?
>
> // pseudocode: derive one [startRow, endRow) scan range per region
> List<HRegionLocation> regions =
>     hTable.getRegionsInRange(scans.getStartRow(), scans.getStopRow(), true);
> int i = 0;
> for (HRegionLocation region : regions) {
>     byte[] startRow = i == 0 ? scans.getStartRow()
>                              : region.getRegionInfo().getStartKey();
>     i++;
>     byte[] endRow = i == regions.size() ? scans.getStopRow()
>                                         : region.getRegionInfo().getEndKey();
> }
>
> Are there any alternatives to achieve a parallel scan? Thanks.
>
> Thanks
>