Hi Ted,

Thanks. I will go through the Phoenix code.
Thanks.

On 20 February 2017 at 21:50, Ted Yu <yuzhih...@gmail.com> wrote:

> Please read https://phoenix.apache.org/update_statistics.html
>
> FYI
>
> On Mon, Feb 20, 2017 at 8:14 AM, Anil <anilk...@gmail.com> wrote:
>
> > Hi Ted,
> >
> > its very difficult to predict the data distribution. we store parent to
> > child relationships in the table. (Note : A parent record is child for
> > itself )
> >
> > we set the max hregion file size as 10gb. I don't think we have any
> > control on region size :(
> >
> > Thanks
> >
> > On 20 February 2017 at 21:24, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > Among the 5 columns, do you know roughly the data distribution ?
> > >
> > > You should put the columns whose data distribution is relatively even
> > > first. Of course, there may be business requirement which you take into
> > > consideration w.r.t. the composite key.
> > >
> > > If you cannot change the schema, do you have control over the region
> > > size ? Smaller region may lower the variance in data distribution per
> > > region.
> > >
> > > On Mon, Feb 20, 2017 at 7:47 AM, Anil <anilk...@gmail.com> wrote:
> > >
> > > > Hi Ted,
> > > >
> > > > Current region size is 10 GB.
> > > >
> > > > Hbase row key designed like a phoenix primary key. I can say it is
> > > > like 5 column composite key. Prefix for a common set of data would
> > > > have same first prefix. I am not sure how to convey the data
> > > > distribution.
> > > >
> > > > Thanks.
> > > >
> > > > On 20 February 2017 at 20:48, Ted Yu <yuzhih...@gmail.com> wrote:
> > > >
> > > > > Anil:
> > > > > What's the current region size you use ?
> > > > >
> > > > > Given a region, do you have some idea how the data is distributed
> > > > > within the region ?
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Mon, Feb 20, 2017 at 7:14 AM, Anil <anilk...@gmail.com> wrote:
> > > > >
> > > > > > i understand my original post now :) Sorry about that.
> > > > > >
> > > > > > now the challenge is to split a start key and end key at client
> > > > > > side to allow parallel scans on table with no buckets, pre-salting.
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > On 20 February 2017 at 20:21, ramkrishna vasudevan
> > > > > > <ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > >
> > > > > > > You are trying to scan one region itself in parallel, then even
> > > > > > > I got you wrong. Richard's suggestion is the right choice for
> > > > > > > client only soln.
> > > > > > >
> > > > > > > On Mon, Feb 20, 2017 at 7:40 PM, Anil <anilk...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks Richard :)
> > > > > > > >
> > > > > > > > On 20 February 2017 at 18:56, Richard Startin
> > > > > > > > <richardstar...@outlook.com> wrote:
> > > > > > > >
> > > > > > > > > RegionLocator is not deprecated, hence the suggestion to use
> > > > > > > > > it if it's available in place of whatever is still available
> > > > > > > > > on HTable for your version of HBase - it will make upgrades
> > > > > > > > > easier. For instance HTable::getRegionsInRange no longer
> > > > > > > > > exists on the current master branch.
> > > > > > > > >
> > > > > > > > > "I am trying to scan a region in parallel :)"
> > > > > > > > >
> > > > > > > > > I thought you were asking about scanning many regions at the
> > > > > > > > > same time, not scanning a single region in parallel?
> > > > > > > > > HBASE-1935 is about parallelising scans over regions, not
> > > > > > > > > within regions.
> > > > > > > > >
> > > > > > > > > If you want to parallelise within a region, you could write a
> > > > > > > > > little method to split the first and last key of the region
> > > > > > > > > into several disjoint lexicographic buckets and create a scan
> > > > > > > > > for each bucket, then execute those scans in parallel. Your
> > > > > > > > > data probably doesn't distribute uniformly over lexicographic
> > > > > > > > > buckets though so the scans are unlikely to execute at a
> > > > > > > > > constant rate and you'll get results in time proportional to
> > > > > > > > > the lexicographic bucket with the highest cardinality in the
> > > > > > > > > region. I'd be interested to know if anyone on the list has
> > > > > > > > > ever tried this and what the results were?
> > > > > > > > >
> > > > > > > > > Using the much simpler approach of parallelising over regions
> > > > > > > > > by creating multiple disjoint scans client side, as suggested,
> > > > > > > > > your performance now depends on your regions which you have
> > > > > > > > > some control over. You can achieve the same effect by
> > > > > > > > > pre-splitting your table such that you empirically optimise
> > > > > > > > > read performance for the dataset you store.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Richard
> > > > > > > > >
> > > > > > > > > ________________________________
> > > > > > > > > From: Anil <anilk...@gmail.com>
> > > > > > > > > Sent: 20 February 2017 12:35
> > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > Subject: Re: Parallel Scanner
> > > > > > > > >
> > > > > > > > > Thanks Richard.
> > > > > > > > >
> > > > > > > > > I am able to get the regions for data to be loaded from table.
> > > > > > > > > I am trying to scan a region in parallel :)
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > > > On 20 February 2017 at 16:44, Richard Startin
> > > > > > > > > <richardstar...@outlook.com> wrote:
> > > > > > > > >
> > > > > > > > > > For a client only solution, have you looked at the
> > > > > > > > > > RegionLocator interface? It gives you a list of pairs of
> > > > > > > > > > byte[] (the start and stop keys for each region). You can
> > > > > > > > > > easily use a ForkJoinPool recursive task or java 8 parallel
> > > > > > > > > > stream over that list. I implemented a spark RDD to do that
> > > > > > > > > > and wrote about it with code samples here:
> > > > > > > > > >
> > > > > > > > > > https://richardstartin.com/2016/11/07/co-locating-spark-partitions-with-hbase-regions/
> > > > > > > > > >
> > > > > > > > > > Forget about the spark details in the post (and forget that
> > > > > > > > > > Hortonworks have a library to do the same thing :)) the idea
> > > > > > > > > > of creating one scan per region and setting scan starts and
> > > > > > > > > > stops from the region locator would give you a parallel
> > > > > > > > > > scan. Note you can also group the scans by region server.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Richard
> > > > > > > > > >
> > > > > > > > > > On 20 Feb 2017, at 07:33, Anil <anilk...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Thanks Ram. I will look into EndPoints.
> > > > > > > > > >
> > > > > > > > > > On 20 February 2017 at 12:29, ramkrishna vasudevan
> > > > > > > > > > <ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Yes. There is a way.
> > > > > > > > > >
> > > > > > > > > > Have you seen Endpoints? Endpoints are trigger-like points
> > > > > > > > > > that allow your client to trigger them in parallel in one or
> > > > > > > > > > more regions using the start and end key of the region. This
> > > > > > > > > > executes in parallel and then you may have to sort out the
> > > > > > > > > > results as per your need.
> > > > > > > > > >
> > > > > > > > > > But these endpoints have to be running on your region
> > > > > > > > > > servers and it is not a client only soln.
> > > > > > > > > > https://blogs.apache.org/hbase/entry/coprocessor_introduction
> > > > > > > > > >
> > > > > > > > > > Be careful when you use them.
> > > > > > > > > > Since these endpoints run on server ensure that these are
> > > > > > > > > > not heavy or things that consume more memory which can have
> > > > > > > > > > adverse effects on the server.
> > > > > > > > > >
> > > > > > > > > > Regards
> > > > > > > > > > Ram
> > > > > > > > > >
> > > > > > > > > > On Mon, Feb 20, 2017 at 12:18 PM, Anil <anilk...@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Thanks Ram.
> > > > > > > > > >
> > > > > > > > > > So, you mean that there is no harm in using
> > > > > > > > > > HTable#getRegionsInRange in the application code.
> > > > > > > > > >
> > > > > > > > > > HTable#getRegionsInRange returned single entry for all my
> > > > > > > > > > region start key and end key. i need to explore more on
> > > > > > > > > > this.
> > > > > > > > > >
> > > > > > > > > > "If you know the table region's start and end keys you could
> > > > > > > > > > create parallel scans in your application code." - is there
> > > > > > > > > > any way to scan a region in the application code other than
> > > > > > > > > > the one i put in the original email ?
> > > > > > > > > >
> > > > > > > > > > "One thing to watch out is that if there is a split in the
> > > > > > > > > > region then this start and end row may change so in that
> > > > > > > > > > case it is better you try to get the regions every time
> > > > > > > > > > before you issue a scan"
> > > > > > > > > > - Agree. i am dynamically determining the region start key
> > > > > > > > > > and end key before initiating scan operations for every
> > > > > > > > > > initial load.
> > > > > > > > > >
> > > > > > > > > > Thanks.
> > > > > > > > > >
> > > > > > > > > > On 20 February 2017 at 10:59, ramkrishna vasudevan
> > > > > > > > > > <ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Anil,
> > > > > > > > > >
> > > > > > > > > > HBase directly does not provide parallel scans. If you know
> > > > > > > > > > the table region's start and end keys you could create
> > > > > > > > > > parallel scans in your application code.
> > > > > > > > > >
> > > > > > > > > > In the above code snippet, the intent is right - you get the
> > > > > > > > > > required regions and can issue parallel scans from your app.
> > > > > > > > > >
> > > > > > > > > > One thing to watch out is that if there is a split in the
> > > > > > > > > > region then this start and end row may change so in that
> > > > > > > > > > case it is better you try to get the regions every time
> > > > > > > > > > before you issue a scan. Does that make sense to you?
> > > > > > > > > >
> > > > > > > > > > Regards
> > > > > > > > > > Ram
> > > > > > > > > >
> > > > > > > > > > On Sat, Feb 18, 2017 at 1:44 PM, Anil <anilk...@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Hi ,
> > > > > > > > > >
> > > > > > > > > > I am building an usecase where i have to load the hbase data
> > > > > > > > > > into In-memory database (IMDB). I am scanning the each
> > > > > > > > > > region and loading data into IMDB.
> > > > > > > > > >
> > > > > > > > > > i am looking at parallel scanner (
> > > > > > > > > > https://issues.apache.org/jira/browse/HBASE-8504, HBASE-1935 )
> > > > > > > > > > to reduce the load time and
> > > > > > > > > > HTable#getRegionsInRange(byte[] startKey, byte[] endKey,
> > > > > > > > > > boolean reload) is deprecated, HBASE-1935 is still open.
> > > > > > > > > >
> > > > > > > > > > I see Connection from ConnectionFactory is
> > > > > > > > > > HConnectionImplementation by default and creates HTable
> > > > > > > > > > instance.
> > > > > > > > > >
> > > > > > > > > > Do you see any issues in using HTable from Table instance ?
> > > > > > > > > >
> > > > > > > > > > for each region {
> > > > > > > > > >     int i = 0;
> > > > > > > > > >     List<HRegionLocation> regions =
> > > > > > > > > >         hTable.getRegionsInRange(scans.getStartRow(),
> > > > > > > > > >             scans.getStopRow(), true);
> > > > > > > > > >
> > > > > > > > > >     for (HRegionLocation region : regions) {
> > > > > > > > > >         startRow = i == 0 ? scans.getStartRow()
> > > > > > > > > >             : region.getRegionInfo().getStartKey();
> > > > > > > > > >         i++;
> > > > > > > > > >         endRow = i == regions.size() ? scans.getStopRow()
> > > > > > > > > >             : region.getRegionInfo().getEndKey();
> > > > > > > > > >     }
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > are there any alternatives to achieve parallel scan? Thanks.
> > > > > > > > > >
> > > > > > > > > > Thanks
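
A note on the client-side splitting discussed in this thread: cutting one region's start key and end key into disjoint lexicographic buckets can be sketched in plain Java roughly as below. This is only an illustration under stated assumptions, not code from anyone on the list. The names KeyRangeSplitter and split are made up, both keys are assumed to be non-empty (the open-ended first and last regions of a table would need separate handling), and keys are treated as unsigned big-endian numbers, so the buckets are even in key space and not necessarily in row count.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HBase: splits [start, stop) into up to n
// contiguous, non-overlapping lexicographic sub-ranges.
public class KeyRangeSplitter {

    // Each element of the result is a pair {from, to}; "from" is inclusive
    // and "to" is exclusive, matching the usual start/stop row convention.
    public static List<byte[][]> split(byte[] start, byte[] stop, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        // Right-pad both keys with zero bytes to a common width so that they
        // can be compared and interpolated as unsigned integers.
        int width = Math.max(start.length, stop.length);
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, width));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(stop, width));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start must sort before stop");
        }
        BigInteger span = hi.subtract(lo);

        List<byte[][]> ranges = new ArrayList<>();
        byte[] from = start;
        BigInteger previous = lo;
        for (int i = 1; i < n; i++) {
            BigInteger cut = lo.add(
                span.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n)));
            if (cut.compareTo(previous) <= 0) {
                continue; // range too narrow for n distinct cut points
            }
            byte[] to = toFixedWidth(cut, width);
            ranges.add(new byte[][] {from, to});
            from = to;
            previous = cut;
        }
        // The last bucket always ends at the caller's original stop key.
        ranges.add(new byte[][] {from, stop});
        return ranges;
    }

    // BigInteger.toByteArray() may add a leading sign byte or drop leading
    // zero bytes; normalise back to exactly "width" bytes.
    private static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray();
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }
}
```

Each returned pair could then become the start row and stop row of one scan, with the scans submitted to an executor or a parallel stream. As Richard notes above, skewed data means the overall time is still governed by the fullest bucket, so this is worth measuring against simply scanning one region per task.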