Re: Parallel Scanner

2017-02-20 Thread Anil
Hi Ted, thanks. I will go through the Phoenix code. Thanks. On 20 February 2017 at 21:50, Ted Yu wrote: > Please read https://phoenix.apache.org/update_statistics.html > > FYI > > On Mon, Feb 20, 2017 at 8:14 AM, Anil wrote: > > > Hi Ted, > > > > it's very

Re: Don't Settle for Eventual Consistency

2017-02-20 Thread Edward Capriolo
On Mon, Feb 20, 2017 at 9:29 PM, Edward Capriolo wrote: > AP systems are not available in practice. > > CP systems can be made highly available. > > Sounds like they are arguing AP is not AP, but somehow CP can be AP. > > Then Google can label failures as 'incidents' and

Don't Settle for Eventual Consistency

2017-02-20 Thread Edward Capriolo
AP systems are not available in practice. CP systems can be made highly available. Sounds like they are arguing AP is not AP, but somehow CP can be AP. Then Google can label failures as 'incidents' and CP and AP are unaffected. I swear FoundationDB claimed it solved CAP; too bad FoundationDB

Re: Parallel Scanner

2017-02-20 Thread Ted Yu
Please read https://phoenix.apache.org/update_statistics.html FYI On Mon, Feb 20, 2017 at 8:14 AM, Anil wrote: > Hi Ted, > > it's very difficult to predict the data distribution. We store parent-to > child relationships in the table. (Note: a parent record is a child of >

Re: Parallel Scanner

2017-02-20 Thread Anil
Hi Ted, it's very difficult to predict the data distribution. We store parent-to-child relationships in the table. (Note: a parent record is a child of itself.) We set the max HRegion file size to 10 GB. I don't think we have any control over the region size :( Thanks On 20 February 2017 at 21:24,

Re: Parallel Scanner

2017-02-20 Thread Ted Yu
Among the 5 columns, do you roughly know the data distribution? You should put the columns whose data distribution is relatively even first. Of course, there may be business requirements to take into consideration w.r.t. the composite key. If you cannot change the schema, do you have
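For illustration only (the column names below are hypothetical, not from this thread): leading a composite row key with the component whose values are most evenly distributed spreads rows, and therefore scan ranges, more evenly across regions. A minimal sketch using HBase's Bytes utility:

import org.apache.hadoop.hbase.util.Bytes;

public class CompositeKeyExample {
  // Hypothetical components of a composite key (only three of five shown).
  // The leading component determines how rows spread across regions, so the
  // most evenly distributed value goes first; coarser components follow.
  // Real keys usually need fixed-width or delimited encoding so that the
  // concatenated components still sort correctly.
  static byte[] buildRowKey(String evenlyDistributedId, String parentId, String childId) {
    return Bytes.add(
        Bytes.toBytes(evenlyDistributedId),  // relatively even distribution -> first
        Bytes.toBytes(parentId),
        Bytes.toBytes(childId));
  }
}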

Re: Parallel Scanner

2017-02-20 Thread Anil
Hi Ted, the current region size is 10 GB. The HBase row key is designed like a Phoenix primary key; I can say it is like a 5-column composite key. Rows belonging to a common set of data would share the same first prefix. I am not sure how to convey the data distribution. Thanks. On 20 February 2017 at 20:48, Ted Yu

Re: Parallel Scanner

2017-02-20 Thread Ted Yu
Anil: What's the current region size you use? Given a region, do you have some idea how the data is distributed within the region? Cheers On Mon, Feb 20, 2017 at 7:14 AM, Anil wrote: > I understand my original post now :) Sorry about that. > > Now the challenge is to

Re: Parallel Scanner

2017-02-20 Thread Anil
I understand my original post now :) Sorry about that. Now the challenge is to split a start key and an end key on the client side to allow parallel scans on a table with no buckets or pre-salting. Thanks. On 20 February 2017 at 20:21, ramkrishna vasudevan <ramkrishna.s.vasude...@gmail.com> wrote: >
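A minimal sketch of one way to do that split on the client, using HBase's Bytes.split to interpolate intermediate keys between a range's start and stop key. The method name is illustrative; the interpolation is purely byte-wise, so the sub-ranges are only evenly populated if rows are roughly uniform between the two keys, and the empty boundary keys of the first and last regions need separate handling:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitRegionRange {
  // Split the [start, stop) range of a single region into numSplits + 1
  // sub-ranges and build one Scan per sub-range.
  static List<Scan> splitIntoScans(byte[] start, byte[] stop, int numSplits) {
    // Bytes.split returns the boundary keys, including start and stop
    // themselves; it expects non-empty keys with start < stop.
    byte[][] boundaries = Bytes.split(start, stop, numSplits);
    List<Scan> scans = new ArrayList<>();
    for (int i = 0; i < boundaries.length - 1; i++) {
      Scan scan = new Scan();
      scan.setStartRow(boundaries[i]);
      scan.setStopRow(boundaries[i + 1]);
      scans.add(scan);
    }
    return scans; // submit each Scan to its own thread / executor task
  }
}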

Re: Parallel Scanner

2017-02-20 Thread ramkrishna vasudevan
If you are trying to scan one region itself in parallel, then even I got you wrong. Richard's suggestion is the right choice for a client-only solution. On Mon, Feb 20, 2017 at 7:40 PM, Anil wrote: > Thanks Richard :) > > On 20 February 2017 at 18:56, Richard Startin

Re: Parallel Scanner

2017-02-20 Thread Anil
Thanks Richard :) On 20 February 2017 at 18:56, Richard Startin wrote: > RegionLocator is not deprecated, hence the suggestion to use it if it's > available in place of whatever is still available on HTable for your > version of HBase - it will make upgrades easier.
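In other words (a sketch, assuming the 1.x client API and a caller-supplied table name): rather than the old key-lookup methods on HTable, obtain a RegionLocator from the Connection, which is the non-deprecated route:

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Pair;

public class RegionBoundaries {
  // Fetch per-region (start key, stop key) pairs via RegionLocator,
  // the replacement for the deprecated HTable key-lookup methods.
  static Pair<byte[][], byte[][]> startEndKeys(Connection connection, String table)
      throws IOException {
    try (RegionLocator locator = connection.getRegionLocator(TableName.valueOf(table))) {
      return locator.getStartEndKeys();
    }
  }
}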

Re: Parallel Scanner

2017-02-20 Thread Anil
Thanks Richard. I am able to get the regions for the data to be loaded from the table. I am trying to scan a region in parallel :) Thanks On 20 February 2017 at 16:44, Richard Startin wrote: > For a client-only solution, have you looked at the RegionLocator > interface? It

Re: Parallel Scanner

2017-02-20 Thread Richard Startin
For a client-only solution, have you looked at the RegionLocator interface? It gives you a list of pairs of byte[] (the start and stop keys for each region). You can easily use a ForkJoinPool recursive task or a Java 8 parallel stream over that list. I implemented a Spark RDD to do that and wrote
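A minimal end-to-end sketch of that idea, assuming the 1.x client API, a hypothetical table name, and a row count standing in for real per-row work. Connection is thread-safe but Table is not, so each parallel task opens its own Table for its region:

import java.io.IOException;
import java.util.stream.IntStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Pair;

public class ParallelRegionScan {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("my_table"); // hypothetical table name

    try (Connection conn = ConnectionFactory.createConnection(conf);
         RegionLocator locator = conn.getRegionLocator(tableName)) {

      // One (start key, stop key) pair per region.
      Pair<byte[][], byte[][]> keys = locator.getStartEndKeys();
      byte[][] startKeys = keys.getFirst();
      byte[][] stopKeys = keys.getSecond();

      // Parallel stream over the region boundaries: one scan per region.
      long total = IntStream.range(0, startKeys.length).parallel()
          .mapToLong(i -> scanRange(conn, tableName, startKeys[i], stopKeys[i]))
          .sum();

      System.out.println("rows scanned: " + total);
    }
  }

  private static long scanRange(Connection conn, TableName tableName,
                                byte[] start, byte[] stop) {
    Scan scan = new Scan();
    scan.setStartRow(start);  // empty byte[] means "from the beginning of the table"
    scan.setStopRow(stop);    // empty byte[] means "to the end of the table"
    long count = 0;
    try (Table table = conn.getTable(tableName);
         ResultScanner scanner = table.getScanner(scan)) {
      for (Result ignored : scanner) {
        count++;  // replace with real per-row processing
      }
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    return count;
  }
}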