Re: Parallel Scanner

2017-02-20 Thread Anil
gion, you could write a little method to split the first and last key of the region into several disjoint lexicograph

Re: Parallel Scanner

2017-02-20 Thread Ted Yu
split the first and last key of the region into several disjoint lexicographic buckets and create a scan for each bucket, then execute those
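The bucket-splitting idea in this message can be sketched as follows. This is only an illustration in plain Python, not HBase client code: the function name `split_key_range`, the `width` parameter, and the choice to treat keys as zero-padded big-endian integers are all assumptions made for the example. It also assumes both keys are non-empty, which is not true for the first and last regions of a real table.

```python
def split_key_range(start: bytes, end: bytes, buckets: int,
                    width: int = 8) -> list[tuple[bytes, bytes]]:
    """Split [start, end) into up to `buckets` disjoint, contiguous key ranges.

    Hypothetical helper: keys are compared as unsigned bytes and treated as
    fixed-width integers only to pick the interior split points.
    """
    if not start or not end or start >= end:
        raise ValueError("need non-empty keys with start < end")
    if buckets < 1 or width < max(len(start), len(end)):
        raise ValueError("buckets must be >= 1 and width must cover both keys")

    # Right-pad with zero bytes so both keys become fixed-width integers.
    lo = int.from_bytes(start.ljust(width, b"\x00"), "big")
    hi = int.from_bytes(end.ljust(width, b"\x00"), "big")

    # Never ask for more buckets than there are distinct padded keys.
    n = max(1, min(buckets, hi - lo))

    # Interior boundaries lie strictly between the padded start and end.
    interior = [(lo + (hi - lo) * i // n).to_bytes(width, "big")
                for i in range(1, n)]

    # Keep the caller's exact start and end keys on the outer edges.
    keys = [start, *interior, end]
    return list(zip(keys, keys[1:]))


ranges = split_key_range(b"row-000", b"row-999", 3)
```

Each returned pair would become the start and stop row of one scan. The split is uniform in key space only, so it says nothing about how many rows fall into each bucket.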

Re: Parallel Scanner

2017-02-20 Thread Anil
scans in parallel. Your data probably doesn't distribute uniformly over lexicographic buckets though so the scans are unlikely to execute at a constant rate and you'll get results in time proportional to the

Re: Parallel Scanner

2017-02-20 Thread Ted Yu
at a constant rate and you'll get results in time proportional to the lexicographic bucket with the highest cardinality in the region. I'd be interested to know if anyone on the list has ever tried this and what
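The point about the largest bucket can be made concrete with a toy model. The sketch below assumes every bucket is scanned concurrently at the same constant rate, which is a simplification and not a measurement; the function names and the row counts are made up for illustration.

```python
def parallel_scan_seconds(bucket_rows: list[int], rows_per_second: float) -> float:
    """Wall-clock time when every bucket is scanned concurrently.

    Under the equal-rate assumption, the slowest (largest) bucket
    determines when the whole job finishes.
    """
    return max(bucket_rows) / rows_per_second


def skew_factor(bucket_rows: list[int]) -> float:
    """Ratio of the largest bucket to the mean bucket size (1.0 = uniform)."""
    return max(bucket_rows) * len(bucket_rows) / sum(bucket_rows)


# Hypothetical row counts: the same 1,200 rows, split evenly and unevenly.
uniform = [400, 400, 400]
skewed = [100, 100, 1000]

uniform_seconds = parallel_scan_seconds(uniform, rows_per_second=100)
skewed_seconds = parallel_scan_seconds(skewed, rows_per_second=100)
```

With these numbers the skewed split takes as long as its 1,000-row bucket, even though the total row count is unchanged.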

Re: Parallel Scanner

2017-02-20 Thread Anil
what the results were? Using the much simpler approach of parallelising over regions by creating multiple disjoint scans client side, as s

Re: Parallel Scanner

2017-02-20 Thread Ted Yu
ults were? Using the much simpler approach of parallelising over regions by creating multiple disjoint scans client side, as suggested, your performance now depends on your regions which you have some contro

Re: Parallel Scanner

2017-02-20 Thread Anil
ance now depends on your regions which you have some control over. You can achieve the same effect by pre-splitting your table such that you empirically optimise read performance for the dataset you store.

Re: Parallel Scanner

2017-02-20 Thread ramkrishna vasudevan
egion start key and end key before initiating scan operations for every initial load. Thanks. On 20 February 2017 at 10:59, ramkrishna vasudevan <ramkrishna.s.vasu

Re: Parallel Scanner

2017-02-20 Thread Anil
now depends on your regions which you have some control over. You can achieve the same effect by pre-splitting your table such that you empirically optimise read performance for the dataset you store. Thanks, Richard

Re: Parallel Scanner

2017-02-20 Thread Anil
issue parallel scans from your app. One thing to watch out for is that if there is a split in the region then this start and end row may change, so in that case it is better to get the regions every time before you issue a scan. Does that make sense to you?
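A rough sketch of that advice: look up the region boundaries at the moment of each load, then run one scan per region in a thread pool. Everything here is a stand-in. `list_regions` and `scan_range` are hypothetical callables that a real application would back with the HBase client, and the in-memory `ROWS` table exists only so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_region_scan(list_regions, scan_range, max_workers=4):
    """Run one scan per region, using boundaries fetched at call time."""
    # Fetch boundaries now rather than reusing a cached copy, because a
    # region split may have changed the start and end rows since last time.
    regions = list_regions()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in region order, so rows stay sorted.
        parts = pool.map(lambda r: scan_range(*r), regions)
        return [row for part in parts for row in part]


# Stand-in "table" and region layout, for illustration only.
ROWS = {b"a1": 1, b"b1": 2, b"c1": 3, b"d1": 4}
boundaries = [(b"a", b"c"), (b"c", b"e")]


def list_regions():
    return list(boundaries)


def scan_range(start, stop):
    return [(k, v) for k, v in sorted(ROWS.items()) if start <= k < stop]


before_split = parallel_region_scan(list_regions, scan_range)

# Simulate a region split; the next call picks up the new boundaries.
boundaries[:] = [(b"a", b"b"), (b"b", b"c"), (b"c", b"e")]
after_split = parallel_region_scan(list_regions, scan_range)
```

Because the boundaries are read inside the function, a split between two loads changes how the work is divided but not which rows come back.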

Re: Parallel Scanner

2017-02-20 Thread Richard Startin
<anilk...@gmail.com> wrote: Hi, I am building a use case where I have to load the HBase data into an in-memory database (IMDB). I am scanning each region and loading data into IMDB. I am looking at the parallel scanner ( https://issues.apache.org/jira/brows

Re: Parallel Scanner

2017-02-19 Thread ramkrishna vasudevan
you? Regards Ram On Sat, Feb 18, 2017 at 1:44 PM, Anil <anilk...@gmail.com> wrote: Hi, I am building a use case where I have to load the HBase data into an in-memory

Re: Parallel Scanner

2017-02-19 Thread Anil
Regards Ram On Sat, Feb 18, 2017 at 1:44 PM, Anil <anilk...@gmail.com> wrote: Hi, I am building a use case where I have to load the HBase data into an in-memory database (IMDB). I am scanning each region and loading data into

Re: Parallel Scanner

2017-02-19 Thread ramkrishna vasudevan
wrote: Hi, I am building a use case where I have to load the HBase data into an in-memory database (IMDB). I am scanning each region and loading data into IMDB. I am looking at the parallel scanner ( https://issues.apache.org/jira/browse/HBASE-8504, HBASE-193

Parallel Scanner

2017-02-18 Thread Anil
Hi, I am building a use case where I have to load the HBase data into an in-memory database (IMDB). I am scanning each region and loading data into IMDB. I am looking at the parallel scanner ( https://issues.apache.org/jira/browse/HBASE-8504, HBASE-1935 ) to reduce the load time and HTable

Parallel Scanner

2017-02-18 Thread Anil
Hi, I am building a use case where I have to load the HBase data into an in-memory database (IMDB). I am scanning each region and loading data into IMDB. I am looking at the parallel scanner ( https://issues.apache.org/jira/browse/HBASE-8504 ) and HTable#getRegionsInRange(byte[] startKey, byte

Re: HBase parallel scanner performance

2012-05-19 Thread S Ahmed
Great thread for a real-world problem. Michael, it sounds like the initial design was more of a traditional DB solution, whereas with HBase (and NoSQL in general) the design is to denormalize and build your row/cf structure to fit the use case. Disks are cheap, writes are fast, so build your

HBase parallel scanner performance

2012-04-19 Thread Narendra yadala
I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop (4*32 GB RAM and 4*6 TB disk space) cluster. We are using Cloudera distribution for maintaining our cluster. I have a single tweets table in which we store the tweets, one tweet per row (it has millions of rows currently). Now I

Re: HBase parallel scanner performance

2012-04-19 Thread Michel Segel
So in your step 2 you have the following:
    FOREACH row IN TABLE alpha:
        SELECT something FROM TABLE alpha WHERE alpha.url = row.url
Right? And you are wondering why you are getting timeouts? ... ... And how long does it take to do a full table scan? ;-) (there's more, but that's the

Re: HBase parallel scanner performance

2012-04-19 Thread Narendra yadala
Hi Michel Yes, that is exactly what I do in step 2. I am aware of the reason for the scanner timeout exceptions. It is the time between two consecutive invocations of the next call on a specific scanner object. I increased the scanner timeout to 10 min on the region server and still I keep seeing

Re: HBase parallel scanner performance

2012-04-19 Thread Michel Segel
Narendra, Are you trying to solve a real problem, or is this a class project? Your solution doesn't scale. It's a non-starter. 130 seconds for each iteration times 1 million iterations is how long? 130 million seconds, which is ~36000 hours or over 4 years to complete. (the numbers are rough but

RE: HBase parallel scanner performance

2012-04-19 Thread Bijieshan
From: Narendra yadala [mailto:narendra.yad...@gmail.com] Sent: Thursday, April 19, 2012 8:04 PM To: user@hbase.apache.org Subject: Re: HBase parallel scanner performance Hi Michel Yes, that is exactly what I do in step 2. I am aware of the reason for the scanner timeout exceptions

Re: HBase parallel scanner performance

2012-04-19 Thread Narendra yadala
Michael, Thanks for the response. This is a real problem and not a class project. The boxes themselves cost 9k ;) I think there is some difference in understanding of the problem. The table has 2m rows but I am looking at the latest 10k rows only in the outer for loop. Only in the inner for loop I am

Re: HBase parallel scanner performance

2012-04-19 Thread Narendra yadala
, did you keep an eye on the GC logs? Thank you. Regards, Jieshan -Original Message- From: Narendra yadala [mailto:narendra.yad...@gmail.com] Sent: Thursday, April 19, 2012 8:04 PM To: user@hbase.apache.org Subject: Re: HBase parallel scanner performance Hi Michel Yes

Re: HBase parallel scanner performance

2012-04-19 Thread Michael Segel
Narendra, I think you are still missing the point. 130 seconds to scan the table per iteration. Even if you have 10K rows 130 * 10^4 or 1.3*10^6 seconds. ~361 hours Compare that to 10K rows where you then select a single row in your sub select that has a list of all of the associated rows.
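The estimate in this message is a straight multiplication: one full table scan per outer row. A short sketch that redoes the arithmetic, using the 130-second scan time and the row counts quoted in the thread (the function name is made up for the example):

```python
def nested_scan_hours(outer_rows: int, seconds_per_full_scan: float) -> float:
    """Total hours when each outer row triggers one full table scan."""
    return outer_rows * seconds_per_full_scan / 3600


# 10K outer rows at 130 s per scan: 1.3 * 10^6 seconds, about 361 hours.
hours_10k = nested_scan_hours(10_000, 130)

# 1 million outer rows at 130 s per scan: about 36,000 hours, over 4 years.
hours_1m = nested_scan_hours(1_000_000, 130)
years_1m = hours_1m / 24 / 365
```

The cost grows linearly with the number of outer rows, which is why the thread steers toward a lookup by key instead of a scan per row.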

RE: HBase parallel scanner performance

2012-04-19 Thread Bijieshan
the stop-the-world pause time. I don't think parallel scanners is the problem. Jieshan -Original Message- From: Narendra yadala [mailto:narendra.yad...@gmail.com] Sent: Thursday, April 19, 2012 11:24 PM To: user@hbase.apache.org Subject: Re: HBase parallel scanner performance Hi Jieshan

Re: HBase parallel scanner performance

2012-04-19 Thread Narendra yadala
Michael, I will do the redesign and build the index. Thanks a lot for the insights. Narendra On Thu, Apr 19, 2012 at 9:56 PM, Michael Segel michael_se...@hotmail.com wrote: Narendra, I think you are still missing the point. 130 seconds to scan the table per iteration. Even if you have 10K

Re: HBase parallel scanner performance

2012-04-19 Thread Michael Segel
No problem. One of the hardest things to do is to try to be open to other design ideas and not become wedded to one. I think once you get that working you can start to look at your cluster. On Apr 19, 2012, at 1:26 PM, Narendra yadala wrote: Michael, I will do the redesign and build the