Hi Michel,

Yes, that is exactly what I do in step 2. I am aware of the reason for the
scanner timeout exceptions: it is the time between two consecutive next()
calls on a specific scanner object exceeding the lease period. I increased
the scanner timeout to 10 minutes on the region server and I still keep
seeing the timeouts, so I reduced my scanner caching to 128.
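For what it is worth, the scan setup in step 1 looks roughly like the sketch
below. Only the table name, the caching value of 128, the linkcolfamily
family and the 10k-row limit come from what I described; the rest is
schematic.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TweetScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable tweets = new HTable(conf, "tweets");

        Scan scan = new Scan();
        // Fewer rows per client fetch, so the region server hears a next()
        // sooner and the scanner lease does not expire. The 10-minute lease
        // I mentioned is hbase.regionserver.lease.period in hbase-site.xml,
        // if I have the property name right.
        scan.setCaching(128);
        scan.addFamily(Bytes.toBytes("linkcolfamily"));

        ResultScanner scanner = tweets.getScanner(scan);
        try {
            int count = 0;
            for (Result row : scanner) {
                // extract URLs from the row and look up who else shared them
                if (++count >= 10000) break;  // only the first 10k rows are scanned here
            }
        } finally {
            scanner.close();  // release the scanner lease on the region server
            tweets.close();
        }
    }
}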
Full table scan takes 130 seconds and there are 2.2 million rows in the table
as of now. Each row is around 2 KB in size. I measured the time for the full
table scan by issuing the `count` command from the hbase shell.

I kind of understood the fix that you are suggesting (I put a rough sketch of
my understanding at the bottom of this mail), but do I need to change the
table structure to fix this problem? All I do is an n^2 operation and even
that fails with 10 different types of exceptions. It is mildly annoying that
I need to know all the low-level storage details of HBase to do such a simple
operation. And this is happening for just 14 parallel scanners; I am
wondering what would happen when there are thousands of parallel scanners.
Please let me know if there is any configuration parameter change which would
fix this issue.

Thanks a lot
Narendra

On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel <michael_se...@hotmail.com> wrote:

> So in your step 2 you have the following:
>
> FOREACH row IN TABLE alpha:
>     SELECT something
>     FROM TABLE alpha
>     WHERE alpha.url = row.url
>
> Right?
> And you are wondering why you are getting timeouts?
> ...
> ...
> And how long does it take to do a full table scan? ;-)
> (there's more, but that's the first thing you should see...)
>
> Try creating a second table where you invert the URL and key pair such
> that for each URL, you have a set of your alpha table's keys?
>
> Then you have the following...
>
> FOREACH row IN TABLE alpha:
>     FETCH key-set FROM beta
>     WHERE beta.rowkey = alpha.url
>
> Note I use FETCH to signify that you should get a single row in response.
>
> Does this make sense?
> (your second table is actually an index of the URL column in your first
> table)
>
> HTH
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Apr 19, 2012, at 5:43 AM, Narendra yadala <narendra.yad...@gmail.com>
> wrote:
>
> > I have an issue with my HBase cluster. We have a 4-node HBase/Hadoop
> > (4*32 GB RAM and 4*6 TB disk space) cluster. We are using the Cloudera
> > distribution for maintaining our cluster. I have a single tweets table
> > in which we store the tweets, one tweet per row (it has millions of
> > rows currently).
> >
> > Now I try to run a Java batch (not a map reduce) which does the
> > following:
> >
> > 1. Open a scanner over the tweet table and read the tweets one after
> >    another. I set scanner caching to 128 rows as higher scanner caching
> >    is leading to ScannerTimeoutExceptions. I scan over the first 10k
> >    rows only.
> > 2. For each tweet, extract URLs (linkcolfamily:urlvalue) that are there
> >    in that tweet and open another scanner over the tweets table to see
> >    who else shared that link. This involves getting rows having that
> >    URL from the entire table (not first 10k rows).
> > 3. Do similar stuff as in step 2 for hashtags
> >    (hashtagcolfamily:hashtagvalue).
> > 4. Do steps 1-3 in parallel for approximately 7-8 threads. This number
> >    can be higher (thousands also) later.
> >
> > When I ran this batch I got the GC issue which is described here:
> > http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
> > Then I tried to turn on the MSLAB feature and changed the GC settings
> > by specifying the -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC JVM
> > flags. Even after doing this, I am running into all kinds of
> > IOExceptions and SocketTimeoutExceptions.
> >
> > This Java batch keeps approximately 7*2 (14) scanners open at a point
> > in time and still I am running into all kinds of troubles.
> > I am wondering whether I can have thousands of parallel scanners with
> > HBase when I need to scale.
> >
> > It would be great to know whether I can open thousands/millions of
> > scanners in parallel with HBase efficiently.
> >
> > Thanks
> > Narendra
>
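P.S. If I understand your index table idea correctly, the write and read
paths would look roughly like the sketch below. The table name tweets_by_url,
the keys family and the empty cell values are just my guesses at what you
mean; the point is that step 2 becomes a single Get on the index table
instead of another scan over the whole tweets table.

import java.io.IOException;
import java.util.NavigableMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class UrlIndexSketch {
    private static final byte[] FAMILY = Bytes.toBytes("keys");  // assumed family name

    // Record that the tweet stored under tweetRowKey contains this URL.
    // The URL is the row key of the index table; each tweet key becomes a qualifier.
    static void indexUrl(HTable urlIndex, String url, byte[] tweetRowKey) throws IOException {
        Put p = new Put(Bytes.toBytes(url));
        p.add(FAMILY, tweetRowKey, HConstants.EMPTY_BYTE_ARRAY);
        urlIndex.put(p);
    }

    // Answer "who else shared this link" with a single Get instead of a table scan.
    static void printSharers(HTable urlIndex, String url) throws IOException {
        Result r = urlIndex.get(new Get(Bytes.toBytes(url)));
        NavigableMap<byte[], byte[]> cols = r.getFamilyMap(FAMILY);
        if (cols == null) return;  // nobody has shared this URL yet
        for (byte[] otherTweetKey : cols.keySet()) {
            System.out.println(Bytes.toString(otherTweetKey));
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable urlIndex = new HTable(conf, "tweets_by_url");  // assumed index table name
        indexUrl(urlIndex, "http://example.com/x", Bytes.toBytes("tweet-0001"));
        printSharers(urlIndex, "http://example.com/x");
        urlIndex.close();
    }
}

If that is right, the batch would keep one scanner open for step 1 and need
no extra scanners at all for steps 2 and 3, which should take the pressure
off the scanner leases.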