So in your step 2 you have the following:
FOREACH row IN TABLE alpha:
     SELECT something
     FROM TABLE alpha 
     WHERE alpha.url = row.url

Right?
And you are wondering why you are getting timeouts?
...
...
And how long does it take to do a full table scan? ;-)
(there's more, but that's the first thing you should see...)

Try creating a second table that inverts the URL/key pair, so that for each 
URL you have the set of row keys from your alpha table.
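A plain-Java stand-in for that index-building pass might look like the sketch below (the tweet keys and URLs are made up for illustration; in HBase the resulting map is your beta table, with the URL as beta's row key and one column qualifier per alpha row key):

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class UrlIndexSketch {
    // Invert (alpha row key -> URL) into (URL -> set of alpha row keys).
    // In HBase this is one pass over alpha, writing one Put to beta per row:
    // the URL becomes beta's row key, each alpha key a column qualifier.
    static Map<String, Set<String>> invert(Map<String, String> keyToUrl) {
        Map<String, Set<String>> urlToKeys = new TreeMap<String, Set<String>>();
        for (Map.Entry<String, String> e : keyToUrl.entrySet()) {
            Set<String> keys = urlToKeys.get(e.getValue());
            if (keys == null) {
                keys = new TreeSet<String>();
                urlToKeys.put(e.getValue(), keys);
            }
            keys.add(e.getKey());
        }
        return urlToKeys;
    }

    public static void main(String[] args) {
        // Hypothetical alpha rows: tweet key -> URL it contains.
        Map<String, String> alpha = new TreeMap<String, String>();
        alpha.put("tweet-001", "http://a.example/x");
        alpha.put("tweet-002", "http://b.example/y");
        alpha.put("tweet-003", "http://a.example/x");
        System.out.println(invert(alpha));
        // prints {http://a.example/x=[tweet-001, tweet-003], http://b.example/y=[tweet-002]}
    }
}
```

You pay the cost of that one full pass once, instead of once per tweet.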

Then you have the following...
FOREACH row IN TABLE alpha:
   FETCH key-set FROM beta 
   WHERE beta.rowkey = alpha.url

Note I use FETCH to signify that you should get a single row in response.

Does this make sense?
(Your second table is actually an index of the URL column in your first table.)
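To see why this helps, here is the lookup side of the same sketch, again in plain Java with hypothetical names: with the index in hand, finding everyone who shared a URL is a single keyed fetch (in HBase, a single-row Get against beta) rather than a scan over the whole alpha table.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class UrlLookupSketch {
    // Stand-in for the beta table: URL -> set of alpha row keys.
    private final Map<String, Set<String>> beta =
            new HashMap<String, Set<String>>();

    // Stand-in for writing one index row while building beta.
    void index(String alphaKey, String url) {
        Set<String> keys = beta.get(url);
        if (keys == null) {
            keys = new TreeSet<String>();
            beta.put(url, keys);
        }
        keys.add(alphaKey);
    }

    // One keyed fetch, analogous to an HBase Get on beta by row key,
    // instead of opening a scanner over all of alpha for every tweet.
    Set<String> whoShared(String url) {
        Set<String> keys = beta.get(url);
        return keys == null ? Collections.<String>emptySet() : keys;
    }

    public static void main(String[] args) {
        UrlLookupSketch sketch = new UrlLookupSketch();
        sketch.index("tweet-001", "http://a.example/x");
        sketch.index("tweet-003", "http://a.example/x");
        System.out.println(sketch.whoShared("http://a.example/x"));
        // prints [tweet-001, tweet-003]
    }
}
```

The inner loop of your batch goes from O(size of alpha) per tweet to a single row lookup per tweet.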

HTH 

Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 19, 2012, at 5:43 AM, Narendra yadala <narendra.yad...@gmail.com> wrote:

> I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop (4*32
> GB RAM and 4*6 TB disk space) cluster. We are using Cloudera distribution
> for maintaining our cluster. I have a single tweets table in which we store
> the tweets, one tweet per row (it has millions of rows currently).
> 
> Now I try to run a Java batch (not a MapReduce job) which does the following:
> 
>   1. Open a scanner over the tweet table and read the tweets one after
>   another. I set scanner caching to 128 rows as higher scanner caching is
>   leading to ScannerTimeoutExceptions. I scan over the first 10k rows only.
>   2. For each tweet, extract URLs (linkcolfamily:urlvalue) that are there
>   in that tweet and open another scanner over the tweets table to see who
>   else shared that link. This involves getting rows having that URL from the
>   entire table (not first 10k rows).
>   3. Do similar stuff as in step 2 for hashtags
>   (hashtagcolfamily:hashtagvalue).
>   4. Do steps 1-3 in parallel for approximately 7-8 threads. This number
>   can be higher (thousands also) later.
> 
> 
> When I run this batch I ran into the GC issue described here:
> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
> Then I tried to turn on the MSLAB feature and changed the GC settings by
> specifying  -XX:+UseParNewGC  and  -XX:+UseConcMarkSweepGC JVM flags.
> Even after doing this, I am running into all kinds of IOExceptions
> and SocketTimeoutExceptions.
> 
> This Java batch keeps approximately 7*2 (14) scanners open at a point in
> time, and still I am running into all kinds of trouble. I am wondering
> whether I can have thousands of parallel scanners with HBase when I need to
> scale.
> 
> It would be great to know whether I can open thousands/millions of scanners
> in parallel with HBase efficiently.
> 
> Thanks
> Narendra
