So in your step 2 you have the following:

    FOREACH row IN TABLE alpha:
        SELECT something FROM TABLE alpha WHERE alpha.url = row.url
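To see what that inner SELECT costs, here is the same nested-scan pattern simulated in plain Java with an in-memory map standing in for the HBase table (the table contents and names are made up for illustration):

```java
import java.util.*;

// The nested-scan pattern, simulated in memory: for every row in the
// outer loop, the inner "scanner" walks the whole table again.
public class NestedScanSketch {
    static int rowsRead = 0;

    // Returns rowkeys of rows whose URL matches -- by scanning every row,
    // which is what "WHERE alpha.url = row.url" does without an index.
    static List<String> fullScanFor(String url, Map<String, String> table) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> row : table.entrySet()) {
            rowsRead++;  // every inner lookup touches all N rows
            if (row.getValue().equals(url)) {
                hits.add(row.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> alpha = new LinkedHashMap<>(); // rowkey -> url
        for (int i = 0; i < 1000; i++) {
            alpha.put("tweet" + i, "http://u" + (i % 10));
        }
        for (String rowkey : alpha.keySet()) {     // outer scan
            fullScanFor(alpha.get(rowkey), alpha); // inner full table scan
        }
        System.out.println(rowsRead); // prints 1000000: N outer rows * N inner rows
    }
}
```

With a real table of millions of rows, each of the 10k outer rows triggers a scan over all of them, so the timeouts fall out of the arithmetic.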
Right? And you are wondering why you are getting timeouts?

And how long does it take to do a full table scan? ;-)

(there's more, but that's the first thing you should see...)

Try creating a second table where you invert the URL and key pair, such that for each URL you have a set of your alpha table's keys. Then you have the following:

    FOREACH row IN TABLE alpha:
        FETCH key-set FROM beta WHERE beta.rowkey = alpha.url

Note I use FETCH to signify that you should get a single row in response.

Does this make sense? (Your second table is actually an index of the URL column in your first table.)

HTH

Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 19, 2012, at 5:43 AM, Narendra yadala <narendra.yad...@gmail.com> wrote:

> I have an issue with my HBase cluster. We have a 4-node HBase/Hadoop
> cluster (4*32 GB RAM and 4*6 TB disk space). We are using the Cloudera
> distribution to maintain our cluster. I have a single tweets table in
> which we store the tweets, one tweet per row (it currently has millions
> of rows).
>
> Now I try to run a Java batch (not a map reduce) which does the following:
>
> 1. Open a scanner over the tweets table and read the tweets one after
>    another. I set scanner caching to 128 rows, as higher scanner caching
>    leads to ScannerTimeoutExceptions. I scan over the first 10k rows only.
> 2. For each tweet, extract the URLs (linkcolfamily:urlvalue) in that
>    tweet and open another scanner over the tweets table to see who else
>    shared that link. This involves getting the rows having that URL from
>    the entire table (not just the first 10k rows).
> 3. Do similar stuff as in step 2 for hashtags
>    (hashtagcolfamily:hashtagvalue).
> 4. Do steps 1-3 in parallel for approximately 7-8 threads. This number
>    can be higher (thousands also) later.
>
> When I run this batch I hit the GC issue which is described here:
> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
> I then tried turning on the MSLAB feature and changed the GC settings by
> specifying the -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC JVM flags.
> Even after doing this, I am running into all kinds of IOExceptions
> and SocketTimeoutExceptions.
>
> This Java batch has approximately 7*2 (14) scanners open at a point in
> time and still I am running into all kinds of trouble. I am wondering
> whether I can have thousands of parallel scanners with HBase when I need
> to scale.
>
> It would be great to know whether I can open thousands/millions of
> scanners in parallel with HBase efficiently.
>
> Thanks
> Narendra
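The inverted-index ("beta") table Mike describes can be sketched in plain Java, again simulated with in-memory maps instead of real HBase tables; the table and key names here are hypothetical stand-ins:

```java
import java.util.*;

// Sketch of the inverted-index idea: beta maps each URL back to the set
// of alpha rowkeys that contain it, so "who shared this link" becomes a
// single keyed FETCH instead of a full table scan.
public class InvertedIndexSketch {
    // alpha: tweet rowkey -> set of URLs in that tweet
    static Map<String, Set<String>> alpha = new HashMap<>();
    // beta: URL -> set of alpha rowkeys sharing that URL
    static Map<String, Set<String>> beta = new HashMap<>();

    // Populate beta from alpha. In practice this would be a one-off batch
    // (or maintained incrementally as tweets are written).
    static void buildIndex() {
        beta.clear();
        for (Map.Entry<String, Set<String>> row : alpha.entrySet()) {
            for (String url : row.getValue()) {
                beta.computeIfAbsent(url, k -> new HashSet<>()).add(row.getKey());
            }
        }
    }

    // The FETCH: one keyed lookup, one "row" back.
    static Set<String> whoShared(String url) {
        return beta.getOrDefault(url, Collections.emptySet());
    }

    public static void main(String[] args) {
        alpha.put("tweet1", new HashSet<>(Arrays.asList("http://a", "http://b")));
        alpha.put("tweet2", new HashSet<>(Collections.singletonList("http://a")));
        buildIndex();
        System.out.println(whoShared("http://a")); // the rowkeys of both tweets
    }
}
```

In HBase terms, `whoShared` would be a single `Get` against the index table keyed by the URL, replacing the per-tweet inner scanner; the same inverted-table trick applies to the hashtag step.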