Michael, Thanks for the response. This is a real problem and not a class project. Boxes itself costed 9k ;)
I think there is some difference in understanding of the problem. The table has 2m rows but I am looking at the latest 10k rows only in the outer for loop. Only in the inner for loop i am trying to get all rows that contain the url that is given by the row in the outer for loop. So pseudo code is like this All scanners have a caching of 128. Scanner outerScanner = tweetTable.getScanner(new Scan()); //This gets the entire row for (int index = 0; index < 10000; index++) { Result tweet = outerScanner.next(); NavigableMap<byte[],byte[]> linkFamilyMap = tweet.getFamilyMap(Bytes.toBytes("link")); String url = Bytes.toString( linkFamilyMap.firstKey()); //assuming only one link is there in the tweet. Scan linkScan = new Scan(); linkScan.addColumn(Bytes.toBytes("link"), Bytes.toBytes(url)); //get only the link column family Scanner linkScanner = tweetTable.getScanner(linkScan); //ideally this for loop is taking 2 sec per sc for (Result linkResult = linkScanner.next(); linkResult != null; linkResult = linkScanner.next()) { //do something with the link } linkScanner.close(); //do a similar for loop for hashtags } Each of my inner for loop is taking around 20 seconds (or more depending on number of rows returned by that particular scanner) for each of the 10k rows that I am processing and this is also triggering a lot of GC in turn. So it is 10000*40 seconds (4 days) for each thread. But the problem is that the batch process crashes before completion throwing IOException and SocketTimeoutException and sometimes GC OutOfMemory exceptions. I will definitely take the much elegant approach that you mentioned eventually. I just wanted to get to the core of the issue before choosing the solution. Thanks again. Narendra On Thu, Apr 19, 2012 at 7:42 PM, Michel Segel <michael_se...@hotmail.com>wrote: > Narendra, > > Are you trying to solve a real problem, or is this a class project? > > Your solution doesn't scale. It's a non starter. 130 seconds for each > iteration times 1 million seconds is how long? 130 million seconds, which > is ~36000 hours or over 4 years to complete. > (the numbers are rough but you get the idea...) > > That's assuming that your table is static and doesn't change. > > I didn't even ask if you were attempting any sort of server side filtering > which would reduce the amount of data you send back to the client because > it a moot point. > > Finer tuning is also moot. > > So you insert a row in one table. You then do n^2 operations to pull out > data. > The better solution is to insert data into 2 tables where you then have to > do 2n operations to get the same results. Thats per thread btw. So if you > were running 10 threads, you would have 10n^2 operations versus 20n > operations to get the same result set. > > A million row table... 1*10^13. Vs 2*10^6 > > I don't believe I mentioned anything about HBase's internals and this > solution works for any NoSQL database. > > > Sent from a remote device. Please excuse any typos... > > Mike Segel > > On Apr 19, 2012, at 7:03 AM, Narendra yadala <narendra.yad...@gmail.com> > wrote: > > > Hi Michel > > > > Yes, that is exactly what I do in step 2. I am aware of the reason for > the > > scanner timeout exceptions. It is the time between two consecutive > > invocations of the next call on a specific scanner object. I increased > the > > scanner timeout to 10 min on the region server and still I keep seeing > the > > timeouts. So I reduced my scanner cache to 128. > > > > Full table scan takes 130 seconds and there are 2.2 million rows in the > > table as of now. Each row is around 2 KB in size. I measured time for the > > full table scan by issuing `count` command from the hbase shell. > > > > I kind of understood the fix that you are specifying, but do I need to > > change the table structure to fix this problem? All I do is a n^2 > operation > > and even that fails with 10 different types of exceptions. It is mildly > > annoying that I need to know all the low level storage details of HBase > to > > do such a simple operation. And this is happening for just 14 parallel > > scanners. I am wondering what would happen when there are thousands of > > parallel scanners. > > > > Please let me know if there is any configuration param change which would > > fix this issue. > > > > Thanks a lot > > Narendra > > > > On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel <michael_se...@hotmail.com > >wrote: > > > >> So in your step 2 you have the following: > >> FOREACH row IN TABLE alpha: > >> SELECT something > >> FROM TABLE alpha > >> WHERE alpha.url = row.url > >> > >> Right? > >> And you are wondering why you are getting timeouts? > >> ... > >> ... > >> And how long does it take to do a full table scan? ;-) > >> (there's more, but that's the first thing you should see...) > >> > >> Try creating a second table where you invert the URL and key pair such > >> that for each URL, you have a set of your alpha table's keys? > >> > >> Then you have the following... > >> FOREACH row IN TABLE alpha: > >> FETCH key-set FROM beta > >> WHERE beta.rowkey = alpha.url > >> > >> Note I use FETCH to signify that you should get a single row in > response. > >> > >> Does this make sense? > >> ( your second table is actually and index of the URL column in your > first > >> table) > >> > >> HTH > >> > >> Sent from a remote device. Please excuse any typos... > >> > >> Mike Segel > >> > >> On Apr 19, 2012, at 5:43 AM, Narendra yadala <narendra.yad...@gmail.com > > > >> wrote: > >> > >>> I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop > >> (4*32 > >>> GB RAM and 4*6 TB disk space) cluster. We are using Cloudera > distribution > >>> for maintaining our cluster. I have a single tweets table in which we > >> store > >>> the tweets, one tweet per row (it has millions of rows currently). > >>> > >>> Now I try to run a Java batch (not a map reduce) which does the > >> following : > >>> > >>> 1. Open a scanner over the tweet table and read the tweets one after > >>> another. I set scanner caching to 128 rows as higher scanner caching > is > >>> leading to ScannerTimeoutExceptions. I scan over the first 10k rows > >> only. > >>> 2. For each tweet, extract URLs (linkcolfamily:urlvalue) that are > there > >>> in that tweet and open another scanner over the tweets table to see > who > >>> else shared that link. This involves getting rows having that URL from > >> the > >>> entire table (not first 10k rows). > >>> 3. Do similar stuff as in step 2 for hashtags > >>> (hashtagcolfamily:hashtagvalue). > >>> 4. Do steps 1-3 in parallel for approximately 7-8 threads. This number > >>> can be higher (thousands also) later. > >>> > >>> > >>> When I run this batch I got the GC issue which is specified here > >>> > >> > http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ > >>> Then I tried to turn on the MSLAB feature and changed the GC settings > by > >>> specifying -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC JVM flags. > >>> Even after doing this, I am running into all kinds of IOExceptions > >>> and SocketTimeoutExceptions. > >>> > >>> This Java batch opens approximately 7*2 (14) scanners open at a point > in > >>> time and still I am running into all kinds of troubles. I am wondering > >>> whether I can have thousands of parallel scanners with HBase when I > need > >> to > >>> scale. > >>> > >>> It would be great to know whether I can open thousands/millions of > >> scanners > >>> in parallel with HBase efficiently. > >>> > >>> Thanks > >>> Narendra > >> >