Michael, I will do the redesign and build the index. Thanks a lot for the insights.
Narendra On Thu, Apr 19, 2012 at 9:56 PM, Michael Segel <michael_se...@hotmail.com>wrote: > Narendra, > > I think you are still missing the point. > 130 seconds to scan the table per iteration. > Even if you have 10K rows > 130 * 10^4 or 1.3*10^6 seconds. ~361 hours > > Compare that to 10K rows where you then select a single row in your sub > select that has a list of all of the associated rows. > You can then do n number of get()s based on the data in the index. (If > the data wasn't in the index itself) > > Assuming that the data was in the index, that's one get(). This is sub > second. > Just to keep things simple assume 1 second. > That's 10K seconds vs 1.3 million seconds. (2 hours vs 361hours) > Actually its more like 10ms so its 100 seconds to run your code. (So its > like 2 minutes or so) > > Also since you're doing less work, you put less strain on the system. > > Look, you're asking for help. You're fighting to maintain a bad design. > Building the index table shouldn't take you more than a day to think, > design and implement. > > So you tell me, 2 minutes vs 361 hours. Which would you choose? > > HTH > > -Mike > > > On Apr 19, 2012, at 10:04 AM, Narendra yadala wrote: > > > Michael, > > > > Thanks for the response. This is a real problem and not a class project. > > Boxes itself costed 9k ;) > > > > I think there is some difference in understanding of the problem. The > table > > has 2m rows but I am looking at the latest 10k rows only in the outer for > > loop. Only in the inner for loop i am trying to get all rows that contain > > the url that is given by the row in the outer for loop. So pseudo code is > > like this > > > > All scanners have a caching of 128. > > > > Scanner outerScanner = tweetTable.getScanner(new Scan()); //This gets > the > > entire row > > for (int index = 0; index < 10000; index++) { > > Result tweet = outerScanner.next(); > > NavigableMap<byte[],byte[]> linkFamilyMap = > > tweet.getFamilyMap(Bytes.toBytes("link")); > > String url = Bytes.toString( linkFamilyMap.firstKey()); //assuming only > > one link is there in the tweet. > > Scan linkScan = new Scan(); > > linkScan.addColumn(Bytes.toBytes("link"), Bytes.toBytes(url)); //get only > > the link column family > > Scanner linkScanner = tweetTable.getScanner(linkScan); //ideally this for > > loop is taking 2 sec per sc > > for (Result linkResult = linkScanner.next(); linkResult != null; > > linkResult = linkScanner.next()) { > > //do something with the link > > } > > linkScanner.close(); > > > > //do a similar for loop for hashtags > > } > > > > Each of my inner for loop is taking around 20 seconds (or more depending > on > > number of rows returned by that particular scanner) for each of the 10k > > rows that I am processing and this is also triggering a lot of GC in > turn. > > So it is 10000*40 seconds (4 days) for each thread. But the problem is > that > > the batch process crashes before completion throwing IOException and > > SocketTimeoutException and sometimes GC OutOfMemory exceptions. > > > > I will definitely take the much elegant approach that you mentioned > > eventually. I just wanted to get to the core of the issue before choosing > > the solution. > > > > Thanks again. > > Narendra > > > > On Thu, Apr 19, 2012 at 7:42 PM, Michel Segel <michael_se...@hotmail.com > >wrote: > > > >> Narendra, > >> > >> Are you trying to solve a real problem, or is this a class project? > >> > >> Your solution doesn't scale. It's a non starter. 130 seconds for each > >> iteration times 1 million seconds is how long? 130 million seconds, > which > >> is ~36000 hours or over 4 years to complete. > >> (the numbers are rough but you get the idea...) > >> > >> That's assuming that your table is static and doesn't change. > >> > >> I didn't even ask if you were attempting any sort of server side > filtering > >> which would reduce the amount of data you send back to the client > because > >> it a moot point. > >> > >> Finer tuning is also moot. > >> > >> So you insert a row in one table. You then do n^2 operations to pull out > >> data. > >> The better solution is to insert data into 2 tables where you then have > to > >> do 2n operations to get the same results. Thats per thread btw. So if > you > >> were running 10 threads, you would have 10n^2 operations versus 20n > >> operations to get the same result set. > >> > >> A million row table... 1*10^13. Vs 2*10^6 > >> > >> I don't believe I mentioned anything about HBase's internals and this > >> solution works for any NoSQL database. > >> > >> > >> Sent from a remote device. Please excuse any typos... > >> > >> Mike Segel > >> > >> On Apr 19, 2012, at 7:03 AM, Narendra yadala <narendra.yad...@gmail.com > > > >> wrote: > >> > >>> Hi Michel > >>> > >>> Yes, that is exactly what I do in step 2. I am aware of the reason for > >> the > >>> scanner timeout exceptions. It is the time between two consecutive > >>> invocations of the next call on a specific scanner object. I increased > >> the > >>> scanner timeout to 10 min on the region server and still I keep seeing > >> the > >>> timeouts. So I reduced my scanner cache to 128. > >>> > >>> Full table scan takes 130 seconds and there are 2.2 million rows in the > >>> table as of now. Each row is around 2 KB in size. I measured time for > the > >>> full table scan by issuing `count` command from the hbase shell. > >>> > >>> I kind of understood the fix that you are specifying, but do I need to > >>> change the table structure to fix this problem? All I do is a n^2 > >> operation > >>> and even that fails with 10 different types of exceptions. It is mildly > >>> annoying that I need to know all the low level storage details of HBase > >> to > >>> do such a simple operation. And this is happening for just 14 parallel > >>> scanners. I am wondering what would happen when there are thousands of > >>> parallel scanners. > >>> > >>> Please let me know if there is any configuration param change which > would > >>> fix this issue. > >>> > >>> Thanks a lot > >>> Narendra > >>> > >>> On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel < > michael_se...@hotmail.com > >>> wrote: > >>> > >>>> So in your step 2 you have the following: > >>>> FOREACH row IN TABLE alpha: > >>>> SELECT something > >>>> FROM TABLE alpha > >>>> WHERE alpha.url = row.url > >>>> > >>>> Right? > >>>> And you are wondering why you are getting timeouts? > >>>> ... > >>>> ... > >>>> And how long does it take to do a full table scan? ;-) > >>>> (there's more, but that's the first thing you should see...) > >>>> > >>>> Try creating a second table where you invert the URL and key pair such > >>>> that for each URL, you have a set of your alpha table's keys? > >>>> > >>>> Then you have the following... > >>>> FOREACH row IN TABLE alpha: > >>>> FETCH key-set FROM beta > >>>> WHERE beta.rowkey = alpha.url > >>>> > >>>> Note I use FETCH to signify that you should get a single row in > >> response. > >>>> > >>>> Does this make sense? > >>>> ( your second table is actually and index of the URL column in your > >> first > >>>> table) > >>>> > >>>> HTH > >>>> > >>>> Sent from a remote device. Please excuse any typos... > >>>> > >>>> Mike Segel > >>>> > >>>> On Apr 19, 2012, at 5:43 AM, Narendra yadala < > narendra.yad...@gmail.com > >>> > >>>> wrote: > >>>> > >>>>> I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop > >>>> (4*32 > >>>>> GB RAM and 4*6 TB disk space) cluster. We are using Cloudera > >> distribution > >>>>> for maintaining our cluster. I have a single tweets table in which we > >>>> store > >>>>> the tweets, one tweet per row (it has millions of rows currently). > >>>>> > >>>>> Now I try to run a Java batch (not a map reduce) which does the > >>>> following : > >>>>> > >>>>> 1. Open a scanner over the tweet table and read the tweets one after > >>>>> another. I set scanner caching to 128 rows as higher scanner caching > >> is > >>>>> leading to ScannerTimeoutExceptions. I scan over the first 10k rows > >>>> only. > >>>>> 2. For each tweet, extract URLs (linkcolfamily:urlvalue) that are > >> there > >>>>> in that tweet and open another scanner over the tweets table to see > >> who > >>>>> else shared that link. This involves getting rows having that URL > from > >>>> the > >>>>> entire table (not first 10k rows). > >>>>> 3. Do similar stuff as in step 2 for hashtags > >>>>> (hashtagcolfamily:hashtagvalue). > >>>>> 4. Do steps 1-3 in parallel for approximately 7-8 threads. This > number > >>>>> can be higher (thousands also) later. > >>>>> > >>>>> > >>>>> When I run this batch I got the GC issue which is specified here > >>>>> > >>>> > >> > http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ > >>>>> Then I tried to turn on the MSLAB feature and changed the GC settings > >> by > >>>>> specifying -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC JVM flags. > >>>>> Even after doing this, I am running into all kinds of IOExceptions > >>>>> and SocketTimeoutExceptions. > >>>>> > >>>>> This Java batch opens approximately 7*2 (14) scanners open at a point > >> in > >>>>> time and still I am running into all kinds of troubles. I am > wondering > >>>>> whether I can have thousands of parallel scanners with HBase when I > >> need > >>>> to > >>>>> scale. > >>>>> > >>>>> It would be great to know whether I can open thousands/millions of > >>>> scanners > >>>>> in parallel with HBase efficiently. > >>>>> > >>>>> Thanks > >>>>> Narendra > >>>> > >> > >