Oh! I never looked at this part ;) Ok. I have it. Here are the numbers for one server before the read:
blockCacheSizeMB=186.28 blockCacheFreeMB=55.4 blockCacheCount=2923
blockCacheHitCount=195999 blockCacheMissCount=89297
blockCacheEvictedCount=69858 blockCacheHitRatio=68%
blockCacheHitCachingRatio=72%

And here are the numbers after 100 iterations of 1000 gets for the same server:

blockCacheSizeMB=194.44 blockCacheFreeMB=47.25 blockCacheCount=3052
blockCacheHitCount=232034 blockCacheMissCount=103250
blockCacheEvictedCount=83682 blockCacheHitRatio=69%
blockCacheHitCachingRatio=72%

Don't forget that there are between 40B and 50B rows in the table, so I
don't think the servers can store all of them in memory. And since I'm
accessing based on a random key, the odds of having the right row in
memory are small, I think.

JM

2012/6/28, Ramkrishna.S.Vasudevan <ramkrishna.vasude...@huawei.com>:
> In 0.94
>
> The UI of the RS has a metrics table. In that you can see
> blockCacheHitCount, blockCacheMissCount etc. Maybe there is a variation
> when you do scan() and get() here.
>
> Regards
> Ram
>
>> -----Original Message-----
>> From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
>> Sent: Thursday, June 28, 2012 4:44 PM
>> To: user@hbase.apache.org
>> Subject: Re: Scan vs Put vs Get
>>
>> Wow. First, thanks a lot to all for jumping into this.
>>
>> Let me try to reply to everyone in a single post.
>>
>> > How many Gets you batch together in one call
>> I tried multiple different values from 10 to 3000, with similar
>> results.
>> Time to read 10 lines : 181.0 mseconds (55 lines/seconds)
>> Time to read 100 lines : 484.0 mseconds (207 lines/seconds)
>> Time to read 1000 lines : 4739.0 mseconds (211 lines/seconds)
>> Time to read 3000 lines : 13582.0 mseconds (221 lines/seconds)
>>
>> > Is this equal to the Scan#setCaching () that u are using?
>> The scan call is done after the get test, so I can't set the cache for
>> the scan before I do the gets. Also, I tried to run them separately
>> (one time only the put, one time only the get, etc.) and I did not
>> find a way to set up the cache for the get.
>>
>> > If both are same u can be sure that the number of NW calls is
>> coming almost same.
>> Here are the results for 10 000 gets and 10 000 scan.next(). Each time
>> I access the result to be sure they are sent to the client.
>> (gets) Time to read 10000 lines : 36620.0 mseconds (273 lines/seconds)
>> (scan) Time to read 10000 lines : 119.0 mseconds (84034 lines/seconds)
>>
>> > [Block caching is enabled?]
>> Good question. I don't know :( Is it enabled by default? How can I
>> verify or activate it?
>>
>> > Also have you tried using Bloom filters?
>> Not yet. They are on page 381 of Lars' book and I'm only on page 168 ;)
>>
>> > What's the hbase version you're using?
>> I manually installed 0.94.0. I can try with another version.
>>
>> > Is it repeatable?
>> Yes. I tried many, many times, adding some options, closing some
>> processes on the server side, removing one datanode, adding one, etc. I
>> can see some small variations, but still in the same range. I was able
>> to move from 200 rows/second to 300 rows/second, but that's not
>> really a significant improvement. Also, here are the results for 7
>> iterations of the same code.
>>
>> Time to read 1000 lines : 4171.0 mseconds (240 lines/seconds)
>> Time to read 1000 lines : 3439.0 mseconds (291 lines/seconds)
>> Time to read 1000 lines : 3953.0 mseconds (253 lines/seconds)
>> Time to read 1000 lines : 3801.0 mseconds (263 lines/seconds)
>> Time to read 1000 lines : 3680.0 mseconds (272 lines/seconds)
>> Time to read 1000 lines : 3493.0 mseconds (286 lines/seconds)
>> Time to read 1000 lines : 4549.0 mseconds (220 lines/seconds)
>>
>> > If the locations are wrong (region moved) you will have a retry loop
>> I have one dead region server. It's a server I brought down a few days
>> ago because it was too slow, but it's still on the HBase web interface.
>> However, if I look at the table, there is no table region hosted on
>> this server.
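For anyone reading along: blockCacheHitRatio in the metrics quoted above is just hits / (hits + misses) expressed as a whole percent. That it is truncated rather than rounded is an assumption on my part, but it is the interpretation that reproduces both quoted values (68.7% shows as 68%). A standalone sketch, plain Java with no HBase needed:

```java
public class HitRatio {
    // Block cache hit ratio as a truncated whole percent, the way it
    // is read off the region server metrics quoted above.
    static long ratioPercent(long hits, long misses) {
        return (100 * hits) / (hits + misses); // integer division truncates
    }

    public static void main(String[] args) {
        // Counters quoted above, before and after the 100 x 1000 gets:
        System.out.println(ratioPercent(195999, 89297));  // prints 68
        System.out.println(ratioPercent(232034, 103250)); // prints 69
    }
}
```

Note that blockCacheHitCachingRatio (72% in both snapshots) is a different counter, restricted to reads with caching enabled, and cannot be derived from the numbers above.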
>> Hadoop was also removed from it, so it's showing one dead node.
>>
>> > Do you have anything in the logs?
>> Nothing special. Only some "Block cache LRU eviction" entries.
>>
>> > Could you share as well the code
>> Everything is at the end of this post.
>>
>> > You can also check the cache hit and cache miss statistics that
>> appear on the UI?
>> Can you please tell me how I can find that? I was not able to find
>> that on the web UI. Where should I look?
>>
>> > In your random scan how many Regions are scanned
>> I only have 5 region servers and 12 table regions. So I guess all the
>> servers are called.
>>
>> So here is the code for the gets. I removed the KeyOnlyFilter because
>> it's not improving the results.
>>
>> JM
>>
>> http://pastebin.com/K75nFiQk (for syntax highlighting)
>>
>> HTable table = new HTable(config, "test3");
>>
>> for (int iteration = 0; iteration < 10; iteration++)
>> {
>>   final int linesToRead = 1000;
>>   System.out.println(new java.util.Date() + " Processing iteration "
>>       + iteration + "... ");
>>   Vector<Get> gets = new Vector<Get>(linesToRead);
>>
>>   for (long l = 0; l < linesToRead; l++)
>>   {
>>     byte[] array1 = new byte[24];
>>     for (int i = 0; i < array1.length; i++)
>>       array1[i] = (byte)Math.floor(Math.random() * 256);
>>     Get g = new Get(array1);
>>     gets.addElement(g);
>>
>>     processed++;
>>   }
>>   Object[] results = new Object[gets.size()];
>>
>>   long timeBefore = System.currentTimeMillis();
>>   table.batch(gets, results);
>>   long timeAfter = System.currentTimeMillis();
>>
>>   float duration = timeAfter - timeBefore;
>>   System.out.println("Time to read " + gets.size() + " lines : "
>>       + duration + " mseconds (" + Math.round(((float)linesToRead
>>       / (duration / 1000))) + " lines/seconds)");
>>
>>   for (int i = 0; i < results.length; i++)
>>   {
>>     if (results[i] instanceof KeyValue)
>>       if (!((KeyValue)results[i]).isEmptyColumn())
>>         System.out.println("Result[" + i + "]: " + results[i]);
>>         // co BatchExample-9-Dump Print all results.
>>   }
>> }
>>
>> 2012/6/28, Ramkrishna.S.Vasudevan <ramkrishna.vasude...@huawei.com>:
>> > Hi
>> >
>> > You can also check the cache hit and cache miss statistics that
>> > appear on the UI?
>> >
>> > In your random scan how many Regions are scanned, whereas in gets
>> > maybe many, due to randomness.
>> >
>> > Regards
>> > Ram
>> >
>> >> -----Original Message-----
>> >> From: N Keywal [mailto:nkey...@gmail.com]
>> >> Sent: Thursday, June 28, 2012 2:00 PM
>> >> To: user@hbase.apache.org
>> >> Subject: Re: Scan vs Put vs Get
>> >>
>> >> Hi Jean-Marc,
>> >>
>> >> Interesting.... :-)
>> >>
>> >> Added to Anoop's questions:
>> >>
>> >> What's the hbase version you're using?
>> >>
>> >> Is it repeatable? I mean, if you try the same "gets" twice with the
>> >> same client, do you have the same results? I'm asking because the
>> >> client caches the locations.
>> >>
>> >> If the locations are wrong (region moved) you will have a retry
>> >> loop, and it includes a sleep. Do you have anything in the logs?
>> >>
>> >> Could you share as well the code you're using to get the ~100 ms
>> >> time?
>> >>
>> >> Cheers,
>> >>
>> >> N.
>> >>
>> >> On Thu, Jun 28, 2012 at 6:56 AM, Anoop Sam John <anoo...@huawei.com>
>> >> wrote:
>> >> > Hi
>> >> > How many Gets do you batch together in one call? Is this equal to
>> >> > the Scan#setCaching() that u are using?
>> >> > If both are same u can be sure that the number of NW calls is
>> >> > coming almost same.
>> >> >
>> >> > Also you are giving random keys in the Gets. The scan will always
>> >> > be sequential. Seems in your get scenario it is very, very random
>> >> > reads, resulting in too many reads of HFile blocks from HDFS.
>> >> > [Block caching is enabled?]
>> >> >
>> >> > Also have you tried using Bloom filters? ROW blooms might improve
>> >> > your get performance.
>> >> >
>> >> > -Anoop-
>> >> > ________________________________________
>> >> > From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
>> >> > Sent: Thursday, June 28, 2012 5:04 AM
>> >> > To: user
>> >> > Subject: Scan vs Put vs Get
>> >> >
>> >> > Hi,
>> >> >
>> >> > I have a small piece of code, for testing, which is putting 1B
>> >> > lines in an existing table, getting 3000 lines and scanning 10000.
>> >> >
>> >> > The table is one family, one column.
>> >> >
>> >> > Everything is done randomly: put with a random key (24 bytes),
>> >> > fixed family and fixed column names, with random content (24
>> >> > bytes).
>> >> >
>> >> > Get (batch) is done with random keys, and scan with
>> >> > RandomRowFilter.
>> >> >
>> >> > And here are the results.
>> >> > Time to insert 1000000 lines: 43 seconds (23255 lines/seconds)
>> >> > That's correct for my needs, given the poor performance of the
>> >> > servers in the cluster. I'm fine with the results.
>> >> >
>> >> > Time to read 3000 lines: 11444.0 mseconds (262 lines/seconds)
>> >> > This is way too low, and I don't understand why. So I tried the
>> >> > random scan because I'm not able to figure out the issue.
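For reference, Anoop's two suggestions (block caching and ROW blooms) could be applied from the HBase shell roughly as below. This is only a sketch: the family name 'cf' is an assumption, since the thread never names the family, and in 0.94 BLOCKCACHE already defaults to true, so only the bloom filter line actually changes anything.

```
hbase> disable 'test3'
hbase> alter 'test3', {NAME => 'cf', BLOOMFILTER => 'ROW', BLOCKCACHE => 'true'}
hbase> enable 'test3'
```

The disable/enable pair is the safe path in 0.94, where online schema changes are not enabled by default; blooms only take effect on newly written HFiles (or after a major compaction).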
>> >> >
>> >> > Time to read 10000 lines : 108.0 mseconds (92593 lines/seconds)
>> >> > This is impressive! I added that after I failed with the gets. I
>> >> > moved from 262 lines per second to almost 100K lines/second!!!
>> >> > It's awesome!
>> >> >
>> >> > However, I'm still wondering what's wrong with my gets.
>> >> >
>> >> > The code is very simple. I'm using Get objects that I'm executing
>> >> > in a batch. I tried to add a filter but it's not helping. Here is
>> >> > an extract of the code.
>> >> >
>> >> > for (long l = 0; l < linesToRead; l++)
>> >> > {
>> >> >   byte[] array1 = new byte[24];
>> >> >   for (int i = 0; i < array1.length; i++)
>> >> >     array1[i] = (byte)Math.floor(Math.random() * 256);
>> >> >   Get g = new Get(array1);
>> >> >   gets.addElement(g);
>> >> > }
>> >> > Object[] results = new Object[gets.size()];
>> >> > System.out.println(new java.util.Date() + " \"gets\" created.");
>> >> > long timeBefore = System.currentTimeMillis();
>> >> > table.batch(gets, results);
>> >> > long timeAfter = System.currentTimeMillis();
>> >> >
>> >> > float duration = timeAfter - timeBefore;
>> >> > System.out.println("Time to read " + gets.size() + " lines : "
>> >> >     + duration + " mseconds (" + Math.round(((float)linesToRead
>> >> >     / (duration / 1000))) + " lines/seconds)");
>> >> >
>> >> > What's wrong with it? I can't add setBatch, nor setCaching,
>> >> > because it's not a scan. I tried with different numbers of gets,
>> >> > but it's almost always the same speed. Am I using it the wrong
>> >> > way? Does anyone have any advice to improve that?
>> >> >
>> >> > Thanks,
>> >> >
>> >> > JM
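The lines/seconds figures quoted throughout the thread all come from the Math.round computation shown in the snippets above. A standalone sketch reproducing two of the quoted numbers (the class name Throughput is mine; the arithmetic is taken verbatim from the thread's code):

```java
public class Throughput {
    // Same computation as the println in the thread's code:
    // round(lines / (durationMs / 1000)) gives lines per second.
    static long rate(int lines, float durationMs) {
        return Math.round((float) lines / (durationMs / 1000));
    }

    public static void main(String[] args) {
        System.out.println(rate(3000, 11444.0f)); // prints 262, as quoted
        System.out.println(rate(10000, 108.0f));  // prints 92593, as quoted
    }
}
```

Plugging in the scan result from the same email (10000 lines in 108 ms) against the get result (3000 lines in 11444 ms) makes the ~350x gap between sequential scanning and random batched gets concrete.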