Hi N Keywal,

This result:
Time to read 10000 lines : 122.0 mseconds (81967 lines/seconds)

is obtained with this code:

HTable table = new HTable(config, "test3");
final int linesToRead = 10000;
System.out.println(new java.util.Date() + " Processing iteration " + iteration + "... ");
RandomRowFilter rrf = new RandomRowFilter();
KeyOnlyFilter kof = new KeyOnlyFilter();
Scan scan = new Scan();
// Combine the two filters; calling setFilter() twice keeps only the last one.
FilterList filters = new FilterList();
filters.addFilter(rrf);
filters.addFilter(kof);
scan.setFilter(filters);
scan.setBatch(Math.min(linesToRead, 1000));
scan.setCaching(Math.min(linesToRead, 1000));
ResultScanner scanner = table.getScanner(scan);
processed = 0;
long timeBefore = System.currentTimeMillis();
for (Result result : scanner.next(linesToRead))
{
  if (result != null)
    processed++;
}
scanner.close();
long timeAfter = System.currentTimeMillis();
float duration = timeAfter - timeBefore;
System.out.println("Time to read " + linesToRead + " lines : " + duration
  + " mseconds (" + Math.round((float)linesToRead / (duration / 1000)) + " lines/seconds)");
table.close();

This is with the scan.

scan > 80 000 lines/seconds
put  > 20 000 lines/seconds
get  <    300 lines/seconds

2012/6/28, Jean-Marc Spaggiari <jean-m...@spaggiari.org>:
> Hi Anoop,
>
> Are Bloom filters for columns? If I add "g.setFilter(new
> KeyOnlyFilter());", does that mean I can't use bloom filters?
> Basically, what I'm doing here is something like
> "existKey(byte[]):boolean" where I try to see if a key exists in the
> database without taking into consideration whether there is any column
> content or not. This should be very fast. Even faster than the scan,
> which needs to keep track of where it's reading for the next row.
>
> JM
>
> 2012/6/28, Anoop Sam John <anoo...@huawei.com>:
>>> blockCacheHitRatio=69%
>> Seems the blocks you are reading are coming from the cache.
>> You can check with Blooms also once.
>>
>> You can enable the usage of bloom filters using the config param
>> "io.storefile.bloom.enabled" set to true.
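[The config param mentioned just above goes in hbase-site.xml. A minimal fragment might look like this (a sketch; in recent versions this param reportedly defaults to true, so it may already be enabled, and the per-family bloom type still has to be set separately):]

```xml
<property>
  <name>io.storefile.bloom.enabled</name>
  <value>true</value>
</property>
```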
>> This will enable the usage of bloom filters globally.
>> Now you need to set the bloom type for your CF:
>> HColumnDescriptor#setBloomFilterType(). You can check with type
>> BloomType.ROW.
>>
>> -Anoop-
>>
>> _____________________________________
>> From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
>> Sent: Thursday, June 28, 2012 5:42 PM
>> To: user@hbase.apache.org
>> Subject: Re: Scan vs Put vs Get
>>
>> Oh! I never looked at this part ;) Ok. I have it.
>>
>> Here are the numbers for one server before the read:
>>
>> blockCacheSizeMB=186.28
>> blockCacheFreeMB=55.4
>> blockCacheCount=2923
>> blockCacheHitCount=195999
>> blockCacheMissCount=89297
>> blockCacheEvictedCount=69858
>> blockCacheHitRatio=68%
>> blockCacheHitCachingRatio=72%
>>
>> And here are the numbers after 100 iterations of 1000 gets for the same
>> server:
>>
>> blockCacheSizeMB=194.44
>> blockCacheFreeMB=47.25
>> blockCacheCount=3052
>> blockCacheHitCount=232034
>> blockCacheMissCount=103250
>> blockCacheEvictedCount=83682
>> blockCacheHitRatio=69%
>> blockCacheHitCachingRatio=72%
>>
>> Don't forget that there are between 40B and 50B lines in the table,
>> so I don't think the servers can store all of them in memory. And
>> since I'm accessing based on a random key, the odds of having the right
>> row in memory are small, I think.
>>
>> JM
>>
>> 2012/6/28, Ramkrishna.S.Vasudevan <ramkrishna.vasude...@huawei.com>:
>>> In 0.94
>>>
>>> The UI of the RS has a metrics table. In that you can see
>>> blockCacheHitCount, blockCacheMissCount etc. Maybe there is a
>>> variation when you do scan() and get() here.
>>>
>>> Regards
>>> Ram
>>>
>>>> -----Original Message-----
>>>> From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
>>>> Sent: Thursday, June 28, 2012 4:44 PM
>>>> To: user@hbase.apache.org
>>>> Subject: Re: Scan vs Put vs Get
>>>>
>>>> Wow. First, thanks a lot all for jumping into this.
>>>>
>>>> Let me try to reply to everyone in a single post.
>>>>
>>>> > How many Gets you batch together in one call
>>>> I tried with multiple different values from 10 to 3000, with similar
>>>> results:
>>>> Time to read 10 lines : 181.0 mseconds (55 lines/seconds)
>>>> Time to read 100 lines : 484.0 mseconds (207 lines/seconds)
>>>> Time to read 1000 lines : 4739.0 mseconds (211 lines/seconds)
>>>> Time to read 3000 lines : 13582.0 mseconds (221 lines/seconds)
>>>>
>>>> > Is this equal to the Scan#setCaching() that you are using?
>>>> The scan call is done after the get test, so I can't set the cache for
>>>> the scan before I do the gets. Also, I tried to run them separately
>>>> (one time only the put, one time only the get, etc.) and I did not
>>>> find a way to set up the cache for the get.
>>>>
>>>> > If both are same u can be sure that the number of NW calls is
>>>> coming almost same.
>>>> Here are the results for 10 000 gets and 10 000 scan.next(). Each time
>>>> I access the result to be sure it is sent to the client.
>>>> (gets) Time to read 10000 lines : 36620.0 mseconds (273 lines/seconds)
>>>> (scan) Time to read 10000 lines : 119.0 mseconds (84034 lines/seconds)
>>>>
>>>> > [Block caching is enabled?]
>>>> Good question. I don't know :( Is it enabled by default? How can I
>>>> verify or activate it?
>>>>
>>>> > Also have you tried using Bloom filters?
>>>> Not yet. They are on page 381 of Lars' book and I'm only on page 168 ;)
>>>>
>>>> > What's the hbase version you're using?
>>>> I manually installed 0.94.0. I can try with another version.
>>>>
>>>> > Is it repeatable?
>>>> Yes. I tried many, many times, adding some options, closing some
>>>> processes on the server side, removing one datanode, adding one, etc.
>>>> I can see some small variations, but still in the same range. I was
>>>> able to move from 200 rows/second to 300 rows/second, but that's not
>>>> really a significant improvement. Also, here are the results for 7
>>>> iterations of the same code.
>>>>
>>>> Time to read 1000 lines : 4171.0 mseconds (240 lines/seconds)
>>>> Time to read 1000 lines : 3439.0 mseconds (291 lines/seconds)
>>>> Time to read 1000 lines : 3953.0 mseconds (253 lines/seconds)
>>>> Time to read 1000 lines : 3801.0 mseconds (263 lines/seconds)
>>>> Time to read 1000 lines : 3680.0 mseconds (272 lines/seconds)
>>>> Time to read 1000 lines : 3493.0 mseconds (286 lines/seconds)
>>>> Time to read 1000 lines : 4549.0 mseconds (220 lines/seconds)
>>>>
>>>> > If the locations are wrong (region moved) you will have a retry loop
>>>> I have one dead region server. It's a server I brought down a few days
>>>> ago because it was too slow, but it's still showing on the hbase web
>>>> interface. However, if I look at the table, there is no table region
>>>> hosted on this server. Hadoop was also removed from it, so it's
>>>> reporting one dead node.
>>>>
>>>> > Do you have anything in the logs?
>>>> Nothing special. Only some "Block cache LRU eviction" entries.
>>>>
>>>> > Could you share as well the code
>>>> Everything is at the end of this post.
>>>>
>>>> > You can also check the cache hit and cache miss statistics that
>>>> appears on the UI?
>>>> Can you please tell me how I can find that? I was not able to find it
>>>> on the web UI. Where should I look?
>>>>
>>>> > In your random scan how many Regions are scanned
>>>> I only have 5 region servers and 12 table regions, so I guess all the
>>>> servers are called.
>>>>
>>>> So here is the code for the gets. I removed the KeyOnlyFilter because
>>>> it's not improving the results.
>>>>
>>>> JM
>>>>
>>>> http://pastebin.com/K75nFiQk (for syntax highlighting)
>>>>
>>>> HTable table = new HTable(config, "test3");
>>>>
>>>> for (int iteration = 0; iteration < 10; iteration++)
>>>> {
>>>>   final int linesToRead = 1000;
>>>>   System.out.println(new java.util.Date() + " Processing iteration " +
>>>> iteration + "... ");
>>>>   Vector<Get> gets = new Vector<Get>(linesToRead);
>>>>
>>>>   for (long l = 0; l < linesToRead; l++)
>>>>   {
>>>>     byte[] array1 = new byte[24];
>>>>     for (int i = 0; i < array1.length; i++)
>>>>       array1[i] = (byte)Math.floor(Math.random() * 256);
>>>>     Get g = new Get(array1);
>>>>     gets.addElement(g);
>>>>
>>>>     processed++;
>>>>   }
>>>>   Object[] results = new Object[gets.size()];
>>>>
>>>>   long timeBefore = System.currentTimeMillis();
>>>>   table.batch(gets, results);
>>>>   long timeAfter = System.currentTimeMillis();
>>>>
>>>>   float duration = timeAfter - timeBefore;
>>>>   System.out.println("Time to read " + gets.size() + " lines : " +
>>>> duration + " mseconds (" + Math.round((float)linesToRead / (duration
>>>> / 1000)) + " lines/seconds)");
>>>>
>>>>   for (int i = 0; i < results.length; i++)
>>>>   {
>>>>     // batch() fills results with Result (or Throwable), not KeyValue
>>>>     if (results[i] instanceof Result)
>>>>       if (!((Result)results[i]).isEmpty())
>>>>         System.out.println("Result[" + i + "]: " + results[i]); // co
>>>> BatchExample-9-Dump Print all results.
>>>>   }
>>>> }
>>>>
>>>> 2012/6/28, Ramkrishna.S.Vasudevan <ramkrishna.vasude...@huawei.com>:
>>>> > Hi
>>>> >
>>>> > You can also check the cache hit and cache miss statistics that
>>>> appears on the UI?
>>>> >
>>>> > In your random scan how many Regions are scanned? Whereas in gets it
>>>> > may be many due to randomness.
>>>> >
>>>> > Regards
>>>> > Ram
>>>> >
>>>> >> -----Original Message-----
>>>> >> From: N Keywal [mailto:nkey...@gmail.com]
>>>> >> Sent: Thursday, June 28, 2012 2:00 PM
>>>> >> To: user@hbase.apache.org
>>>> >> Subject: Re: Scan vs Put vs Get
>>>> >>
>>>> >> Hi Jean-Marc,
>>>> >>
>>>> >> Interesting.... :-)
>>>> >>
>>>> >> Added to Anoop's questions:
>>>> >>
>>>> >> What's the hbase version you're using?
>>>> >>
>>>> >> Is it repeatable? I mean, if you try the same "gets" twice with the
>>>> >> same client, do you have the same results? I'm asking because the
>>>> >> client caches the locations.
>>>> >>
>>>> >> If the locations are wrong (region moved) you will have a retry
>>>> >> loop, and it includes a sleep. Do you have anything in the logs?
>>>> >>
>>>> >> Could you share as well the code you're using to get the ~100 ms
>>>> >> time?
>>>> >>
>>>> >> Cheers,
>>>> >>
>>>> >> N.
>>>> >>
>>>> >> On Thu, Jun 28, 2012 at 6:56 AM, Anoop Sam John <anoo...@huawei.com>
>>>> >> wrote:
>>>> >> > Hi
>>>> >> > How many Gets do you batch together in one call? Is this equal to
>>>> >> the Scan#setCaching() that you are using?
>>>> >> > If both are same u can be sure that the number of NW calls is
>>>> >> coming almost same.
>>>> >> >
>>>> >> > Also you are giving random keys in the Gets. The scan will always
>>>> >> be sequential. Seems in your get scenario it is very, very random
>>>> >> reads, resulting in too many reads of HFile blocks from HDFS.
>>>> >> [Block caching is enabled?]
>>>> >> >
>>>> >> > Also have you tried using Bloom filters? ROW blooms might improve
>>>> >> your get performance.
>>>> >> >
>>>> >> > -Anoop-
>>>> >> > ________________________________________
>>>> >> > From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
>>>> >> > Sent: Thursday, June 28, 2012 5:04 AM
>>>> >> > To: user
>>>> >> > Subject: Scan vs Put vs Get
>>>> >> >
>>>> >> > Hi,
>>>> >> >
>>>> >> > I have a small piece of code, for testing, which is putting 1B
>>>> >> lines
>>>> >> > in an existing table, getting 3000 lines and scanning 10000.
>>>> >> >
>>>> >> > The table is one family, one column.
>>>> >> >
>>>> >> > Everything is done randomly. Put with random key (24 bytes), fixed
>>>> >> > family and fixed column names with random content (24 bytes).
>>>> >> >
>>>> >> > Get (batch) is done with random keys, and scan with
>>>> >> RandomRowFilter.
>>>> >> >
>>>> >> > And here are the results.
>>>> >> > Time to insert 1000000 lines: 43 seconds (23255 lines/seconds)
>>>> >> > That's correct for my needs based on the poor performance of the
>>>> >> > servers in the cluster. I'm fine with the results.
>>>> >> >
>>>> >> > Time to read 3000 lines: 11444.0 mseconds (262 lines/seconds)
>>>> >> > This is way too low. I don't understand why. So I tried the random
>>>> >> scan
>>>> >> > because I'm not able to figure out the issue.
>>>> >> >
>>>> >> > Time to read 10000 lines: 108.0 mseconds (92593 lines/seconds)
>>>> >> > This is impressive! I added that after I failed with the get. I
>>>> >> > moved from 262 lines per second to almost 100K lines/second!!!
>>>> >> It's
>>>> >> > awesome!
>>>> >> >
>>>> >> > However, I'm still wondering what's wrong with my gets.
>>>> >> >
>>>> >> > The code is very simple. I'm using Get objects that I'm executing
>>>> >> in a
>>>> >> > Batch. I tried to add a filter but it's not helping. Here is an
>>>> >> > extract of the code.
>>>> >> >
>>>> >> > for (long l = 0; l < linesToRead; l++)
>>>> >> > {
>>>> >> >   byte[] array1 = new byte[24];
>>>> >> >   for (int i = 0; i < array1.length; i++)
>>>> >> >     array1[i] = (byte)Math.floor(Math.random() * 256);
>>>> >> >   Get g = new Get(array1);
>>>> >> >   gets.addElement(g);
>>>> >> > }
>>>> >> > Object[] results = new Object[gets.size()];
>>>> >> > System.out.println(new java.util.Date() + " \"gets\" created.");
>>>> >> > long timeBefore = System.currentTimeMillis();
>>>> >> > table.batch(gets, results);
>>>> >> > long timeAfter = System.currentTimeMillis();
>>>> >> >
>>>> >> > float duration = timeAfter - timeBefore;
>>>> >> > System.out.println("Time to read " + gets.size() + " lines : "
>>>> >> > + duration + " mseconds (" + Math.round((float)linesToRead /
>>>> >> > (duration / 1000)) + " lines/seconds)");
>>>> >> >
>>>> >> > What's wrong with it? I can't add setBatch, nor can I add
>>>> >> > setCaching, because it's not a scan. I tried with different
>>>> >> numbers of
>>>> >> > gets but it's almost always the same speed. Am I using it the
>>>> >> wrong
>>>> >> > way? Does anyone have any advice to improve that?
>>>> >> >
>>>> >> > Thanks,
>>>> >> >
>>>> >> > JM
>>>> >
>>>> >
>>>
>>>
>
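[As an aside, all the lines/seconds figures in this thread come from the same one-line formula used in the snippets above. A tiny standalone check (plain Java, no HBase needed) reproduces the reported scan and get rates:]

```java
public class ThroughputCheck {
    // Same formula as in the snippets above:
    // rate = round(linesToRead / (durationMs / 1000))
    static long rate(int linesToRead, float durationMs) {
        return Math.round((float) linesToRead / (durationMs / 1000));
    }

    public static void main(String[] args) {
        // 10000 lines in 122 ms, as reported for the scan
        System.out.println(rate(10000, 122.0f)); // prints 81967
        // 3000 lines in 13582 ms, as reported for the batched gets
        System.out.println(rate(3000, 13582.0f)); // prints 221
    }
}
```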