RE: Scan vs Put vs Get
Hi,

You can also check the cache hit and cache miss statistics that appear on the UI. In your random scan, how many regions are scanned? In your gets it may be many, due to the randomness.

Regards,
Ram

-----Original Message-----
From: N Keywal [mailto:nkey...@gmail.com]
Sent: Thursday, June 28, 2012 2:00 PM
To: user@hbase.apache.org
Subject: Re: Scan vs Put vs Get

Hi Jean-Marc,

Interesting :-) Adding to Anoop's questions:

What's the HBase version you're using?

Is it repeatable, i.e. if you try the same gets twice with the same client, do you get the same results? I'm asking because the client caches the region locations. If the locations are wrong (a region moved) you will have a retry loop, and it includes a sleep. Do you have anything in the logs?

Could you share as well the code you're using to get the ~100 ms time?

Cheers,
N.

On Thu, Jun 28, 2012 at 6:56 AM, Anoop Sam John anoo...@huawei.com wrote:

Hi,

How many Gets do you batch together in one call? Is this equal to the Scan#setCaching() value that you are using? If both are the same, you can be sure that the number of network calls is almost the same.

Also, you are passing random keys to the Gets, while the scan is always sequential. It seems that in your get scenario the reads are very random, resulting in too many reads of HFile blocks from HDFS. [Is block caching enabled?]

Also, have you tried using Bloom filters? ROW blooms might improve your get performance.

-Anoop-

From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
Sent: Thursday, June 28, 2012 5:04 AM
To: user
Subject: Scan vs Put vs Get

Hi,

I have a small piece of code, for testing, which is putting 1B lines in an existing table, getting 3000 lines and scanning 10000. The table has one family and one column. Everything is done randomly: the put uses a random key (24 bytes) with fixed family and column names and random content (24 bytes); the get (batch) is done with random keys; and the scan uses RandomRowFilter. And here are the results.

Time to insert 1000000 lines: 43 seconds (23255 lines/second). That's correct for my needs, given the poor performance of the servers in the cluster. I'm fine with these results.

Time to read 3000 lines: 11444.0 mseconds (262 lines/second). This is way too low, and I don't understand why. So I tried the random scan, because I'm not able to figure out the issue.

Time to read 10000 lines: 108.0 mseconds (92593 lines/second). This is impressive! I added this after I failed with the gets. I moved from 262 lines/second to almost 100K lines/second! It's awesome! However, I'm still wondering what's wrong with my gets.

The code is very simple. I'm using Get objects that I'm executing in a batch. I tried to add a filter, but it's not helping. Here is an extract of the code:

    for (long l = 0; l < linesToRead; l++) {
      byte[] array1 = new byte[24];
      for (int i = 0; i < array1.length; i++)
        array1[i] = (byte) Math.floor(Math.random() * 256);
      Get g = new Get(array1);
      gets.addElement(g);
    }
    Object[] results = new Object[gets.size()];
    System.out.println(new java.util.Date() + " \"gets\" created.");
    long timeBefore = System.currentTimeMillis();
    table.batch(gets, results);
    long timeAfter = System.currentTimeMillis();
    float duration = timeAfter - timeBefore;
    System.out.println("Time to read " + gets.size() + " lines : " + duration
        + " mseconds (" + Math.round((float) linesToRead / (duration / 1000))
        + " lines/seconds)");

What's wrong with it? I can't call setBatch, nor can I call setCaching, because it's not a scan. I tried with different numbers of gets, but it's almost always the same speed. Am I using it the wrong way? Does anyone have any advice to improve that?

Thanks,

JM
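As an aside on the key generation above: the Math.floor(Math.random() * 256) loop works, but java.util.Random#nextBytes does the same thing in one call and allows a seed for repeatable test runs. A minimal pure-Java sketch (the class and method names are illustrative, not from the thread):

```java
import java.util.Random;

public class RandomKeys {
    // Build one 24-byte random row key, equivalent to the
    // Math.floor(Math.random() * 256) loop in the original post.
    static byte[] buildRandomKey(Random rnd) {
        byte[] key = new byte[24];
        rnd.nextBytes(key); // fills every byte with a random value
        return key;
    }

    public static void main(String[] args) {
        byte[] key = buildRandomKey(new Random());
        System.out.println("key length = " + key.length);
    }
}
```

A seeded Random makes the benchmark replay the same key set, which helps when comparing runs.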
RE: Scan vs Put vs Get
In 0.94, the UI of the RS has a metrics table. In it you can see blockCacheHitCount, blockCacheMissCount, etc. Maybe there is a variation between your scan() and get() there.

Regards,
Ram

-----Original Message-----
From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
Sent: Thursday, June 28, 2012 4:44 PM
To: user@hbase.apache.org
Subject: Re: Scan vs Put vs Get

Wow. First, thanks a lot all for jumping into this. Let me try to reply to everyone in a single post.

> How many Gets do you batch together in one call?

I tried with multiple different values from 10 to 3000, with similar results:

    Time to read 10 lines : 181.0 mseconds (55 lines/seconds)
    Time to read 100 lines : 484.0 mseconds (207 lines/seconds)
    Time to read 1000 lines : 4739.0 mseconds (211 lines/seconds)
    Time to read 3000 lines : 13582.0 mseconds (221 lines/seconds)

> Is this equal to the Scan#setCaching() value that you are using?

The scan call is done after the get test, so I can't set the cache for the scan before I do the gets. Also, I tried to run them separately (one time only the put, one time only the get, etc.), and I did not find a way to set up a cache for the gets.

> If both are the same, you can be sure that the number of network calls is almost the same.

Here are the results for 10 000 gets and 10 000 scan.next() calls. Each time I access the result, to be sure the rows are sent to the client.

    (gets) Time to read 10000 lines : 36620.0 mseconds (273 lines/seconds)
    (scan) Time to read 10000 lines : 119.0 mseconds (84034 lines/seconds)

> [Is block caching enabled?]

Good question. I don't know :( Is it enabled by default? How can I verify or activate it?

> Also, have you tried using Bloom filters?

Not yet. They are on page 381 of Lars' book and I'm only on page 168 ;)

> What's the HBase version you're using?

I manually installed 0.94.0. I can try with another version.

> Is it repeatable?

Yes. I tried many, many times, adding some options, closing some processes on the server side, removing one datanode, adding one, etc. I can see some small variations, but still in the same range. I was able to move from 200 rows/second to 300 rows/second, but that's not really a significant improvement. Also, here are the results for 7 iterations of the same code:

    Time to read 1000 lines : 4171.0 mseconds (240 lines/seconds)
    Time to read 1000 lines : 3439.0 mseconds (291 lines/seconds)
    Time to read 1000 lines : 3953.0 mseconds (253 lines/seconds)
    Time to read 1000 lines : 3801.0 mseconds (263 lines/seconds)
    Time to read 1000 lines : 3680.0 mseconds (272 lines/seconds)
    Time to read 1000 lines : 3493.0 mseconds (286 lines/seconds)
    Time to read 1000 lines : 4549.0 mseconds (220 lines/seconds)

> If the locations are wrong (region moved) you will have a retry loop

I have one dead region server. It's a server I brought down a few days ago because it was too slow, but it's still on the HBase web interface. However, if I look at the table, there is no table region hosted on this server. Hadoop was also removed from it, so it's showing one dead node.

> Do you have anything in the logs?

Nothing special. Only some block cache LRU eviction entries.

> Could you share as well the code?

Everything is at the end of this post.

> You can also check the cache hit and cache miss statistics that appear on the UI.

Can you please tell me how I can find that? I was not able to find it on the web UI. Where should I look?

> In your random scan, how many regions are scanned?

I only have 5 region servers and 12 table regions, so I guess all the servers are called.

So here is the code for the gets. I removed the KeyOnlyFilter because it's not improving the results.

JM

http://pastebin.com/K75nFiQk (for syntax highlighting)

    HTable table = new HTable(config, "test3");
    for (int iteration = 0; iteration < 10; iteration++) {
      final int linesToRead = 1000;
      System.out.println(new java.util.Date() + " Processing iteration " + iteration + "...");
      Vector<Get> gets = new Vector<Get>(linesToRead);
      for (long l = 0; l < linesToRead; l++) {
        byte[] array1 = new byte[24];
        for (int i = 0; i < array1.length; i++)
          array1[i] = (byte) Math.floor(Math.random() * 256);
        Get g = new Get(array1);
        gets.addElement(g);
        processed++;
      }
      Object[] results = new Object[gets.size()];
      long timeBefore = System.currentTimeMillis();
      table.batch(gets, results);
      long timeAfter = System.currentTimeMillis();
      float duration = timeAfter - timeBefore;
      System.out.println("Time to read " + gets.size() + " lines : " + duration
          + " mseconds (" + Math.round((float) linesToRead / (duration / 1000))
          + " lines/seconds)");
      for (int i = 0; i < results.length; i++) {
        if (results[i] instanceof KeyValue)
          if (!((KeyValue) results[i]).isEmptyColumn())
            System.out.println("Result[" + i + "]: " + results[i]); // co BatchExample-9-Dump
      }
    }
Re: Scan vs Put vs Get
Oh! I never looked at this part ;) OK, I have it. Here are the numbers for one server before the read:

    blockCacheSizeMB=186.28
    blockCacheFreeMB=55.4
    blockCacheCount=2923
    blockCacheHitCount=195999
    blockCacheMissCount=89297
    blockCacheEvictedCount=69858
    blockCacheHitRatio=68%
    blockCacheHitCachingRatio=72%

And here are the numbers after 100 iterations of 1000 gets for the same server:

    blockCacheSizeMB=194.44
    blockCacheFreeMB=47.25
    blockCacheCount=3052
    blockCacheHitCount=232034
    blockCacheMissCount=103250
    blockCacheEvictedCount=83682
    blockCacheHitRatio=69%
    blockCacheHitCachingRatio=72%

Don't forget that there are between 40B and 50B lines in the table, so I don't think the servers can store all of them in memory. And since I'm accessing rows based on a random key, I think the odds of having the right row in memory are small.

JM

2012/6/28, Ramkrishna.S.Vasudevan ramkrishna.vasude...@huawei.com:
> [snip: quoted text from earlier in the thread]
RE: Scan vs Put vs Get
> blockCacheHitRatio=69%

It seems you are getting the blocks from cache. You can check with blooms also. You can enable the usage of blooms with the config param io.storefile.bloom.enabled set to true; this enables bloom usage globally. Then you need to set the bloom type for your CF via HColumnDescriptor#setBloomFilterType(). You can check with type BloomType.ROW.

-Anoop-
________________________________________
From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
Sent: Thursday, June 28, 2012 5:42 PM
To: user@hbase.apache.org
Subject: Re: Scan vs Put vs Get

> [snip: quoted text from earlier in the thread]
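Anoop's suggestion could be sketched as follows against the 0.94 admin API. This is a sketch only: the table name "test3" comes from the thread, the family name "cf1" is a placeholder, and a running cluster with io.storefile.bloom.enabled=true is assumed.

```java
// Sketch: enable ROW bloom filters on an existing column family (HBase 0.94 API).
HBaseAdmin admin = new HBaseAdmin(config);
admin.disableTable("test3");                          // schema changes need the table offline
HColumnDescriptor cf = new HColumnDescriptor("cf1");  // "cf1" is a placeholder family name
cf.setBloomFilterType(StoreFile.BloomType.ROW);       // row-level blooms, per Anoop's suggestion
admin.modifyColumn("test3", cf);
admin.enableTable("test3");
// Blooms apply to store files written after the change,
// e.g. after a major compaction rewrites the existing ones.
```

Row blooms let a get skip store files that provably do not contain the requested row key, which is exactly the random-read pattern in this thread.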
Re: Scan vs Put vs Get
Hi Anoop,

Are Bloom filters for columns? If I add g.setFilter(new KeyOnlyFilter()), does that mean I can't use bloom filters?

Basically, what I'm doing here is something like existKey(byte[]):boolean, where I try to see if a key exists in the database without taking into consideration whether there is any column content or not. This should be very fast, even faster than the scan, which needs to keep track of where I'm reading for the next row.

JM

2012/6/28, Anoop Sam John anoo...@huawei.com:
> [snip: quoted text from earlier in the thread]
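A minimal sketch of that existKey idea, assuming the 0.94 client API: HTable#exists(Get) performs the presence check server side without shipping the row back, which matches the "key exists, ignore content" requirement. The helper name and plumbing are illustrative, not from the thread.

```java
// Sketch only: assumes a running cluster and an open HTable handle.
boolean existKey(HTable table, byte[] row) throws IOException {
    Get g = new Get(row);
    g.setFilter(new KeyOnlyFilter()); // keys only; no cell values needed
    return table.exists(g);           // server-side existence check, no row transfer
}
```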
Re: Scan vs Put vs Get
Hi N Keywal,

This result:

    Time to read 10000 lines : 122.0 mseconds (81967 lines/seconds)

is obtained with this code:

    HTable table = new HTable(config, "test3");
    final int linesToRead = 10000;
    System.out.println(new java.util.Date() + " Processing iteration " + iteration + "...");
    RandomRowFilter rrf = new RandomRowFilter();
    KeyOnlyFilter kof = new KeyOnlyFilter();
    Scan scan = new Scan();
    scan.setFilter(rrf);
    scan.setFilter(kof);
    scan.setBatch(Math.min(linesToRead, 1000));
    scan.setCaching(Math.min(linesToRead, 1000));
    ResultScanner scanner = table.getScanner(scan);
    processed = 0;
    long timeBefore = System.currentTimeMillis();
    for (Result result : scanner.next(linesToRead)) {
      if (result != null)
        processed++;
    }
    scanner.close();
    long timeAfter = System.currentTimeMillis();
    float duration = timeAfter - timeBefore;
    System.out.println("Time to read " + linesToRead + " lines : " + duration
        + " mseconds (" + Math.round((float) linesToRead / (duration / 1000))
        + " lines/seconds)");
    table.close();

This is with the scan. To summarize:

    scan: 80 000 lines/second
    put: 20 000 lines/second
    get: 300 lines/second

2012/6/28, Jean-Marc Spaggiari jean-m...@spaggiari.org:
> [snip: quoted text from earlier in the thread]
Re: Scan vs Put vs Get
Thank you. It's clearer now. From the code you sent, RandomRowFilter is not used: you're only using the KeyOnlyFilter (the second setFilter replaces the first one; you need something like FilterList to combine filters). Note as well that you would need to initialize RandomRowFilter#chance, otherwise all the rows will be filtered out. So, in one case -the list of gets-, you're reading a well-defined set of rows (defined randomly, but well defined :-), and this set spreads all over the regions. In the second one (KeyOnlyFilter), you're reading the first 10K rows you could get from the cluster. This explains the difference between the results. Activating RandomRowFilter should not change the results much, as selecting a random subset of the rows you scan is different from fetching a set of rows whose keys were chosen randomly (don't know if I'm clear here...). Unfortunately you're likely to be more interested in the performance when there is a real selection. Your code for the list of gets was correct imho. I'm interested in the results if you activate bloom filters. Cheers, N.

On Thu, Jun 28, 2012 at 3:45 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:

Hi N Keywal, This result:
Time to read 10000 lines : 122.0 mseconds (81967 lines/seconds)
is obtained with this code:

HTable table = new HTable(config, "test3");
final int linesToRead = 10000;
System.out.println(new java.util.Date() + " Processing iteration " + iteration + "...");
RandomRowFilter rrf = new RandomRowFilter();
KeyOnlyFilter kof = new KeyOnlyFilter();
Scan scan = new Scan();
scan.setFilter(rrf);
scan.setFilter(kof);
scan.setBatch(Math.min(linesToRead, 1000));
scan.setCaching(Math.min(linesToRead, 1000));
ResultScanner scanner = table.getScanner(scan);
processed = 0;
long timeBefore = System.currentTimeMillis();
for (Result result : scanner.next(linesToRead)) {
    if (result != null) processed++;
}
scanner.close();
long timeAfter = System.currentTimeMillis();
float duration = timeAfter - timeBefore;
System.out.println("Time to read " + linesToRead + " lines : " + duration + " mseconds (" + Math.round(((float) linesToRead / (duration / 1000))) + " lines/seconds)");
table.close();

This is with the scan. To summarize:
scan 80 000 lines/seconds
put 20 000 lines/seconds
get 300 lines/seconds

2012/6/28, Jean-Marc Spaggiari jean-m...@spaggiari.org:

Hi Anoop, Are Bloom filters for columns? If I add g.setFilter(new KeyOnlyFilter()), does that mean I can't use bloom filters? Basically, what I'm doing here is something like existKey(byte[]):boolean where I try to see if a key exists in the database without taking into consideration whether there is any column content or not. This should be very fast, even faster than the scan, which needs to keep track of where I'm reading for the next row. JM

2012/6/28, Anoop Sam John anoo...@huawei.com:

blockCacheHitRatio=69%: it seems you are getting the blocks from cache. You can check with Blooms also once. You can enable the usage of blooms using the config param io.storefile.bloom.enabled set to true. This enables bloom usage globally. Then you need to set the bloom type for your CF: HColumnDescriptor#setBloomFilterType(). You can check with type BloomType.ROW. -Anoop-

_ From: Jean-Marc Spaggiari [jean-m...@spaggiari.org] Sent: Thursday, June 28, 2012 5:42 PM To: user@hbase.apache.org Subject: Re: Scan vs Put vs Get

Oh! I never looked at this part ;) Ok. I have it.
Here are the numbers for one server before the read:
blockCacheSizeMB=186.28, blockCacheFreeMB=55.4, blockCacheCount=2923, blockCacheHitCount=195999, blockCacheMissCount=89297, blockCacheEvictedCount=69858, blockCacheHitRatio=68%, blockCacheHitCachingRatio=72%

And here are the numbers after 100 iterations of 1000 gets for the same server:
blockCacheSizeMB=194.44, blockCacheFreeMB=47.25, blockCacheCount=3052, blockCacheHitCount=232034, blockCacheMissCount=103250, blockCacheEvictedCount=83682, blockCacheHitRatio=69%, blockCacheHitCachingRatio=72%

Don't forget that there are between 40B and 50B lines in the table, so I don't think the servers can store all of them in memory. And since I'm accessing rows based on a random key, the odds of having the right row in memory are small, I think. JM

2012/6/28, Ramkrishna.S.Vasudevan ramkrishna.vasude...@huawei.com:

In 0.94, the UI of the RS has a metrics table where you can see blockCacheHitCount, blockCacheMissCount, etc. Maybe there is a variation between scan() and get() here. Regards Ram

-----Original Message----- From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org] Sent: Thursday, June 28, 2012 4:44 PM To: user@hbase.apache.org Subject: Re: Scan vs Put vs Get

Wow. First, thanks a lot all for jumping into this. Let me try to reply to everyone in a single post.

> How many Gets you batch together in one call?

I tried multiple different values from 10 to 3000 with similar results.

Time to read
Re: Scan vs Put vs Get
Oh! I see! KeyOnlyFilter is overwriting the RandomRowFilter! Bad. I mean, bad that I did not figure that out. Thanks for pointing it out. That definitely explains the difference in performance.

I have activated the bloom filters with this code:

HBaseAdmin admin = new HBaseAdmin(config);
HTable table = new HTable(config, "test3");
System.out.println(table.getTableDescriptor().getColumnFamilies()[0]);
HColumnDescriptor cd = table.getTableDescriptor().getColumnFamilies()[0];
cd.setBloomFilterType(BloomType.ROW);
admin.disableTable("test3");
admin.modifyColumn("test3", cd);
admin.enableTable("test3");
System.out.println(table.getTableDescriptor().getColumnFamilies()[0]);

And here is the result for the first attempt (using gets):

{NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}
{NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}

Thu Jun 28 11:08:59 EDT 2012 Processing iteration 0...
Time to read 1000 lines : 40177.0 mseconds (25 lines/seconds)
2nd: Time to read 1000 lines : 7621.0 mseconds (131 lines/seconds)
3rd: Time to read 1000 lines : 7659.0 mseconds (131 lines/seconds)

After a few more iterations (about 30), I'm between 200 and 250 lines/seconds, like before.

Regarding the FilterList, I tried it, but now I'm getting this error from the servers:
org.apache.hadoop.hbase.regionserver.LeaseException: org.apache.hadoop.hbase.regionserver.LeaseException: lease '-6376193724680783311' does not exist

Here is the code:

final int linesToRead = 10000;
System.out.println(new java.util.Date() + " Processing iteration " + iteration + "...");
RandomRowFilter rrf = new RandomRowFilter();
KeyOnlyFilter kof = new KeyOnlyFilter();
Scan scan = new Scan();
List<Filter> filters = new ArrayList<Filter>();
filters.add(rrf);
filters.add(kof);
FilterList filterList = new FilterList(filters);
scan.setFilter(filterList);
scan.setBatch(Math.min(linesToRead, 1000));
scan.setCaching(Math.min(linesToRead, 1000));
ResultScanner scanner = table.getScanner(scan);
processed = 0;
long timeBefore = System.currentTimeMillis();
for (Result result : scanner.next(linesToRead)) {
    System.out.println("Result: " + result);
    // if (result != null) processed++;
}
scanner.close();

It's failing when I try to do for (Result result : scanner.next(linesToRead)). I tried with linesToRead=1000, 100, 10 and 1 with the same result :( I will try to find the root cause, but if you have any hint, it's welcome. JM
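For context on what the ROW bloom enabled above can buy: a Bloom filter lets a get skip store files that cannot contain the key, at the cost of a small false-positive rate. The classic estimate for that rate, sketched below in standalone Java (the class and figures are illustrative, not from the thread; ~9.6 bits per key with 7 hashes is a common rule of thumb for roughly 1%):

```java
public class BloomMath {
    // Classic Bloom-filter false-positive estimate: p = (1 - e^(-k/bitsPerKey))^k,
    // for k hash functions and m/n = bitsPerKey bits per stored key.
    static double fpRate(double bitsPerKey, int k) {
        return Math.pow(1 - Math.exp(-k / bitsPerKey), k);
    }

    public static void main(String[] args) {
        // ~9.6 bits/key with 7 hashes gives a false-positive rate of about 1%
        System.out.println(fpRate(9.6, 7));
    }
}
```

So with random keys that are almost never present, a ROW bloom should let ~99% of the gets avoid touching store files at all; note the gain only shows up once the store files have been rewritten (compacted) with bloom data.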
Re: Scan vs Put vs Get
For the filter list my guess is that you're filtering out all rows because RandomRowFilter#chance is not initialized (it should be something like RandomRowFilter rrf = new RandomRowFilter(0.5f);). But note that this test will never be comparable to the test with a list of gets: you can make it as slow or as fast as you want by playing with the 'chance' parameter. The results with gets and bloom filters are also in the interesting category; hopefully an expert will get in the loop...

On Thu, Jun 28, 2012 at 6:04 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
> [...]
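The chance pitfall above can be shown with a few lines of standalone Java (this is an illustrative stand-in, not the real HBase filter class, assuming RandomRowFilter keeps a row when a uniform draw falls below 'chance'): with the field left at its 0.0f default, no row ever passes, which is consistent with a scan that churns through everything and returns nothing until the lease expires.

```java
import java.util.Random;

// Illustrative stand-in for RandomRowFilter's selection rule:
// a row is kept when a uniform [0,1) draw is below 'chance'.
class ChanceFilter {
    private final float chance;
    private final Random rand = new Random(42); // fixed seed for repeatability

    ChanceFilter(float chance) { this.chance = chance; }

    // Count how many of 'rows' rows the filter would let through.
    int countIncluded(int rows) {
        int kept = 0;
        for (int i = 0; i < rows; i++) {
            if (rand.nextFloat() < chance) kept++;
        }
        return kept;
    }
}

public class ChanceDemo {
    public static void main(String[] args) {
        // Uninitialized chance behaves like 0.0f: nothing passes.
        System.out.println(new ChanceFilter(0.0f).countIncluded(10000));
        // chance = 0.5f keeps roughly half the rows.
        System.out.println(new ChanceFilter(0.5f).countIncluded(10000));
    }
}
```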
Re: Scan vs Put vs Get
Oh, sorry. You're right. You already said that and I forgot to update it. It's working fine when I add this parameter. And as you said, I can get the response time I want by playing with the chance: I get 34758 lines/seconds with 0.99 as the chance, and only 7564 lines/seconds with 0.09... But that's still better than the gets.

I just retried the gets, to see if the performance changes after many table accesses, but the results are still almost the same. I also tried to read 100 000 rows in a row with a random start key, and the performance is close to the random filter (35273 lines/seconds). So it's really the get which is giving me a headache...

2012/6/28, N Keywal nkey...@gmail.com:
> [...]
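One plausible reason the random gets stay slow is simple arithmetic on the block cache numbers reported earlier in the thread: with 64 KB blocks (BLOCKSIZE => 65536) the ~186 MB cache holds only about 3000 blocks, which matches the reported blockCacheCount of 2923, so a get with a uniformly random key almost always needs a fresh HDFS block read. A sketch of that arithmetic (standalone Java; the figures are copied from the thread):

```java
public class CacheMath {
    // Rough capacity of the block cache, expressed in data blocks.
    static long cachedBlocks(double cacheMB, int blockSizeKB) {
        return (long) (cacheMB * 1024 / blockSizeKB);
    }

    public static void main(String[] args) {
        // blockCacheSizeMB=186.28 and BLOCKSIZE => 65536 (64 KB) from the thread
        System.out.println(cachedBlocks(186.28, 64)); // close to blockCacheCount=2923
    }
}
```

The sequential scan, by contrast, reads many 24-byte rows out of each block it fetches, which is consistent with the two-orders-of-magnitude gap.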
RE: Scan vs Put vs Get
Hi, How many Gets do you batch together in one call? Is this equal to the Scan#setCaching() that you are using? If both are the same you can be sure that the number of NW calls is almost the same. Also, you are giving random keys in the Gets, while the scan will always be sequential. It seems in your get scenario the reads are very random, resulting in too many reads of HFile blocks from HDFS. [Block caching is enabled?] Also, have you tried using Bloom filters? ROW blooms might improve your get performance. -Anoop-

From: Jean-Marc Spaggiari [jean-m...@spaggiari.org] Sent: Thursday, June 28, 2012 5:04 AM To: user Subject: Scan vs Put vs Get

Hi, I have a small piece of code, for testing, which is putting 1B lines in an existing table, getting 3000 lines and scanning 10 000. The table has one family, one column. Everything is done randomly: put with a random key (24 bytes), fixed family and fixed column names, with random content (24 bytes). Get (batch) is done with random keys, and scan with RandomRowFilter. And here are the results.

Time to insert 1 000 000 lines: 43 seconds (23255 lines/seconds). That's correct for my needs given the poor performance of the servers in the cluster. I'm fine with these results.

Time to read 3000 lines: 11444.0 mseconds (262 lines/seconds). This is way too slow, and I don't understand why, so I tried the random scan because I'm not able to figure out the issue.

Time to read 10 000 lines: 108.0 mseconds (92593 lines/seconds). This is impressive! I added that after I failed with the gets. I moved from 262 lines per second to almost 100K lines/seconds!!! It's awesome! However, I'm still wondering what's wrong with my gets.

The code is very simple. I'm using Get objects that I'm executing in a batch. I tried to add a filter but it's not helping. Here is an extract of the code.
for (long l = 0; l < linesToRead; l++) {
    byte[] array1 = new byte[24];
    for (int i = 0; i < array1.length; i++)
        array1[i] = (byte) Math.floor(Math.random() * 256);
    Get g = new Get(array1);
    gets.addElement(g);
}
Object[] results = new Object[gets.size()];
System.out.println(new java.util.Date() + " \"gets\" created.");
long timeBefore = System.currentTimeMillis();
table.batch(gets, results);
long timeAfter = System.currentTimeMillis();
float duration = timeAfter - timeBefore;
System.out.println("Time to read " + gets.size() + " lines : " + duration + " mseconds (" + Math.round(((float) linesToRead / (duration / 1000))) + " lines/seconds)");

What's wrong with it? I can't use setBatch or setCaching because it's not a scan. I tried with different numbers of gets but it's almost always the same speed. Am I using it the wrong way? Does anyone have any advice to improve that? Thanks, JM
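A small side note on the key generation in the extract above: the per-byte Math.floor(Math.random() * 256) loop can be replaced with Random#nextBytes, which fills the whole array in one call. A sketch (standalone Java; the KeyGen/randomKey names are illustrative, not part of the original code):

```java
import java.util.Random;

public class KeyGen {
    // Build a 24-byte random row key, equivalent to the per-byte loop
    // but in a single call to nextBytes.
    static byte[] randomKey(Random rnd) {
        byte[] key = new byte[24];
        rnd.nextBytes(key);
        return key;
    }

    public static void main(String[] args) {
        System.out.println(randomKey(new Random()).length); // prints 24
    }
}
```

Passing a seeded Random also makes the benchmark repeatable across runs, which helps when comparing get timings before and after a config change.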