Re: Optimizing Multi Gets in hbase
I should qualify that statement, actually. I was comparing scanning 1m KVs to getting 1m KVs when all KVs are returned. As James Taylor pointed out to me privately: a fairer comparison would have been to run a scan with a filter that lets x% of the rows pass (i.e. the selectivity of the scan would be x%) and compare that to a multi Get of the same x% of the rows. There we found that a Scan+Filter is more efficient than issuing multi Gets if x is >= 1-2%. Or in other words, translating many Gets into a Scan+Filter is beneficial if the Scan would return at least 1-2% of the rows to the client. For example: if you are looking for fewer than 10-20k rows in 1m rows, using multi Gets is likely more efficient; if you are looking for more than 10-20k rows in 1m rows, using a Scan+Filter is likely more efficient. Of course this is predicated on having an efficient way to represent the rows you are looking for in a filter, so in practice this probably shifts slightly more towards Gets (just imagine a Filter that has to encode 100k random row keys to be matched; and since Filters are instantiated per store, there is another natural limit there). As I said below, the crux of the matter is having some histograms of your data, so that such a decision could be made automatically. -- Lars

From: lars hofhansl la...@apache.org To: user@hbase.apache.org Sent: Monday, February 18, 2013 5:48 PM Subject: Re: Optimizing Multi Gets in hbase

As it happens we did some tests around this last week. It turns out doing Gets in batches instead of a scan still gives you 1/3 of the performance: when you have a table with, say, 10m rows and scanning takes N seconds, then fetching the same 10m rows via Gets in batches of 1000 takes ~3N, which is pretty impressive. Now, this is with all data in the cache! When the data is not in the cache and the Gets are random, it is many orders of magnitude slower, as the Gets are sprayed all over the disk. In that case sorting the Gets and issuing scans would indeed be much more efficient. The Gets in a batch are already sorted on the client, but as N. says it is hard to determine automatically when to turn many Gets into a Scan with filters. Without statistics/histograms I'd even wager a guess that it is impossible to do. Imagine you issue one random Get, but your table has 10bn rows; in that case it is almost certain that the Get is faster than a scan. Now imagine the Gets only cover a small key range; with statistics we could tell whether it would be beneficial to turn them into a scan. It's not that hard to add statistics to HBase: we would do it as part of the compactions, and record the histograms in some table. You can always do that yourself. If you suspect you are touching most rows in a table/region, just issue a scan with an appropriate filter (you may have to implement your own filter, though). Maybe we could add a version of RowFilter that matches against multiple keys. -- Lars

From: Varun Sharma va...@pinterest.com To: user@hbase.apache.org Sent: Monday, February 18, 2013 1:57 AM Subject: Optimizing Multi Gets in hbase

Hi, I am trying to run batched get(s) on a cluster. Here is the code:

    List<Get> gets = ... // Prepare my gets with the rows I need
    myHTable.get(gets);

I have two questions about the above scenario: i) Is this the most optimal way to do this? ii) I have a feeling that if there are multiple gets in this case on the same region, then each one of those will instantiate separate scan(s) over the region, even though a single scan is sufficient. Am I mistaken here? Thanks, Varun
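[Editor's note: to make the two access patterns in Lars's comparison concrete, here is a minimal sketch against the 0.94-era client API. The class, table handle, and filter predicate are illustrative assumptions, not code from the benchmark.]

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.RowFilter;
    import org.apache.hadoop.hbase.filter.SubstringComparator;

    public class GetsVsScanFilter {
        // Multi Get: likely wins when you need well under ~1-2% of the rows.
        static Result[] multiGet(HTable table, List<byte[]> wantedRows) throws IOException {
            List<Get> gets = new ArrayList<Get>();
            for (byte[] row : wantedRows) {
                gets.add(new Get(row));
            }
            return table.get(gets); // one batched call, grouped by region server client-side
        }

        // Scan+Filter: likely wins when the filter passes >= ~1-2% of the rows.
        static void scanWithFilter(HTable table, byte[] startRow, byte[] stopRow)
                throws IOException {
            Scan scan = new Scan(startRow, stopRow);
            scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
                    new SubstringComparator("someKeyFragment"))); // stand-in predicate
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    // process r
                }
            } finally {
                scanner.close();
            }
        }
    }

Where the crossover sits in practice depends on cache hit rates and how cheaply the wanted keys can be encoded in the filter, per the discussion above.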
Re: HBase without compactions?
If you store data in LSM trees you need compactions. The advantage is that your data files are immutable. MapR has a mutable file system and they probably store their data in something more akin to B-Trees...? Or maybe they somehow avoid the expensive merge sorting of many small files. It seems that it has to be one or the other. (Maybe somebody from MapR reads this and can explain how it actually works.) Compactions let you trade random IO for sequential IO (just to state the obvious). It seems that you can't have it both ways. -- Lars

From: Otis Gospodnetic otis.gospodne...@gmail.com To: user@hbase.apache.org Sent: Monday, February 18, 2013 7:30 PM Subject: HBase without compactions?

Hello, It's kind of funny: we run SPM, which includes SPM for HBase (essentially a performance monitoring service/tool for HBase), and we currently store all performance metrics in HBase. I see a ton of HBase development activity, which is great, but it just occurred to me that I don't recall seeing anything about getting rid of compactions. Yet compactions are the one thing that I know hurts us the most, and the one thing that MapR somehow got rid of in their implementation. Have there been any discussions, attempts, or thoughts about finding a way to avoid compactions? Thanks, Otis -- HBASE Performance Monitoring - http://sematext.com/spm/index.html
Re: Optimizing Multi Gets in hbase
Looking at the code, it seems possible to do this server side within the multi invocation: we could group the gets by region, and do a single scan. We could also add some heuristics if necessary...

On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl la...@apache.org wrote: [snip -- quoted in full above]
Re: PreSplit the table with Long format
HBase shell is a JRuby shell and so you can invoke any Java commands from it. For example:

    import org.apache.hadoop.hbase.util.Bytes
    Bytes.toLong(Bytes.toBytes(1000))

Not sure if this works as expected since I don't have a terminal in front of me, but you could try (assuming the SPLITS keyword takes byte arrays as input; I have never used SPLITS from the command line):

    create 'testTable', 'cf1', { SPLITS => [ Bytes.toBytes(1000), Bytes.toBytes(2000), Bytes.toBytes(3000) ] }

Thanks, Viral

On Tue, Feb 19, 2013 at 1:52 AM, Farrokh Shahriari mohandes.zebeleh...@gmail.com wrote: Hi there. As I use rowkeys in long format, I must presplit the table in long format too. But when I run this command, it presplits the table in STRING format: create 'testTable','cf1',{SPLITS => [ '1000','2000','3000']} How can I presplit the table in Long format? Farrokh
Re: PreSplit the table with Long format
Tnx for your help, but it doesn't work. Do you have any other idea? Because I must run it from the shell. Farrokh

On Tue, Feb 19, 2013 at 1:30 PM, Viral Bajaria viral.baja...@gmail.com wrote: [snip -- quoted in full above]
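[Editor's note: since the shell route seems to coerce the splits to strings, one fallback (a hedged sketch against the 0.94-era Java admin API, not something tested in this thread) is to create the pre-split table from Java, where the split points can be genuine 8-byte longs.]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PresplitLongKeys {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTableDescriptor desc = new HTableDescriptor("testTable");
            desc.addFamily(new HColumnDescriptor("cf1"));
            // Split points as 8-byte big-endian longs, matching Bytes.toBytes(long) row keys.
            byte[][] splits = new byte[][] {
                Bytes.toBytes(1000L),
                Bytes.toBytes(2000L),
                Bytes.toBytes(3000L)
            };
            admin.createTable(desc, splits);
            admin.close();
        }
    }

Note the L suffix: Bytes.toBytes(1000L) selects the long overload, while Bytes.toBytes(1000) would produce a 4-byte int key that does not sort correctly against 8-byte long row keys.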
Re: storing lists in columns
Hi Jean-Marc, I've validated this, it works perfectly. Very easy to implement and it's very fast! Thankfully in this project there aren't a lot of lists in each table, so I won't have to create too many column families. In other scenarios it could be a problem. Many thanks, Stas

On 16 February 2013 02:29, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Stas, A few options come to mind. Quickly:

1) Why not store the products in specific columns instead of in the same one? Like:
table, rowid1, cf:list, c:aa, value:true
table, rowid1, cf:list, c:bb, value:true
table, rowid1, cf:list, c:cc, value:true
table, rowid2, cf:list, c:aabb, value:true
table, rowid2, cf:list, c:cc, value:true
That way when you do a search you query the right column for the right row directly. And using an exists call will also reduce the size of the data transferred.

2) You can store the data the opposite way. Like:
table, aa, cf:products, c:rowid1, value:true
table, aabb, cf:products, c:rowid2, value:true
table, bb, cf:products, c:rowid1, value:true
table, cc, cf:products, c:rowid1, value:true
table, cc, cf:products, c:rowid2, value:true
Here, you query by your product ID, and you search the column based on your previous rowid.

I would say the two solutions are equivalent, but it will really depend on your data pattern and your query pattern. JM

2013/2/15, Stas Maksimov maksi...@gmail.com: Hi all, I have a requirement to store lists in HBase columns like this:
table, rowid1, f:list, aa, bb, cc
table, rowid2, f:list, aabb, cc
There is a further requirement to be able to find rows where f:list contains a particular item, e.g. when I need to find rows having item aa only rowid1 should match, and for item cc both rowid1 and rowid2 should match. For now I decided to use SingleColumnValueFilter with substring matching. As using a comma-separated list proved difficult to search through, I'm using pipe symbols to separate items like this: |aa|bb|cc|, so that I can pass the search item surrounded by pipes into the filter: SingleColumnValueFilter ('f', 'list', =, 'substring:|aa|'). This proved to work effectively enough; however, I would prefer to use something more standard for my list storage (e.g. serialised JSON), or perhaps something even more optimised for search - performance really does matter here. Any opinions on this solution and possible enhancements are much appreciated. Many thanks, Stas
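[Editor's note: for reference, a minimal sketch of option 1 (one qualifier per list item) against the 0.94-era client API; the table, family, and item names follow the examples above, the rest is assumed.]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ListAsColumns {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "table");

            // Store the list {aa, bb} for rowid1: one qualifier per item.
            Put put = new Put(Bytes.toBytes("rowid1"));
            put.add(Bytes.toBytes("list"), Bytes.toBytes("aa"), Bytes.toBytes(true));
            put.add(Bytes.toBytes("list"), Bytes.toBytes("bb"), Bytes.toBytes(true));
            table.put(put);

            // Membership test: does rowid1's list contain "aa"?
            Get get = new Get(Bytes.toBytes("rowid1"));
            get.addColumn(Bytes.toBytes("list"), Bytes.toBytes("aa"));
            boolean contains = table.exists(get); // existence check, no value shipped back
            System.out.println(contains);
            table.close();
        }
    }

The exists() call is what Jean-Marc means by reducing data transfer: the server answers the membership question with a boolean instead of returning the cell.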
Re: storing lists in columns
Hi Stas, Don't forget that you should always try to keep the number of column families lower than 3, else you might face some performance issues. JM

2013/2/19, Stas Maksimov maksi...@gmail.com: [snip -- quoted in full above]
Table deleted after restart of computer
I just started with HBase. I created a table and filled it with some data. But after restarting my computer all the data was gone. This even happens when I stop HBase with stop-hbase.sh. How can this happen?
Re: Table deleted after restart of computer
Which HBase / Hadoop version were you using? Did you start the cluster in standalone mode? Thanks

On Tue, Feb 19, 2013 at 5:23 AM, Paul van Hoven paul.van.ho...@googlemail.com wrote: [snip -- quoted in full above]
Re: Table deleted after restart of computer
I installed HBase via brew: brew install hadoop hbase pig hive. Then I started HBase via the start-hbase.sh command. Therefore I'm pretty sure it is a standalone installation.

2013/2/19 Ted Yu yuzhih...@gmail.com: [snip -- quoted in full above]
Re: Table deleted after restart of computer
Hello Paul, The default location for HBase data is /tmp, so when you restart your machine the data will be deleted. You need to change it as per http://hbase.apache.org/book.html#quickstart -- Ibrahim

On Tue, Feb 19, 2013 at 5:54 PM, Ted Yu yuzhih...@gmail.com wrote: [snip -- quoted in full above]
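[Editor's note: concretely, the quickstart fix is to point hbase.rootdir at a durable location in conf/hbase-site.xml. A sketch for a standalone install; the path is just an example, and file:// URIs work in standalone mode.]

    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <value>file:///Users/youruser/hbase</value>
      </property>
    </configuration>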
Re: coprocessor enabled put very slow, help please~~~
A side question: if HTablePool is not encouraged to be used... how do we handle thread safety when using HTable? Any replacement for HTablePool in the plan? Thanks, Best Regards, Wei

From: Michel Segel michael_se...@hotmail.com To: user@hbase.apache.org Date: 02/18/2013 09:23 AM Subject: Re: coprocessor enabled put very slow, help please~~~

Why are you using an HTablePool? Why are you closing the table after each iteration through? Try using 1 HTable object. Turn off the WAL. Initiate in start(). Close in stop(). Surround the use in a try / catch. If an exception is caught, re-instantiate a new HTable connection. You may want to flush the connection after puts. Again, not sure why you are using check and put on the base table. Your count could be off. As an example look at the poem/rhyme 'Mary had a little lamb', then check your word count. Sent from a remote device. Please excuse any typos... Mike Segel

On Feb 18, 2013, at 7:21 AM, prakash kadel prakash.ka...@gmail.com wrote: Thank you guys for your replies. Michael, I think I didn't make it clear. Here is my use case: I have text documents to insert into HBase (with possible duplicates). Suppose I have a document such as: I am working. He is not working

I want to insert this document into a table in HBase, say table doc:
=doc table=
rowKey: doc_id, cf: doc_content, value: I am working. He is not working

Now, I want to create another table that stores the word counts, say doc_idx:
=doc_idx table=
rowKey: I, cf: count, value: 1
rowKey: am, cf: count, value: 1
rowKey: working, cf: count, value: 2
rowKey: He, cf: count, value: 1
rowKey: is, cf: count, value: 1
rowKey: not, cf: count, value: 1

My MR job code:
    if (doc.checkAndPut(rowKey, doc_content, , null, putDoc)) {
      for (String word : doc_content.split(\\s+)) {
        Increment inc = new Increment(Bytes.toBytes(word));
        inc.addColumn(count, , 1);
      }
    }

Now, I wanted to do some experiments with coprocessors, so I modified the code as follows.
My MR job code:
    doc.checkAndPut(rowKey, doc_content, , null, putDoc);
Coprocessor code:
    public void start(CoprocessorEnvironment env) {
      pool = new HTablePool(conf, 100);
    }
    public boolean postCheckAndPut(c, row, family, qualifier, compareOp, comparator, put, result) {
      if (!result) return true; // only index if the put succeeded
      HTableInterface table_idx = pool.getTable(doc_idx);
      try {
        for (KeyValue contentKV : put.get(doc_content, )) {
          for (String word : contentKV.getValue().split(\\s+)) {
            Increment inc = new Increment(Bytes.toBytes(word));
            inc.addColumn(count, , 1);
            table_idx.increment(inc);
          }
        }
      } finally {
        table_idx.close();
      }
      return true;
    }
    public void stop(env) {
      pool.close();
    }

I am a newbie to HBase and I am not sure this is the right way to do it. Given that, why is the coprocessor-enabled version much slower than the one without? Sincerely, Prakash Kadel

On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel michael_se...@hotmail.com wrote: The issue I was talking about was the use of a check and put. The OP wrote: "each map inserts to doc table (checkAndPut); regionobserver coprocessor does a postCheckAndPut and inserts some rows to an index table." My question is why does the OP use a checkAndPut, and the RegionObserver's postCheckAndPut? Here's a good example... http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put The OP doesn't really get in to the use case, so we don't know why the Check and Put is in the M/R job. He should just be using put() and then a postPut(). Another issue... since he's writing to a different HTable... how? Does he create an HTable instance in the start() method of his RO object and then reference it later? Or does he create the instance of the HTable on the fly in each postCheckAndPut()? Without seeing his code, we don't know. Note that this is a synchronous set of writes: your overall return from the M/R call to put will wait until the second row is inserted. Interestingly enough, you may want to consider disabling the WAL on the write to the index. You can always run an M/R job that rebuilds the index should something occur to the system where you might lose the data. Indexes *ARE* expendable. ;-) Does that explain it? -Mike

On Feb 18, 2013, at 4:57 AM, yonghu yongyong...@gmail.com wrote: Hi Michael, I don't quite understand what you mean by "round trip back to the client". In my understanding, as the RegionServer and TaskTracker can be the same node, MR doesn't have to pull data
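[Editor's note: the quoted coprocessor lost its string literals and generics in transit, so here is a hedged, compilable reconstruction against the 0.94-era RegionObserver API. The table, family, and qualifier names come from the thread; the class name and remaining details are assumptions.]

    import java.io.IOException;
    import org.apache.hadoop.hbase.CoprocessorEnvironment;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.HTablePool;
    import org.apache.hadoop.hbase.client.Increment;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.WritableByteArrayComparable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DocIndexObserver extends BaseRegionObserver {
        private HTablePool pool;

        @Override
        public void start(CoprocessorEnvironment env) {
            pool = new HTablePool(env.getConfiguration(), 100);
        }

        @Override
        public boolean postCheckAndPut(ObserverContext<RegionCoprocessorEnvironment> c,
                byte[] row, byte[] family, byte[] qualifier, CompareOp compareOp,
                WritableByteArrayComparable comparator, Put put, boolean result)
                throws IOException {
            if (!result) {
                return true; // the checkAndPut failed: duplicate doc, nothing to index
            }
            HTableInterface tableIdx = pool.getTable("doc_idx");
            try {
                for (KeyValue kv : put.get(Bytes.toBytes("doc_content"), Bytes.toBytes(""))) {
                    for (String word : Bytes.toString(kv.getValue()).split("\\s+")) {
                        Increment inc = new Increment(Bytes.toBytes(word));
                        inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), 1);
                        tableIdx.increment(inc); // one synchronous RPC per word
                    }
                }
            } finally {
                tableIdx.close();
            }
            return true;
        }

        @Override
        public void stop(CoprocessorEnvironment env) throws IOException {
            pool.close();
        }
    }

The per-word increment() is a blocking RPC to the index table issued while the original put is still completing, which is consistent with the slowdown being discussed in this thread.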
Re: Optimizing Multi Gets in hbase
I have another question: if I am running a scan wrapped around multiple rows in the same region, in the following way:

    Scan scan = new Scan(getWithMultipleRowsInSameRegion);

how does execution occur? Is it just a sequential scan across the entire region, or does it seek to the HFile blocks containing the actual values? What I truly mean is, let's say the multi get is on the following rows:

Row1 : HFileBlock1
Row2 : HFileBlock20
Row3 : Does not exist
Row4 : HFileBlock25
Row5 : HFileBlock100

The efficient way to do this would be to determine the correct blocks using the index and then search within the blocks for, say, Row1. Then seek to HFileBlock20 and look for Row2. Eliminate Row3 and then keep seeking to + searching within HFileBlocks as needed. I am wondering if a scan wrapped around a Get with multiple rows would do the same? Thanks, Varun

On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon nkey...@gmail.com wrote: [snip -- quoted in full above]
Re: coprocessor enabled put very slow, help please~~~
Good question.. You create a class MyRO. How many instances of MyRO exist per RS? How many queries can access the instance MyRO at the same time?

On Feb 19, 2013, at 9:15 AM, Wei Tan w...@us.ibm.com wrote: [snip -- quoted in full above]
Rowkey design question
Hi, I'm currently playing with HBase. The design of the rowkey seems to be critical. The rowkey for a certain table of mine is: timestamp+ipaddress. It looks something like this when performing a scan on the table in the shell:

    hbase(main):012:0> scan 'ToyDataTable'
    ROW                        COLUMN+CELL
     135702000+192.168.178.9  column=CF:SampleCol, timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00

Since I got several rows for different timestamps, I'd like to restrict a scan to just a range of the table, for example from 2013-01-07 to 2013-01-09. Previously I only had a timestamp as the rowkey and I could restrict the scan like this:

    SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    Date startDate = formatter.parse("2013-01-07 07:00:00");
    Date endDate = formatter.parse("2013-01-10 07:00:00");
    HTableInterface toyDataTable = pool.getTable("ToyDataTable");
    Scan scan = new Scan(Bytes.toBytes(startDate.getTime()), Bytes.toBytes(endDate.getTime()));

But this no longer works with my new design. Is there a way to tell the scan object to filter the rows with respect to the timestamp, or do I have to use a filter object?
Re: Rowkey design question
Hello Paul, Try this and see if it works:

    scan.setStartRow(Bytes.toBytes(startDate.getTime() + ""));
    scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ""));

Also try not to use the TS as the rowkey, as it may lead to RS hotspotting. Just add a hash to your rowkeys so that data is distributed evenly across all the RSs. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com

On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven paul.van.ho...@googlemail.com wrote: [snip -- quoted in full above]
Re: Rowkey design question
Hey Tariq, thanks for your quick answer. I'm not sure I got the idea in the second part of your answer. You mean if I use a timestamp as a rowkey I should append a hash like this: 135727920+MD5HASH, and then the data would be distributed more equally?

2013/2/19 Mohammad Tariq donta...@gmail.com: [snip -- quoted in full above]
Re: coprocessor enabled put very slow, help please~~~
I should follow up by saying that I was asking why he was using an HTablePool, not saying that it was wrong. Still, I think in the pool the writes shouldn't have to go to the WAL.

On Feb 19, 2013, at 10:01 AM, Michael Segel michael_se...@hotmail.com wrote: [snip -- quoted in full above]
Re: Optimizing Multi Gets in hbase
Imho, the easiest thing to do would be to write a filter. You need to order the rows, then you can use hints to navigate to the next row (SEEK_NEXT_USING_HINT). The main drawback I see is that the filter will be invoked on all region servers, including the ones that don't need it. But this also means you have a very specific query pattern (which could be the case, I just don't know), and you can still use the startRow / stopRow of the scan, and create multiple scans if necessary. I'm also interested in Lars' opinion on this. Nicolas

On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma va...@pinterest.com wrote: [snip -- quoted in full above]
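[Editor's note: a hedged sketch of the kind of filter Nicolas describes, against the 0.94-era Filter API. The class name is made up, and the Writable serialization a deployable filter would need is omitted. Given a sorted list of wanted row keys, it includes matching rows and uses SEEK_NEXT_USING_HINT to jump over everything in between.]

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.filter.FilterBase;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SortedRowSetFilter extends FilterBase {
        private final byte[][] sortedRows; // wanted row keys, ascending
        private int idx = 0;               // next candidate in sortedRows
        private byte[] hint;

        public SortedRowSetFilter(byte[][] sortedRows) {
            this.sortedRows = sortedRows;
        }

        @Override
        public ReturnCode filterKeyValue(KeyValue kv) {
            // Advance past wanted keys that sort before the current row.
            while (idx < sortedRows.length
                    && Bytes.compareTo(sortedRows[idx], 0, sortedRows[idx].length,
                           kv.getBuffer(), kv.getRowOffset(), kv.getRowLength()) < 0) {
                idx++;
            }
            if (idx == sortedRows.length) {
                return ReturnCode.NEXT_ROW; // nothing wanted beyond this point
            }
            if (Bytes.compareTo(sortedRows[idx], 0, sortedRows[idx].length,
                    kv.getBuffer(), kv.getRowOffset(), kv.getRowLength()) == 0) {
                return ReturnCode.INCLUDE;  // current row is one of the wanted keys
            }
            hint = sortedRows[idx];         // jump forward to the next wanted key
            return ReturnCode.SEEK_NEXT_USING_HINT;
        }

        @Override
        public KeyValue getNextKeyHint(KeyValue currentKV) {
            return KeyValue.createFirstOnRow(hint);
        }

        @Override
        public boolean filterAllRemaining() {
            return idx == sortedRows.length; // stop the scan once the set is exhausted
        }
    }

This effectively gives the seek-to-block behavior Varun asks about, at the cost Nicolas notes: the filter is shipped to every region the scan covers.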
Re: Using HBase for Deduping
I could surround it with a try..catch, but then each time I insert a UUID for the first time (99% of the time) I would do a checkAndPut(), catch the resulting exception and perform a Put; so, two operations per reduce invocation, which is what I was looking to avoid.

From: Michael Segel michael_se...@hotmail.com To: user@hbase.apache.org; Rahul Ravindran rahu...@yahoo.com Sent: Friday, February 15, 2013 9:24 AM Subject: Re: Using HBase for Deduping

Interesting. Surround with a try catch? But it sounds like you're on the right path. Happy coding!

On Feb 15, 2013, at 11:12 AM, Rahul Ravindran rahu...@yahoo.com wrote: I had tried checkAndPut yesterday with a null passed as the value, and it had thrown an exception when the row did not exist. Perhaps I was doing something wrong. Will try that again, since, yes, I would prefer a checkAndPut().

From: Michael Segel michael_se...@hotmail.com To: user@hbase.apache.org Sent: Friday, February 15, 2013 4:36 AM Subject: Re: Using HBase for Deduping

On Feb 15, 2013, at 3:07 AM, Asaf Mesika asaf.mes...@gmail.com wrote: Michael, this means a read for every write?

Yes and no. At the macro level, a read for every write would mean that your client would read a record from HBase, and then based on some logic it would either write a record, or not. So you have a lot of overhead in the initial get() and then the put(). At this macro level, with a checkAndPut you have less overhead because there is a single message to HBase. Internal to HBase, it would still have to check the value in the row, if it exists, and then perform the insert or not. With respect to your billion events an hour... dividing by 3600 to get the number of events in a second, you would have less than 300,000 events a second. What exactly are you doing, and how large are those events? Since you are processing these events in a batch job, timing doesn't appear to be that important, and of course there is also async hbase, which may improve some of the performance. YMMV, but this is a good example of the checkAndPut().

On Friday, February 15, 2013, Michael Segel wrote: What constitutes a duplicate? An over-simplification is to do an HTable.checkAndPut() where you do the put if the column doesn't exist. Then if the row is inserted (TRUE return value), you push the event. That will do what you want. At least at first blush.

On Feb 14, 2013, at 3:24 PM, Viral Bajaria viral.baja...@gmail.com wrote: Given the size of the data (> 1B rows) and the frequency of the job run (once per hour), I don't think your most optimal solution is to look up HBase for every single event. You will benefit more by loading the HBase table directly in your MR job. In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs? Also, once you have done the de-dupe, are you going to use the data again in some other way, i.e. online serving of traffic or some other analysis? Or is this just to compute some unique #'s? It will be more helpful if you describe your final use case of the computed data too. Given the amount of back and forth, we can take it off list too and summarize the conversation for the list.

On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran rahu...@yahoo.com wrote: We can't rely on the assumption that event dupes will not dupe outside an hour boundary. So, your take is that doing a lookup per event within the MR job is going to be bad?

From: Viral Bajaria viral.baja...@gmail.com To: Rahul Ravindran rahu...@yahoo.com Cc: user@hbase.apache.org Sent: Thursday, February 14, 2013 12:48 PM Subject: Re: Using HBase for Deduping

You could do with a two-pronged approach here, i.e. some MR and some HBase lookups. I don't think this is the best solution either, given the # of events you will get. FWIW, the solution below again relies on the assumption that if an event is duped in the same hour it won't have a dupe outside of that hour boundary. If it can, then you are better off running an MR job with the current hour + another 3 hours of data, or an MR job with the current hour + the HBase table as input to the job too (i.e. no HBase lookups, just read the HFile directly)?
- Run an MR job which de-dupes events for the current hour, i.e. only runs on 1 hour worth of data.
- Mark records which you were not able to de-dupe in the current run.
- For the records that you were not able to de-dupe, check against HBase whether you saw that event in the past. If you did, you can drop the current event or update the event to the new value (based on your business logic).
- Save all the de-duped events (via HBase bulk upload).
Sorry if I just rambled along, but without knowing the whole problem it's very tough to come up with a probable solution. So correct my assumptions and we could drill down more.
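[Editor's note: the checkAndPut-based dedupe Michael describes looks roughly like the sketch below (0.94-era API; the table, family, and qualifier names are made up). Passing null as the expected value means "apply the put only if the column does not exist", so there is no prior Get and no exception for a missing row.]

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DedupeWriter {
        // Returns true exactly once per UUID: the put is applied only when the
        // column is still absent (null expected value = non-existence check).
        static boolean markIfFirstTime(HTable table, String uuid) throws IOException {
            byte[] row = Bytes.toBytes(uuid);
            byte[] cf = Bytes.toBytes("d");     // assumed family name
            byte[] q  = Bytes.toBytes("seen");  // assumed qualifier
            Put put = new Put(row);
            put.add(cf, q, Bytes.toBytes(true));
            return table.checkAndPut(row, cf, q, null, put);
        }
    }

The boolean return value is the signal Rahul wants: true means this is the first occurrence of the UUID, so push the event downstream.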
Re: Co-Processor in scanning the HBase's Table
Thank you guys.

On Mon, Feb 18, 2013 at 12:00 PM, Amit Sela am...@infolinks.com wrote: Yes... that was emailing half asleep... :)

On Mon, Feb 18, 2013 at 7:23 AM, Anoop Sam John anoo...@huawei.com wrote: We don't have any hook like postScan(). In your case you can try postScannerClose(). This will be called once per region: when the scan on that region is over, the scanner opened on that region gets closed, and at that time this hook will get executed. -Anoop-

From: Farrokh Shahriari [mohandes.zebeleh...@gmail.com] Sent: Monday, February 18, 2013 10:27 AM To: user@hbase.apache.org Cc: cdh-u...@cloudera.org Subject: Re: Co-Processor in scanning the HBase's Table

Thank you, Amit, I will check that. @Anoop: I want to run it just after scanning a region, or after scanning the regions that belong to one regionserver.

On Mon, Feb 18, 2013 at 7:45 AM, Anoop Sam John anoo...@huawei.com wrote: "I wanna use a custom code after scanning a large table and prefer to run the code after scanning each region" -- Exactly at what point do you want to run your custom code? We have hooks at points like opening a scanner at a region, closing a scanner at a region, calling next (pre/post), etc. -Anoop-

From: Farrokh Shahriari [mohandes.zebeleh...@gmail.com] Sent: Monday, February 18, 2013 12:21 AM To: cdh-u...@cloudera.org; user@hbase.apache.org Subject: Co-Processor in scanning the HBase's Table

Hi there, I want to use custom code after scanning a large table, and prefer to run the code after scanning each region. I know that I should use a coprocessor, but I don't know which of Observer, Endpoint, or both of them I should use. Is there any simple example of them? Tnx
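[Editor's note: a minimal sketch of Anoop's postScannerClose() suggestion, against the 0.94-era observer API; the class name and body are illustrative.]

    import java.io.IOException;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.InternalScanner;

    public class ScanCompletionObserver extends BaseRegionObserver {
        @Override
        public void postScannerClose(ObserverContext<RegionCoprocessorEnvironment> c,
                InternalScanner s) throws IOException {
            String region = c.getEnvironment().getRegion().getRegionNameAsString();
            // per-region post-scan logic goes here; runs once as each region's
            // scanner is closed
        }
    }

Since the hook fires once per region as that region's scanner closes, this covers "after scanning each region"; as Anoop notes, there is no table-wide postScan() hook.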
Re: Rowkey design question
No, before the timestamp. All the row keys which are close together go to the same region. This is the default HBase behavior and is meant to make performance better. But sometimes one machine gets overloaded with reads and writes because they all get concentrated on that particular machine, for example with timeseries data. So it's better to hash the keys in order to make them go to all the machines equally. HTH. BTW, did that range query work?? Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com

On Tue, Feb 19, 2013 at 9:54 PM, Paul van Hoven paul.van.ho...@googlemail.com wrote: [snip -- quoted in full above]
Re: Rowkey design question
Yeah, it worked fine. But as I understand it: if I prefix my row key with something like md5-hash + timestamp, then the rowkeys are probably evenly distributed, but how would I then perform a scan restricted to a specific time range?

2013/2/19 Mohammad Tariq donta...@gmail.com: [snip -- quoted in full above]
Re: Rowkey design question
You can use FuzzyRowFilter (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FuzzyRowFilter.html) to do that. Have a look at this link (http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/); you might find it helpful. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com
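A hedged sketch of how FuzzyRowFilter could be applied to the key layout above, assuming a 16-byte MD5 prefix followed by an 8-byte epoch timestamp (mask semantics per the 0.94-era API: 0 means the byte must match, 1 means don't care). This matches one exact timestamp value; for a continuous time range you would need one fuzzy pair per value, or one start/stop scan per salt bucket instead:

import java.util.Arrays;
import java.util.Collections;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyTimestampScan {
    // Match every row whose trailing 8 bytes equal ts, whatever the 16-byte hash prefix is.
    public static Scan scanForTimestamp(long ts) {
        byte[] pattern = new byte[24]; // 16-byte hash + 8-byte timestamp
        byte[] mask = new byte[24];    // mask[16..23] stays 0 = must match
        Arrays.fill(mask, 0, 16, (byte) 1); // hash prefix: don't care
        System.arraycopy(Bytes.toBytes(ts), 0, pattern, 16, 8); // timestamp: fixed
        Scan scan = new Scan();
        scan.setFilter(new FuzzyRowFilter(
            Collections.singletonList(new Pair<byte[], byte[]>(pattern, mask))));
        return scan;
    }
}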
Re: Optimizing Multi Gets in hbase
The other suggestion sounds better to me, where the multi call is modified to run the Get(s) with this new filter, or to just initiate a scan for all the Get(s). Since the client automatically groups the multi calls by region server and only calls the respective region servers, that would avoid invoking the filter on all region servers. This might be an interesting benchmark to run.

On Tue, Feb 19, 2013 at 9:28 AM, Nicolas Liochon nkey...@gmail.com wrote: Imho, the easiest thing to do would be to write a filter. You need to order the rows, then you can use hints to navigate to the next row (SEEK_NEXT_USING_HINT). The main drawback I see is that the filter will be invoked on all region servers, including the ones that don't need it. But this would also mean you have a very specific query pattern (which could be the case, I just don't know), and you can still use the startRow / stopRow of the scan, and create multiple scans if necessary. I'm also interested in Lars' opinion on this. Nicolas

On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma va...@pinterest.com wrote: I have another question: if I am running a scan wrapped around multiple rows in the same region, in the following way:

Scan scan = new Scan(getWithMultipleRowsInSameRegion);

Now, how does execution occur? Is it just a sequential scan across the entire region, or does it seek to the HFile blocks containing the actual values? What I truly mean is, let's say the multi get is on the following rows:

Row1 : HFileBlock1
Row2 : HFileBlock20
Row3 : Does not exist
Row4 : HFileBlock25
Row5 : HFileBlock100

The efficient way to do this would be to determine the correct blocks using the index and then search within the blocks for, say, Row1. Then seek to HFileBlock20 and look for Row2. Eliminate Row3 and keep seeking to + searching within HFileBlocks as needed. I am wondering if a scan wrapped around a Get with multiple rows would do the same? Thanks Varun

On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon nkey...@gmail.com wrote: Looking at the code, it seems possible to do this server side within the multi invocation: we could group the gets by region, and do a single scan. We could also add some heuristics if necessary...
Re: Optimizing Multi Gets in hbase
I was thinking along the same lines: doing a skip scan via filter hinting. The problem is, as you say, that the Filter is instantiated everywhere and it might be of significant size (it has to maintain all the row keys you are looking for). RegionScanner now has a reseek method, so it is possible to do this via a coprocessor. They are also loaded per region (but at least not per store), and one can use the shared coproc state I added to alleviate the memory concern. Thinking about this in terms of multiple scans is interesting. One could identify clusters of close row keys in the Gets and issue a Scan for each cluster. -- Lars
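A hedged sketch of the skip-scan filter being discussed, written against the 0.94-era Filter API (the class name is made up; Writable serialization, which a deployable filter would also need, is omitted):

import java.util.SortedSet;
import java.util.TreeSet;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeySetFilter extends FilterBase {
    private final TreeSet<byte[]> wanted = new TreeSet<byte[]>(Bytes.BYTES_COMPARATOR);
    private byte[] nextHint;    // next wanted row key to seek to
    private boolean exhausted;  // no wanted keys remain at or after the current row

    public RowKeySetFilter(SortedSet<byte[]> rowKeys) {
        wanted.addAll(rowKeys);
    }

    @Override
    public ReturnCode filterKeyValue(KeyValue kv) {
        byte[] row = kv.getRow();
        if (wanted.contains(row)) {
            return ReturnCode.INCLUDE; // one of the rows we are looking for
        }
        SortedSet<byte[]> tail = wanted.tailSet(row); // wanted keys >= current row
        if (tail.isEmpty()) {
            exhausted = true; // nothing left to find
            return ReturnCode.NEXT_ROW;
        }
        nextHint = tail.first();
        return ReturnCode.SEEK_NEXT_USING_HINT; // jump straight to the next wanted row
    }

    @Override
    public KeyValue getNextKeyHint(KeyValue currentKV) {
        return KeyValue.createFirstOnRow(nextHint);
    }

    @Override
    public boolean filterAllRemaining() {
        return exhausted;
    }
}

The TreeSet holding every wanted key is exactly the per-instantiation memory cost raised above.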
Re: Optimizing Multi Gets in hbase
Interesting. In the client we're already doing a group-by-location for the multi get. So we could have the filter as HBase core code, and then we could use it in the client for the multi get: compared to my initial proposal, we don't have to change anything in the server code and we reuse the filtering framework. The filter can also be used independently. Is there any issue with this? The reseek seems to be quite smart in the way it handles the bloom filters; I don't know if it behaves differently in this case vs. a simple get.
Re: Optimizing Multi Gets in hbase
As well, an advantage of going only to the servers needed is the famous MTTR: there is less chance of going to a dead server or to a region that just moved.
Scanning a row for certain md5hash does not work
I'm currently reading a book about HBase (HBase in Action, by Manning). In this book it is explained how to perform a scan if the rowkey is made out of an md5 hash (page 45 in the book). My rowkey design (and table filling method) looks like this:

SimpleDateFormat dateFormatter = new SimpleDateFormat("yyyy-MM-dd");
SimpleDateFormat timeFormatter = new SimpleDateFormat("HH:mm:ss");
Date date = dateFormatter.parse("2013-01-01");
for( int i = 0; i < 31; ++i ) {
  for( int k = 0; k < 24; ++k ) {
    for( int j = 0; j < 1; ++j ) {
      //md5() is a custom method that transforms a string into a md5 hash
      byte[] ts = md5( dateFormatter.format(date) );
      byte[] tm = md5( timeFormatter.format(date) );
      byte[] ip = md5( generateRandomIPAddress() /* toy method that generates ip addresses */ );
      byte[] rowkey = new byte[ ts.length + tm.length + ip.length ];
      System.arraycopy( ts, 0, rowkey, 0, ts.length );
      System.arraycopy( tm, 0, rowkey, ts.length, tm.length );
      System.arraycopy( ip, 0, rowkey, ts.length+tm.length, ip.length );
      Put p = new Put( rowkey );
      p.add( Bytes.toBytes("CF"), Bytes.toBytes("SampleCol"),
             Bytes.toBytes( "Value_" + (i+1) + " = " + dateFormatter.format(date) + " " + timeFormatter.format(date) ) );
      toyDataTable.put( p );
    }
    //custom method that adds an hour to the current date object
    date = addHours( date, 1 );
  }
}

Now I'd like to do the following scan (I more or less took the same code from the example in the book):

SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd");
Date refDate = formatter.parse("2013-01-15");
HTableInterface toyDataTable = pool.getTable("ToyDataTable");
byte[] md5Key = md5( refDate.getTime() + "" );
int md5Length = 16;
int longLength = 8;
byte[] startRow = Bytes.padTail( md5Key, longLength );
byte[] endRow = Bytes.padTail( md5Key, longLength );
endRow[md5Length-1]++;
Scan scan = new Scan( startRow, endRow );
ResultScanner rs = toyDataTable.getScanner( scan );
for( Result r : rs ) {
  String value = Bytes.toString( r.getValue( Bytes.toBytes("CF"), Bytes.toBytes("SampleCol")) );
  System.out.println( value );
}

The result is empty. How is that possible?
Re: Scanning a row for certain md5hash does not work
Sorry, I had a mistake in my rowkey generation. Thanks for reading!
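For readers who hit the same symptom, a plausible culprit in the code above (a reading of this thread, not something the poster confirmed): the write path hashes the formatted date string, md5( dateFormatter.format(date) ), while the scan hashes the epoch-millisecond string, md5( refDate.getTime() + "" ), so the two MD5 prefixes can never line up. A minimal sketch of a scan key consistent with the write path:

// Hash the same string that was used when the row was written.
byte[] md5Key = md5( formatter.format(refDate) ); // not md5( refDate.getTime() + "" )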
Re: coprocessor enabled put very slow, help please~~~
A coprocessor is some code running in a server process. The resources available and the rules of the road are different from client-side programming. HTablePool (and HTable in general) is problematic for server-side programming, in my opinion: http://search-hadoop.com/m/XtAi5Fogw32 Since this comes up now and again, it seems like a lightweight alternative for server-side IPC could be useful.

On Tue, Feb 19, 2013 at 7:15 AM, Wei Tan w...@us.ibm.com wrote: A side question: if HTablePool is not encouraged to be used... how do we handle thread safety when using HTable? Any replacement for HTablePool in the plans? Thanks, Best Regards, Wei

From: Michel Segel michael_se...@hotmail.com To: user@hbase.apache.org Date: 02/18/2013 09:23 AM Subject: Re: coprocessor enabled put very slow, help please~~~ Why are you using an HTablePool? Why are you closing the table after each iteration through? Try using 1 HTable object. Turn off the WAL. Instantiate in start(). Close in stop(). Surround the use in a try / catch. If an exception is caught, re-instantiate a new HTable connection. Maybe you want to flush the connection after puts. Again, not sure why you are using check and put on the base table. Your count could be off. As an example look at the poem/rhyme 'Mary had a little lamb', then check your word count. Sent from a remote device. Please excuse any typos... Mike Segel

On Feb 18, 2013, at 7:21 AM, prakash kadel prakash.ka...@gmail.com wrote: Thank you guys for your replies. Michael, I think I didn't make it clear. Here is my use case: I have text documents to insert into HBase (with possible duplicates). Suppose I have a document such as: "I am working. He is not working". I want to insert this document into a table in HBase, say table doc:

=doc table=
rowKey : doc_id
cf : doc_content
value : I am working. He is not working

Now, I want to create another table that stores the word count, say doc_idx:

doc_idx table
rowKey : I, cf: count, value: 1
rowKey : am, cf: count, value: 1
rowKey : working, cf: count, value: 2
rowKey : He, cf: count, value: 1
rowKey : is, cf: count, value: 1
rowKey : not, cf: count, value: 1

My MR job code:
==
if (doc.checkAndPut(rowKey, Bytes.toBytes("doc_content"), Bytes.toBytes(""), null, putDoc)) {
  for (String word : doc_content.split("\\s+")) {
    Increment inc = new Increment(Bytes.toBytes(word));
    inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), 1);
  }
}

Now, I wanted to do some experiments with coprocessors, so I modified the code as follows.

My MR job code:
===
doc.checkAndPut(rowKey, Bytes.toBytes("doc_content"), Bytes.toBytes(""), null, putDoc);

Coprocessor code:
===
public void start(CoprocessorEnvironment env) {
  pool = new HTablePool(conf, 100);
}

public boolean postCheckAndPut(c, row, family, byte[] qualifier, compareOp, comparator, put, result) {
  if (!result) return true; // check if the put succeeded
  HTableInterface table_idx = pool.getTable("doc_idx");
  try {
    for (KeyValue contentKV : put.get(Bytes.toBytes("doc_content"), Bytes.toBytes(""))) {
      for (String word : Bytes.toString(contentKV.getValue()).split("\\s+")) {
        Increment inc = new Increment(Bytes.toBytes(word));
        inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), 1);
        table_idx.increment(inc);
      }
    }
  } finally {
    table_idx.close();
  }
  return true;
}

public void stop(env) {
  pool.close();
}

I am a newbie to HBase. I am not sure this is the way to do it. Given that, why is the coprocessor-enabled version much slower than the one without? Sincerely, Prakash Kadel

On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel michael_se...@hotmail.com wrote: The issue I was talking about was the use of a check and put.
The OP wrote: "each map inserts to doc table (checkAndPut); a regionobserver coprocessor does a postCheckAndPut and inserts some rows to an index table." My question is why the OP uses a checkAndPut, and the RegionObserver's postCheckAndPut. Here's a good example... http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put The OP doesn't really get into the use case, so we don't know why the check and put is in the M/R job. He should just be using put() and then a postPut(). Another issue... since he's writing to a different HTable... how? Does he create an HTable instance in the start() method of his RO object and then reference it later? Or does he create the instance of the HTable on the fly in each postCheckAndPut()? Without seeing his code, we don't know. Note that this is a synchronous set of writes: your overall return from the M/R call to put will wait until the second row is inserted.
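A hedged sketch of the pattern Michael describes, against the 0.94-era coprocessor API (the class name is illustrative; error handling and re-instantiating a broken table handle are elided): one table handle created in start(), reused by a plain postPut(), and closed in stop().

import java.io.IOException;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class WordCountObserver extends BaseRegionObserver {
    private static final byte[] CONTENT = Bytes.toBytes("doc_content");
    private static final byte[] COUNT = Bytes.toBytes("count");
    private static final byte[] EMPTY = Bytes.toBytes("");

    private HTableInterface indexTable;

    @Override
    public void start(CoprocessorEnvironment env) throws IOException {
        indexTable = env.getTable(Bytes.toBytes("doc_idx")); // created once, reused
    }

    @Override
    public void stop(CoprocessorEnvironment env) throws IOException {
        indexTable.close();
    }

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put,
            WALEdit edit, boolean writeToWAL) throws IOException {
        for (KeyValue kv : put.get(CONTENT, EMPTY)) {
            String content = Bytes.toString(kv.getBuffer(), kv.getValueOffset(),
                kv.getValueLength());
            for (String word : content.split("\\s+")) {
                Increment inc = new Increment(Bytes.toBytes(word));
                inc.addColumn(COUNT, EMPTY, 1);
                indexTable.increment(inc); // still one RPC per word; see batching below
            }
        }
    }
}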
Re: coprocessor enabled put very slow, help please~~~
1. Try batching your Increment calls into a List<Row> and use batch() to execute it. Should reduce RPC calls by 2 orders of magnitude.
2. Combine batching with scanning more words, thus aggregating your count for a certain word, thus fewer Increment commands.
3. Enable Bloom Filters. Should speed up Increment by a factor of 2 at least.
4. Don't use keyValue.getValue(). It does a System.arraycopy behind the scenes. Use getBuffer(), getValueOffset() and getValueLength() and iterate on the existing array. Write your own split without using String functions, which go through encoding (expensive); just find your delimiter by byte comparison.
5. Enable Bloom Filters on the doc table. It should speed up the checkAndPut.
6. I wouldn't give up the WAL. It ain't your bottleneck IMO.
Interestingly enough, you may want to consider disabling the WAL on the write to the index. You can always run an M/R job that rebuilds the index should something occur to the system where you might lose data. Indexes *ARE* expendable. ;-) Does that explain it? -Mike

On Feb 18, 2013, at 4:57 AM, yonghu yongyong...@gmail.com wrote: Hi Michael, I don't quite understand what you mean by "round trip back to the client". In my understanding, as the RegionServer and TaskTracker can be on the same node, MR doesn't have to pull data to the client and then process it.
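A hedged sketch of points 1 and 2 from the list above: aggregate the counts locally, then send one Increment per distinct word through batch(). Whether batch() accepts Increments depends on the HBase version; this simply follows the suggestion as written, and the method name is illustrative.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedCounts {
    static void incrementWordCounts(HTableInterface table, String content)
            throws IOException, InterruptedException {
        // Aggregate locally so each distinct word becomes a single Increment.
        Map<String, Long> counts = new HashMap<String, Long>();
        for (String word : content.split("\\s+")) {
            Long c = counts.get(word);
            counts.put(word, c == null ? 1L : c + 1L);
        }
        List<Row> batch = new ArrayList<Row>(counts.size());
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            Increment inc = new Increment(Bytes.toBytes(e.getKey()));
            inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), e.getValue());
            batch.add(inc);
        }
        table.batch(batch); // RPCs grouped per region server by the client
    }
}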
Is there any way to balance one table?
Hi, Is there any way to balance just one table? I found that one of my tables is not balanced, while all the other tables are balanced. So I want to fix this table. Best Regards, Raymond Liu
Re: Is there any way to balance one table?
What version of HBase are you using? 0.94 has per-table load balancing. Cheers
RE: Is there any way to balance one table?
0.94.1. Any cmd in the shell? Or do I need to change the balance threshold to 0 and run the global balancer cmd in the shell? Best Regards, Raymond Liu
availability of 0.94.4 and 0.94.5 in maven repo?
Unless I'm doing something wrong, it looks like the Maven repository (http://mvnrepository.com/artifact/org.apache.hbase/hbase) only contains HBase up to 0.94.3. Is there a different repo I should use, or if not, any ETA on when it'll be updated? James
Re: availability of 0.94.4 and 0.94.5 in maven repo?
I have come across this too. I think someone with authorization needs to perform a Maven release to the Apache Maven repository and/or Maven Central. For now, I just end up compiling the dot release from trunk and deploying it to my local repository for other projects to use. Thanks, Viral
Re: availability of 0.94.4 and 0.94.5 in maven repo?
I also ran into the same issue a day ago while building the YCSB HBase client for HBase 0.94.5. For the time being I used the 0.94.3 version to carry out my work. Regards, Joarder Kamal
RE: Is there any way to balance one table?
I chose to move the regions manually. Any other approach? Best Regards, Raymond Liu
Re: availability of 0.94.4 and 0.94.5 in maven repo?
Same here, just tripped over this moments ago. -- Best regards, - Andy. Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
Re: Is there any way to balance one table?
Hi Liu, Why didn't you simply call the balancer? If the other tables are already balanced, it should not touch them and will only balance the table which is not balanced. JM
Problem In Understanding Compaction Process
Hi guys, I have some problems understanding the compaction process. Can someone shed some light on this? Much appreciated. Here is the problem: after the region server successfully generates the final compacted file, it goes through two steps: 1. move the above compacted file into the region's directory; 2. delete the replaced files. These two steps are not atomic. If the region server crashes after step 1 and before step 2, then there are duplicate records! Is this problem handled in the read path, or is there another mechanism to fix it? -- Best Regards, Anty Rao
Re: Is there any way to balance one table?
HBASE-3373 introduced hbase.master.loadbalance.bytable, which defaults to true. This means that when you issue the 'balancer' command in the shell, the table should be balanced for you. Cheers
RE: Is there any way to balance one table?
Hi, I did call the balancer, but it seems it doesn't work. Might that be because this table is small and the overall region-number difference is within the threshold?
Re: Is there any way to balance one table?
What is the size of your table? -- Marcos Ortiz Valmaseda, Product Manager Data Scientist at UCI Blog: http://marcosluis2186.posterous.com Twitter: @marcosluis2186 http://twitter.com/marcosluis2186
Re: Is there any way to balance one table?
You're right. Default sloppiness is 20%:

this.slop = conf.getFloat("hbase.regions.slop", (float) 0.2);

(src/main/java/org/apache/hadoop/hbase/master/DefaultLoadBalancer.java) Meaning, the region count on any server can be as far as 20% from the average region count. You can tighten sloppiness.
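For reference, a minimal hbase-site.xml sketch of tightening it (0.05 is an arbitrary illustrative value; the master reads this setting, so it needs a restart to take effect):

<property>
  <name>hbase.regions.slop</name>
  <value>0.05</value>
</property>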
RE: Is there any way to balance one table?
I mean the region number is small. Overall I have, say, 3000 regions on 4 nodes, while this table only has 96 regions. It won't be 24 for each region server; instead, it will be something like 19/30/23/21 etc. Does this mean I need to lower the slop to 0.02 or so, so that the balancer actually runs on this table? Best Regards, Raymond Liu
RE: Is there any way to balance one table?
Yeah. Since balancing is already done per table, why isn't slop also calculated per table...
Re: Is there any way to balance one table?
Yes, Raymond. You should lower sloppiness.
RE: Is there any way to balance one table?
Hmm, in order to have the 96-region table balanced within 20% on a 3000-region cluster where every other table is balanced, the slop would need to be around 20%/30, say 0.006? Won't that be too small?
Re: Is there any way to balance one table?
bq. On a 3000 region cluster

Balancing is per-table, meaning the total number of regions doesn't come into play.
region server of -ROOT- table is dead, but not reassigned
Hi, all,

When I scan any table, I get:

13/02/20 05:16:45 INFO ipc.HBaseRPC: Server at Rs1/10.20.118.3:60020 could not be reached after 1 tries, giving up. ... ERROR: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=7, exceptions: ...

What I observe:

1) The -ROOT- table is on region server rs1 (from the master web UI's Table Regions listing):

    Name     Region Server   Start Key   End Key   Requests
    -ROOT-   Rs1:60020       -           -

2) But region server rs1 is dead (from the Dead Region Servers listing):

    ServerName
    Rs4,60020,1361109702535
    Rs1,60020,1361109710150
    Total: servers: 2

Does it mean that the region server holding the -ROOT- table is dead, but the -ROOT- region is not reassigned to any other region server? Why?

Thanks, Wei
RE: region server of -ROOT- table is dead, but not reassigned
By the way, the hbase version I am using is 0.92.1-cdh4.0.1.

From: Lu, Wei Sent: Wednesday, February 20, 2013 1:28 PM To: user@hbase.apache.org Subject: region server of -ROOT- table is dead, but not reassigned
RE: Is there any way to balance one table?
You mean the slop is also applied per table? Weird, then it should work for my case. Let me check again.

Best Regards, Raymond Liu
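For reference, if I read the 0.94 balancer right, the per-table arithmetic here is: 96 regions / 4 servers = 24 regions per server on average, and the default slop of 0.2 tolerates roughly 19 to 29 regions per server for this table, so a 19/30/23/21 spread should already be eligible for balancing. If you still want a tighter bound, the knob is hbase.regions.slop. A minimal sketch (the 0.05 value is only an illustration, not a recommendation):

    <!-- hbase-site.xml on the master; 0.2 is the 0.94 default -->
    <property>
      <name>hbase.regions.slop</name>
      <value>0.05</value>
    </property>

    # then, from the HBase shell, make sure the balancer is enabled and run it:
    hbase(main):001:0> balance_switch true
    hbase(main):002:0> balancer

Note the master has to be restarted to pick up the new slop; balance_switch and balancer are standard 0.94 shell commands.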
Re: [resend] region server of -ROOT- table is dead, but not reassigned
Ideally the -ROOT- region should be reassigned as soon as the RS carrying it goes down; this should happen automatically. What do your logs say? That would give us some insight. Before that, if you can restart your master, it may solve the problem. If it still persists, try deleting the zk data and restarting the cluster.

Regards, Ram
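To make that checklist concrete, here is a minimal sketch of the checks (the log path is an assumption and varies by install; /hbase/root-region-server is the 0.92 default znode, and hbase zkcli should exist in builds of that vintage):

    # 1) Ask ZooKeeper where it thinks -ROOT- is assigned:
    $ hbase zkcli
    [zk: localhost:2181(CONNECTED) 0] get /hbase/root-region-server

    # 2) Look for the server expiration and any failed reassignment in the
    #    active master's log (path is an assumption):
    $ grep -i "root" /var/log/hbase/*master*.log | tail -50

    # 3) If the region stays stuck, restart the master; as a last resort,
    #    with the cluster fully stopped, clear HBase's znodes and restart:
    [zk: localhost:2181(CONNECTED) 1] rmr /hbase

Deleting /hbase wipes in-flight assignment state, so treat it strictly as a last resort on a cluster that is already down.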
Re: availability of 0.94.4 and 0.94.5 in maven repo?
Time permitting, I will do that tomorrow.

From: Andrew Purtell apurt...@apache.org To: user@hbase.apache.org Sent: Tuesday, February 19, 2013 6:58 PM Subject: Re: availability of 0.94.4 and 0.94.5 in maven repo?

Same here, just tripped over this moments ago.

On Tue, Feb 19, 2013 at 5:30 PM, James Taylor jtay...@salesforce.com wrote: Unless I'm doing something wrong, it looks like the Maven repository (http://mvnrepository.com/artifact/org.apache.hbase/hbase) only contains HBase up to 0.94.3. Is there a different repo I should use, or if not, any ETA on when it'll be updated? James

-- Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
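For anyone hitting this later: once the artifacts are published, the coordinates are the usual ones. A minimal pom.xml sketch (0.94.5 shown only as an example version):

    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase</artifactId>
      <version>0.94.5</version>
    </dependency>

The mvnrepository.com page only mirrors Maven Central, so the canonical place to check for a release is repo1.maven.org under org/apache/hbase/hbase.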
Re: PreSplit the table with Long format
Hello again, Does anyone know how I can do this? The problem is: when you insert something from the shell, it assumes it's a string, does a Bytes.toBytes conversion on the string, and stores that in hbase. So how can I tell the shell that the thing I'm entering isn't a string? How can I put a value in long format into hbase through the shell?

If you need to know: I want to pre-split my table. I can't do it through Java code, because I've installed a security library on hbase that adds a securecreate command to the shell for creating encrypted tables, and encrypted tables can't be created from Java code. So I'm forced to use the shell to create the table, and I want to pre-split it with long values, because my row keys are longs. Please help, I really need this. Thanks

On Tue, Feb 19, 2013 at 2:12 PM, Farrokh Shahriari mohandes.zebeleh...@gmail.com wrote: Thanks for your help, but it doesn't work. Do you have any other idea? I must run it from the shell. Farrokh

On Tue, Feb 19, 2013 at 1:30 PM, Viral Bajaria viral.baja...@gmail.com wrote: The HBase shell is a JRuby shell, so you can invoke any Java code from it. For example:

    import org.apache.hadoop.hbase.util.Bytes
    Bytes.toLong(Bytes.toBytes(1000))

Not sure if this works as expected since I don't have a terminal in front of me, but you could try (assuming the SPLITS keyword takes byte arrays as input; I've never used SPLITS from the command line):

    create 'testTable', 'cf1', { SPLITS => [ Bytes.toBytes(1000), Bytes.toBytes(2000), Bytes.toBytes(3000) ] }

Thanks, Viral

On Tue, Feb 19, 2013 at 1:52 AM, Farrokh Shahriari mohandes.zebeleh...@gmail.com wrote: Hi there, As I use row keys in long format, I must pre-split the table in long format too. But when I run this command, it pre-splits the table with STRING-format keys:

    create 'testTable', 'cf1', { SPLITS => [ '1000', '2000', '3000' ] }

How can I pre-split the table with long-format keys? Farrokh
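One possible workaround, since the shell is just a JRuby interpreter: build the 8-byte big-endian encodings yourself and hand them to create (or to the securecreate command the security library adds, if it accepts the same SPLITS option) as plain Ruby strings. This is a sketch under the assumption that the shell converts each SPLITS entry with to_java_bytes, as 0.94's admin.rb does; the table name and split points are placeholders:

    # inside the hbase shell (JRuby)
    import org.apache.hadoop.hbase.util.Bytes

    # Encode each long as its 8-byte big-endian form, then wrap the java
    # byte[] back into a Ruby string so the shell's SPLITS handling
    # (String#to_java_bytes) round-trips the exact bytes.
    split_keys = [1000, 2000, 3000].map do |n|
      String.from_java_bytes(Bytes.toBytes(java.lang.Long.valueOf(n)))
    end

    create 'testTable', 'cf1', { SPLITS => split_keys }

If it works, the web UI should show region boundaries like \x00\x00\x00\x00\x00\x00\x03\xE8 (1000 as a long) rather than the ASCII string '1000'.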