Re: Overwrite a row
The schema is known beforehand, so this is exactly what I need. Great!

One more question: what guarantees does the batch operation have? Are the operations contained within each batch atomic, i.e. will all mutations be given the same timestamp? If something fails, do all operations fail, or can it fail partially?

Thanks for your help, much appreciated.

Cheers,
-Kristoffer

On Sat, Apr 20, 2013 at 4:47 AM, Ted Yu yuzhih...@gmail.com wrote:

I don't know the details of Kristoffer's schema. If all the column qualifiers are known a priori, mutateRow() should serve his needs. HBase allows an arbitrary number of columns in a column family. If the schema is dynamic, mutateRow() wouldn't suffice. If the column qualifiers are known but the row is very wide (and only a few columns are updated per call), performance would degrade. Just some factors to consider.

Cheers

On Fri, Apr 19, 2013 at 1:41 PM, Mohamed Ibrahim mibra...@mibrahim.net wrote:

Actually I do see it in the 0.94 JavaDocs ( http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/client/HTable.html#mutateRow(org.apache.hadoop.hbase.client.RowMutations) ), so maybe it was added in 0.94.6 even though the JIRA says fixed in 0.95. I haven't used it, but it seems that's what you're looking for. Sorry for the confusion.

Mohamed

On Fri, Apr 19, 2013 at 4:35 PM, Mohamed Ibrahim mibra...@mibrahim.net wrote:

It seems that 0.95 is not released yet, so mutateRow won't be a solution for now. I saw it in the downloads and thought it was released.

On Fri, Apr 19, 2013 at 4:18 PM, Mohamed Ibrahim mibra...@mibrahim.net wrote:

Just noticed you want to delete as well. I think that's supported since 0.95 in mutateRow ( http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#mutateRow(org.apache.hadoop.hbase.client.RowMutations) ). You can do multiple puts and deletes and they will be performed atomically, so you can remove qualifiers and put new ones.

Mohamed

On Fri, Apr 19, 2013 at 3:44 PM, Kristoffer Sjögren sto...@gmail.com wrote:

What would you suggest? I want the operation to be atomic.

On Fri, Apr 19, 2013 at 8:32 PM, Ted Yu yuzhih...@gmail.com wrote:

What is the maximum number of versions you allow for the underlying table?

Thanks

On Fri, Apr 19, 2013 at 10:53 AM, Kristoffer Sjögren sto...@gmail.com wrote:

Hi,

Is it possible to completely overwrite/replace a row in a single _atomic_ action? Already existing columns and qualifiers should be removed if they do not exist in the data inserted into the row. The only way to do this is to first delete the row and then insert new data in its place, correct? Or is there an operation to do this?

Cheers,
-Kristoffer
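To make the approach discussed above concrete, here is a rough, untested sketch of the mutateRow() idea against the 0.94.6-era client API. The table, family, and qualifier names ("f", "q1", "v1") are invented for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RowMutations;
import org.apache.hadoop.hbase.util.Bytes;

public class OverwriteRow {
    // Atomically replace one row: delete everything, then put the new cells.
    // mutateRow() applies all mutations under the row lock, so readers never
    // see a half-deleted, half-written row. Single-row only.
    static void overwrite(HTable table, byte[] row) throws IOException {
        RowMutations rm = new RowMutations(row);
        rm.add(new Delete(row));  // no family/qualifier: whole-row delete
        Put p = new Put(row);
        p.add(Bytes.toBytes("f"), Bytes.toBytes("q1"), Bytes.toBytes("v1"));
        rm.add(p);
        table.mutateRow(rm);      // applied atomically on the region server
    }
}
```

One caveat worth testing: if the Delete tombstone and the new Puts end up with the same timestamp, the tombstone may mask the new cells until a major compaction, so deleting at an explicit earlier timestamp can be safer.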
Re: RefGuide schema design examples
+1

R. A.

On 20 Apr 2013 12:07, Viral Bajaria viral.baja...@gmail.com wrote:

+1!

On Fri, Apr 19, 2013 at 4:09 PM, Marcos Luis Ortiz Valmaseda marcosluis2...@gmail.com wrote:

Wow, great work, Doug.

2013/4/19 Doug Meil doug.m...@explorysmedical.com

Hi folks, I reorganized the Schema Design case studies 2 weeks ago and consolidated them here, plus added several cases common on the dist-list:

http://hbase.apache.org/book.html#schema.casestudies

Comments/suggestions welcome. Thanks!

Doug Meil
Chief Software Architect, Explorys
doug.m...@explorysmedical.com

--
Marcos Ortiz Valmaseda, Data-Driven Product Manager at PDVSA
Blog: http://dataddict.wordpress.com/
LinkedIn: http://www.linkedin.com/in/marcosluis2186
Twitter: @marcosluis2186
Re: Slow region server recoveries
Hi,

I looked at it again with a fresh eye. As Varun was saying, the root cause is the wrong order of the block locations. The root cause of the root cause is actually simple: HBase started the recovery while the node was not yet stale from an HDFS point of view.

Varun mentioned this timing:
Lost beat: 27:30
Became stale: 27:50 - this is a guess, reverse engineered (stale timeout 20 seconds)
Became dead: 37:51

But the recovery started at 27:13 (15 seconds before we have this log line):

2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.156.194.94:50010 for file /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/02f639bb43944d4ba9abcf58287831c0 for block BP-696828882-10.168.7.226-1364886167971:blk_-5977178030490858298_99853:java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.156.194.94:50010]

So when we took the blocks from the NN, the datanode was not stale, so you get the wrong (random) order.

ZooKeeper can expire a session before the timeout. I don't know why it does this in this case, but I don't consider it a ZK bug: if ZK knows that a node is dead, it's its role to expire the session. There is something more fishy: we started the recovery while the datanode was still responding to heartbeats. I don't know why. Maybe the OS was able to kill -15 the RS before vanishing away. Anyway, we then get an exception when we try to connect, because the RS no longer has a TCP connection to this datanode. And this is retried many times. You would not have this with trunk, because HBASE-6435 reorders the blocks inside the client, using information not available to the NN, excluding the datanode of the region server under recovery.

Some conclusions:
- We should likely backport HBASE-6435 to 0.94.
- I will revive HDFS-3706 and HDFS-3705 (the non-hacky way to get HBASE-6435).
- There is some stuff that could be better in HDFS. I will see.
- I'm worried by the SocketTimeoutException. We should get NoRouteToHost at some point, and we don't. That's also why it takes ages. I think it's an AWS thing, but it brings two issues: it's slow, and, in HBase, you don't know whether the operation could have been executed or not, so it adds complexity to some scenarios. If someone with enough network and AWS knowledge could clarify this point, it would be great.

Cheers,

Nicolas

On Fri, Apr 19, 2013 at 10:10 PM, Varun Sharma va...@pinterest.com wrote:

This is 0.94.3 hbase...

On Fri, Apr 19, 2013 at 1:09 PM, Varun Sharma va...@pinterest.com wrote:

Hi Ted,

I had a long offline discussion with Nicolas on this. Looks like the last block, which was still being written to, took an enormous time to recover. Here's what happened:

a) Master splits tasks and region servers process them
b) Region server tries to recover the lease for each WAL log - most cases are noops since they are already rolled over/finalized
c) The last file's lease recovery takes some time, since the crashing server was writing to it and had a lease on it - but basically we have the lease 1 minute after the server was lost
d) Now we start the recovery for this, but we end up hitting the stale datanode, which is puzzling.

It seems that we did not hit the stale datanode when we were trying to recover the finalized WAL blocks with trivial lease recovery. However, for the final block, we hit the stale datanode. Any clue why this might be happening?

Varun

On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu yuzhih...@gmail.com wrote:

Can you show a snippet from the DN log which mentioned UNDER_RECOVERY?

Here is the criteria for stale node checking to kick in (from https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch ):

+ * Check if the datanode is in stale state. Here if
+ * the namenode has not received heartbeat msg from a
+ * datanode for more than staleInterval (default value is
+ * {@link DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
+ * the datanode will be treated as stale node.

On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma va...@pinterest.com wrote:

Is there a place to upload these logs?

On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma va...@pinterest.com wrote:

Hi Nicolas,

Attached are the namenode and dn logs (of one of the healthy replicas of the WAL block), and the rs logs which got stuck doing the log split. Action begins at 2013-04-19 00:27*. Also, the rogue block is 5723958680970112840_174056. It's very interesting to trace this guy through the HDFS logs (dn and nn).

Btw, do you know what the UNDER_RECOVERY stage is for in HDFS? Also, does the stale node stuff kick in for that state?

Thanks
Varun

On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon
Re: Overwrite a row
Operations within each batch are atomic. They would either all succeed or all fail. Time stamps would all refer to the latest cell (KeyVal).

Cheers

On Apr 20, 2013, at 12:17 AM, Kristoffer Sjögren sto...@gmail.com wrote:

The schema is known beforehand so this is exactly what I need. Great! One more question. What guarantees does the batch operation have? Are the operations contained within each batch atomic? I.e. all mutations will be given the same timestamp? If something fails, all operations fail or can it fail partially?
Re: talk list table
Hope I'm not too late here... Regarding hotspotting with sequential keys, I'd suggest you read this Sematext blog post: http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/ They present a nice idea there for this kind of issue. Good luck!

On Mon, Apr 15, 2013 at 11:18 PM, Ted Yu yuzhih...@gmail.com wrote:

bq. write performance would be lower

The above means poorer performance.

bq. I could batch them up application side

Please do that.

bq. I guess there is no way to turn that off?

That's right.

On Mon, Apr 15, 2013 at 11:15 AM, Kireet kir...@feedly.com wrote:

Thanks for the reply. "write performance would be lower" - does this mean better? Also, I think I used the wrong terminology regarding batching. I meant to ask whether it uses the client-side write buffer. I would think not, since the append() method returns a Result. I could batch them up application side, I suppose. Append also seems to return the updated value. This seems like a lot of unnecessary I/O in my case, since I am not immediately interested in the updated value. I guess there is no way to turn that off?

On 4/15/13 1:28 PM, Ted Yu wrote:

I assume you would select HBase 0.94.6.1 (the latest release) for this project.

For #1, write performance would be lower if you choose to use Append (vs. using Put).

bq. Can appends be batched by the client or do they execute immediately?

This depends on your use case. Take a look at the following method in HTable, where you can send a list of actions (Appends):

public void batch(final List<? extends Row> actions, final Object[] results)

For #2,

bq. The other would be to prefix the timestamp row key with a random leading byte.

This technique has been used elsewhere and is better than the first one.

Cheers

On Mon, Apr 15, 2013 at 6:09 AM, Kireet Reddy kireet-Teh5dPVPL8nQT0dZR+a...@public.gmane.org wrote:

We are planning to create a scheduled task list table in our HBase cluster. Essentially we will define a table keyed by timestamp, and then the row contents will be all the tasks that need to be processed within that second (or whatever time period). I am trying to do the "reasonably wide rows" design mentioned in the HBaseCon OpenTSDB talk. A couple of questions:

1. Should we use append or put to create tasks? Since these rows will not live forever, storage space is not a concern; read/write performance is more important. As concurrency increases, I would guess the row lock may become an issue with append? Can appends be batched by the client or do they execute immediately?

2. I am a little worried about hotspots. This basic design may cause issues in terms of the table's performance. Many tasks will execute and reschedule themselves using the same interval, t + 1 hour for example, so many of the writes may all go to the same block. Also, we have a lot of other data, so I am worried it may impact performance of unrelated data if the region server gets too busy servicing the task list table. I can think of 2 strategies to avoid this. One would be to create N different tables and read/write tasks to them randomly. This may spread load across servers, but there is no guarantee HBase will place the tables on different region servers, correct? The other would be to prefix the timestamp row key with a random leading byte. Then, when reading from the task list table, consumers could scan from any/all possible values of the random byte + current timestamp to obtain tasks. Both strategies seem like they could spread out load, but at the cost of more work/complexity to read tasks from the table. Do either of those approaches make sense?

On the read side, it seems like a similar problem exists in that all consumers will be reading rows based on the current timestamp. Is this good because the block will very likely be cached, or bad because the region server may become overloaded? I have a feeling the answer is going to be "it depends". :) I did see the previous posts on queues and the tips there - use zookeeper for coordination, schedule major compactions, etc. Sorry if these questions are basic, I am pretty new to hbase. Thanks!
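The "random leading byte" strategy described above can be sketched in plain Java. This is only an illustration; the names (NUM_BUCKETS, saltedKey) are made up, and in practice the leading byte should be derived deterministically from the key (rather than truly random) so a given task row can be located again:

```java
public class SaltedKeys {
    // Number of buckets = number of distinct leading bytes. More buckets
    // spread writes over more regions, but readers must issue one scan per
    // bucket, so keep it small (hypothetical choice).
    static final int NUM_BUCKETS = 8;

    // Build a 9-byte row key: 1 salt byte + 8-byte big-endian timestamp.
    // timestamp % NUM_BUCKETS makes the salt deterministic, so the same
    // second always maps to the same bucket.
    static byte[] saltedKey(long timestamp) {
        byte[] key = new byte[9];
        key[0] = (byte) (timestamp % NUM_BUCKETS);
        for (int i = 0; i < 8; i++) {
            key[1 + i] = (byte) (timestamp >>> (8 * (7 - i)));
        }
        return key;
    }

    public static void main(String[] args) {
        // Consecutive timestamps land in different buckets...
        System.out.println(saltedKey(1000)[0]); // bucket 0 (1000 % 8 == 0)
        System.out.println(saltedKey(1001)[0]); // bucket 1
        // ...so a consumer runs one small scan per salt value 0..7
        // and merges the results to collect all tasks for a second.
    }
}
```

This is exactly the extra read-side work/complexity trade-off mentioned in the question: NUM_BUCKETS scans per time period instead of one.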
hbase + mapreduce
Hello,

I'm working on a project, and I'm using HBase to store the data. I have this method that works, but without the performance I'm looking for, so I want to do the same thing using MapReduce.

public ArrayList<MyObject> findZ(String z) throws IOException {
    ArrayList<MyObject> rows = new ArrayList<MyObject>();
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "test");
    Scan s = new Scan();
    s.addColumn(Bytes.toBytes("x"), Bytes.toBytes("y"));
    ResultScanner scanner = table.getScanner(s);
    try {
        for (Result rr : scanner) {
            if (Bytes.toString(rr.getValue(Bytes.toBytes("x"), Bytes.toBytes("y"))).equals(z)) {
                rows.add(getInformation(Bytes.toString(rr.getRow())));
            }
        }
    } finally {
        scanner.close();
    }
    return rows;
}

The getInformation method takes all the columns and converts the row into a MyObject. I just want an example or a link to a tutorial that does something like this: I want to get a result type as the answer, not a number counting words, like many examples I have found. My native language is Spanish, so sorry if something is not well written. Thanks

http://www.uci.cu
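Not a full tutorial, but here is a rough, untested sketch of what this could look like as a map-only job over the table, with the comparison pushed server-side via SingleColumnValueFilter. The class/table/column names match the snippet above; the MyObject assembly and the output format are left out:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class FindZJob {
    static class FindZMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                throws IOException, InterruptedException {
            // Rows reaching the mapper already passed the server-side filter,
            // so just emit the row key (getInformation(...) would go here).
            ctx.write(new Text(Bytes.toString(row.get())), new Text(""));
        }
    }

    static Job createJob(String z) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("x"), Bytes.toBytes("y"));
        // Push the equality check to the region servers instead of
        // comparing client-side as in the snippet above.
        scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("x"), Bytes.toBytes("y"),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes(z)));
        scan.setCaching(500);        // fewer RPCs for full-table scans
        scan.setCacheBlocks(false);  // don't pollute the block cache
        Job job = new Job(conf, "findZ");
        job.setJarByClass(FindZJob.class);
        TableMapReduceUtil.initTableMapperJob("test", scan,
                FindZMapper.class, Text.class, Text.class, job);
        job.setNumReduceTasks(0);    // map-only; results land in HDFS files
        return job;
    }
}
```

Even without MapReduce, just calling scan.setFilter(...) on the existing scanner would already avoid shipping non-matching rows to the client, which may be most of the performance win.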
Re: Overwrite a row
Just to be absolutely clear, is this also true for a batch that spans multiple rows?

On Sat, Apr 20, 2013 at 2:42 PM, Ted Yu yuzhih...@gmail.com wrote:

Operations within each batch are atomic. They would either all succeed or all fail. Time stamps would all refer to the latest cell (KeyVal).
Re: Slow region server recoveries
Hi Nicolas,

Regarding the following, I think this is not a recovery - the file below is an HFile and is being accessed on a get request. On this cluster, I don't have block locality. I see these exceptions for a while and then they are gone, which means the stale node thing kicks in.

2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.156.194.94:50010 for file /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/02f639bb43944d4ba9abcf58287831c0 for block

This is the real bummer. The stale datanode is 1st even 90 seconds afterwards:

2013-04-19 00:28:35,777 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of hdfs://ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141 failed, returning error
java.io.IOException: Cannot obtain block length for LocatedBlock{BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056; getBlockSize()=0; corrupt=false; offset=0; locs=[10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010]}
    at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:238)
    at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:182)
    at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:124)
    at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:117)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1080)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:245)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:78)
    at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1787)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.openFile(SequenceFileLogReader.java:62)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1707)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:717)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:821)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:734)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:348)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
    at java.lang.Thread.run(Thread.java:662)
default region splitting on which value?
Hi,

I am just reading about region splitting. By default, as I understand it, HBase handles splitting the regions. I just don't know how to imagine on which key it splits a region.

1) For example, when I write MD5 hashes as row keys, they are most probably evenly distributed from 00... to FF..., right? When HBase starts with one region, all the writes go into that region, and when the HFile gets too big, does it just take, for example, the median value of the stored keys and split the region on that?

2) I want to bulk load tons of data with the HBase Java client API put operations, and I want it to perform well. My keys are numeric sequential values (which I know from this post I cannot load into HBase sequentially, because the HBase tables are going to be sad: http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/ ). So I thought I would pre-split the table into regions and load the data in randomized order. This way I will get good distribution among region servers in terms of network I/O from the beginning. Is that a good idea?

3) If my row keys are not evenly distributed in the keyspace, but show some peaks or bursts, e.g. 000-999 but most of the keys gather around the 020 and 060 values, is it a good idea to have the pre-split regions at those peaks?

Thanks in advance,
Pal
Re: Slow region server recoveries
The important thing to note is that the block for this rogue WAL is in UNDER_RECOVERY state. I have repeatedly asked HDFS dev whether the stale node thing kicks in correctly for UNDER_RECOVERY blocks, but haven't gotten an answer.
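For reference, the stale-node detection discussed throughout this thread is governed by the HDFS-3703 settings. A sketch of the relevant hdfs-site.xml fragment (the 20-second interval matches the timing guessed earlier in the thread; the stock HDFS default is 30 seconds, and these keys require a NameNode carrying the HDFS-3703 change):

```xml
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
  <!-- prefer non-stale replicas when ordering block locations -->
</property>
<property>
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>20000</value>
  <!-- no heartbeat for 20s => mark the datanode stale -->
</property>
```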
Re: default region splitting on which value?
How many column families do you have?

For #3, pre-splitting the table at the row keys corresponding to the peaks makes sense.
Re: default region splitting on which value?
Hi Ted, Only one column family. My data is very simple key-value, but I want to make sequential scans, so hashing the key is not an option.

On Sat, Apr 20, 2013 at 10:07 PM, Ted Yu yuzhih...@gmail.com wrote: How many column families do you have? For #3, pre-splitting the table at the row keys corresponding to the peaks makes sense. [...]
Re: default region splitting on which value?
The answer to your first question is yes - the midkey of the key range would be chosen as the split key.

For #2, can you tell us how you plan to randomize the loading? Bulk load normally means preparing HFiles which are loaded directly into your table. Cheers

On Apr 20, 2013, at 1:11 PM, Pal Konyves paul.kony...@gmail.com wrote: Hi Ted, Only one column family. My data is very simple key-value, but I want to make sequential scans, so hashing the key is not an option. [...]
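[Editor's note: a simplified illustration of the midkey answer above. Real HBase picks the block boundary nearest the middle of the largest store file, but the effect is close to taking the median stored key, as sketched here.]

```java
public class MidKey {
    // Simplified model of an automatic region split: the middle key of the
    // sorted key range becomes the split point, yielding two roughly equal
    // halves regardless of where the keys cluster.
    static String chooseSplitKey(String[] sortedKeys) {
        return sortedKeys[sortedKeys.length / 2];
    }

    public static void main(String[] args) {
        String[] keys = {"00a", "17f", "3b2", "8c4", "e91"};
        System.out.println(chooseSplitKey(keys)); // "3b2"
    }
}
```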
Re: default region splitting on which value?
I am writing a paper for school about HBase, so the data I chose is not a real usable example. I am familiar with GTFS, a de facto standard for storing information about public transportation schedules: when a vehicle arrives at a stop and where it goes next. I chose to generate the rows on the fly, where each row represents the sequence of 'bus' stops that make up a route from the first stop to the last stop, e.g.:

  [first_stop_id,last_stop_id],string_sequence_of_stops

where the part within the [...] is the rowkey. So, long story short, I generate the data. I want to use the HBase Java client API to store the rows with Put. I plan to randomize the load by picking random first_stop_id-s and using more threads. The rowkeys will still contain sequential runs, because the way I generate the rows outputs about 100-1000 rows starting with the same first_stop_id in the rowkey. The total amount of rows will be in the billions and would take up about 1 TB.

On Sat, Apr 20, 2013 at 10:54 PM, Ted Yu yuzhih...@gmail.com wrote: The answer to your first question is yes - the midkey of the key range would be chosen as the split key. For #2, can you tell us how you plan to randomize the loading? Bulk load normally means preparing HFiles which are loaded directly into your table. [...]
Re: talk list table
+ http://blog.sematext.com/2012/12/24/hbasewd-and-hbasehut-handy-hbase-libraries-available-in-public-maven-repo/ if you use Maven and want to use HBaseWD.

Otis -- HBASE Performance Monitoring - http://sematext.com/spm/index.html

On Sat, Apr 20, 2013 at 11:24 AM, Amit Sela am...@infolinks.com wrote: Hope I'm not too late here... Regarding hotspotting with sequential keys, I'd suggest you read this Sematext blog post - http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/ - they present a nice idea there for this kind of issue. Good luck!

On Mon, Apr 15, 2013 at 11:18 PM, Ted Yu yuzhih...@gmail.com wrote:

bq. write performance would be lower
The above means poorer performance.

bq. I could batch them up application side
Please do that.

bq. I guess there is no way to turn that off?
That's right.

On Mon, Apr 15, 2013 at 11:15 AM, Kireet kir...@feedly.com wrote: Thanks for the reply. 'write performance would be lower' - does this mean better or worse? Also, I think I used the wrong terminology regarding batching: I meant to ask whether Append uses the client-side write buffer. I would think not, since the append() method returns a Result, but I could batch them up application-side, I suppose. Append also seems to return the updated value, which looks like a lot of unnecessary I/O in my case, since I am not immediately interested in the updated value. I guess there is no way to turn that off?

On 4/15/13 1:28 PM, Ted Yu wrote: I assume you would select HBase 0.94.6.1 (the latest release) for this project. For #1, write performance would be lower if you choose to use Append (vs. using Put).

bq. Can appends be batched by the client or do they execute immediately?
This depends on your use case. Take a look at the following method in HTable, where you can send a list of actions (Appends):

  public void batch(final List<? extends Row> actions, final Object[] results)

For #2,
bq. The other would be to prefix the timestamp row key with a random leading byte.
This technique has been used elsewhere and is better than the first one. Cheers

On Mon, Apr 15, 2013 at 6:09 AM, Kireet Reddy kireet-Teh5dPVPL8nQT0dZR+a...@public.gmane.org wrote:

We are planning to create a scheduled task list table in our HBase cluster. Essentially we will define a table keyed by timestamp, and the row contents will be all the tasks that need to be processed within that second (or whatever the time period is). I am trying to follow the 'reasonably wide rows' design mentioned in the HBaseCon OpenTSDB talk. A couple of questions:

1. Should we use Append or Put to create tasks? Since these rows will not live forever, storage space is not a concern; read/write performance is more important. As concurrency increases, I would guess the row lock may become an issue with Append? Can appends be batched by the client, or do they execute immediately?

2. I am a little worried about hotspots. This basic design may cause issues for the table's performance. Many tasks will execute and reschedule themselves using the same interval, t + 1 hour for example, so many of the writes may all go to the same block. Also, we have a lot of other data, so I am worried it may impact the performance of unrelated data if the region server gets too busy servicing the task list table. I can think of two strategies to avoid this. One would be to create N different tables and read/write tasks to them randomly. This may spread load across servers, but there is no guarantee HBase will place the tables on different region servers, correct? The other would be to prefix the timestamp row key with a random leading byte. Then, when reading from the task list table, consumers could scan from all possible values of the random byte + current timestamp to obtain tasks. Both strategies seem like they could spread out the load, but at the cost of more work/complexity to read tasks from the table. Do either of these approaches make sense?

On the read side, it seems like a similar problem exists, in that all consumers will be reading rows based on the current timestamp. Is this good because the block will very likely be cached, or bad because the region server may become overloaded? I have a feeling the answer is going to be 'it depends'. :) I did see the previous posts on queues and the tips there - use ZooKeeper for coordination, schedule major compactions, etc. Sorry if these questions are basic, I am pretty new to HBase. Thanks!
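[Editor's note: the random-leading-byte strategy discussed above can be sketched as follows. This is a hedged illustration, not HBase API code: the bucket count (16) and the helper names are hypothetical. Writers prepend one salt byte to the timestamp key; readers must issue one scan per possible salt value for the current timestamp, which is exactly the extra read cost Kireet anticipates.]

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SaltedKeys {
    static final int BUCKETS = 16; // hypothetical number of salt buckets

    // Writer side: prefix the timestamp key with one random salt byte so
    // tasks scheduled for the same second land in different regions.
    static byte[] saltedKey(long timestampSeconds, Random rnd) {
        byte salt = (byte) rnd.nextInt(BUCKETS);
        byte[] ts = String.format("%010d", timestampSeconds).getBytes(StandardCharsets.UTF_8);
        byte[] key = new byte[1 + ts.length];
        key[0] = salt;
        System.arraycopy(ts, 0, key, 1, ts.length);
        return key;
    }

    // Reader side: to fetch all tasks for one timestamp, a consumer scans
    // every salt bucket -- one scan start key per possible salt value.
    static List<byte[]> scanPrefixes(long timestampSeconds) {
        List<byte[]> prefixes = new ArrayList<>();
        byte[] ts = String.format("%010d", timestampSeconds).getBytes(StandardCharsets.UTF_8);
        for (int salt = 0; salt < BUCKETS; salt++) {
            byte[] p = new byte[1 + ts.length];
            p[0] = (byte) salt;
            System.arraycopy(ts, 0, p, 1, ts.length);
            prefixes.add(p);
        }
        return prefixes;
    }

    public static void main(String[] args) {
        byte[] k = saltedKey(1366400000L, new Random(42));
        System.out.println(k.length);                         // 11: 1 salt byte + 10 digit bytes
        System.out.println(scanPrefixes(1366400000L).size()); // 16
    }
}
```

The trade-off is visible in the code: writes spread over BUCKETS regions, but every read fans out into BUCKETS scans.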
Re: Overwrite a row
Here is code from the 0.94 code base:

  public void mutateRow(final RowMutations rm) throws IOException {
    new ServerCallable<Void>(connection, tableName, rm.getRow(), operationTimeout) {
      public Void call() throws IOException {
        server.mutateRow(location.getRegionInfo().getRegionName(), rm);
        return null;
      }
      // ...
    }
  }

where RowMutations has the following check:

  private void internalAdd(Mutation m) throws IOException {
    int res = Bytes.compareTo(this.row, m.getRow());
    if (res != 0) {
      throw new IOException("The row in the recently added Put/Delete "
          + Bytes.toStringBinary(m.getRow()) + " doesn't match the original one "
          + Bytes.toStringBinary(this.row));
    }
    // ...
  }

This means you need to issue multiple mutateRow() calls for different rows. I think you should consider the potential impact on performance of this limitation. For advanced usage, take a look at MultiRowMutationEndpoint:

  * This class demonstrates how to implement atomic multi row transactions using
  * {@link HRegion#mutateRowsWithLocks(java.util.Collection, java.util.Collection)}
  * and Coprocessor endpoints.

Cheers

On Sat, Apr 20, 2013 at 10:11 AM, Kristoffer Sjögren sto...@gmail.com wrote: Just to be absolutely clear, is this also true for a batch that spans multiple rows?

On Sat, Apr 20, 2013 at 2:42 PM, Ted Yu yuzhih...@gmail.com wrote: Operations within each batch are atomic: they would either all succeed or all fail, and the timestamps would all refer to the latest cell (KeyValue). Cheers

On Apr 20, 2013, at 12:17 AM, Kristoffer Sjögren sto...@gmail.com wrote: The schema is known beforehand, so this is exactly what I need. Great! One more question: what guarantees does the batch operation have? [...]
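[Editor's note: the internalAdd() check quoted above can be mirrored client-side before building a RowMutations object, failing fast instead of waiting for the server call. This is a sketch; only the same-row comparison semantics are taken from the quoted 0.94 source, the validate helper itself is hypothetical.]

```java
import java.util.Arrays;
import java.util.List;

public class SameRowCheck {
    // Mirrors RowMutations.internalAdd(): every Put/Delete added to a
    // RowMutations object must target the same row, otherwise it is
    // rejected. Checking up front avoids a doomed round trip.
    static void validateSameRow(byte[] row, List<byte[]> mutationRows) {
        for (byte[] r : mutationRows) {
            if (!Arrays.equals(row, r)) {
                throw new IllegalArgumentException(
                    "The row in the recently added Put/Delete " + new String(r)
                    + " doesn't match the original one " + new String(row));
            }
        }
    }

    public static void main(String[] args) {
        byte[] row = "user-42".getBytes();
        // Same row for every mutation: passes silently.
        validateSameRow(row, List.of("user-42".getBytes(), "user-42".getBytes()));
        // A mutation for a different row: rejected, as mutateRow() would do.
        try {
            validateSameRow(row, List.of("user-43".getBytes()));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected cross-row mutation");
        }
    }
}
```

This is why an atomic "overwrite the whole row" (delete markers plus new puts in one RowMutations) works, while atomic mutations spanning several rows need the coprocessor route mentioned above.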
Re: default region splitting on which value?
Thanks for sharing the information below. How do you plan to store the time (when the bus gets to each stop) in the row? Or maybe that is not of importance to you?

On Sat, Apr 20, 2013 at 2:24 PM, Pal Konyves paul.kony...@gmail.com wrote: I am making a paper for school about HBase, so the data I chose is not a real usable example. I am familiar with GTFS, a de facto standard for storing information about public transportation schedules. [...]