[ https://issues.apache.org/jira/browse/HBASE-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460737#comment-13460737 ]
stack commented on HBASE-6738:
------------------------------

What do you mean here: {quote}This allows to continue if the worker cannot actually handle it, + // for any reason.{quote}

This seems like a small change extending the timeout while also reacting faster if the server is actually gone. I'm +1 on the patch.

> Too aggressive task resubmission from the distributed log manager
> -----------------------------------------------------------------
>
>                 Key: HBASE-6738
>                 URL: https://issues.apache.org/jira/browse/HBASE-6738
>             Project: HBase
>          Issue Type: Bug
>          Components: master, regionserver
>    Affects Versions: 0.94.1, 0.96.0
>         Environment: 3-node cluster test, but it can occur as well on a much bigger one. It's all luck!
>            Reporter: nkeywal
>            Priority: Critical
>         Attachments: 6738.v1.patch
>
>
> With the default settings "hbase.splitlog.manager.timeout" => 25s and "hbase.splitlog.max.resubmit" => 3.
> In the tests mentioned in HBASE-5843, I see variations around this scenario, on 0.94 + HDFS 1.0.3:
> The regionserver in charge of the split does not answer within 25s, so it gets interrupted but actually continues. Sometimes we run out of retries, sometimes not; sometimes we are out of retries but, as the interrupts were ignored, the split finishes nicely anyway. In the meantime, the same single task is executed in parallel by multiple nodes, increasing the probability of race conditions.
> Details:
> t0: unplug a box with DN+RS.
> t + x: other boxes are already connected, so their connections start to die. Nevertheless, they don't consider this node as suspect.
> t + 180s: zookeeper -> master detects the node as dead. Recovery starts. It can be less than 180s; sometimes it's around 150s.
> t + 180s: distributed split starts. There is only 1 task; it's immediately acquired by one RS.
> t + 205s: the RS has multiple errors when splitting, because a datanode is missing as well. The master decides to give the task to someone else, but often the task continues on the first RS. Interrupts are often ignored, as acknowledged in the code itself ("// TODO interrupt often gets swallowed, do what else?"):
> {code}
> 2012-09-04 18:27:30,404 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
> {code}
> t + 211s: two regionservers are processing the same task. They fight for the leases:
> {code}
> 2012-09-04 18:27:32,004 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: Lease mismatch on /hbase/TABLE/4d1c1a4695b1df8c58d13382b834332e/recovered.edits/0000000000000000037.temp owned by DFSClient_hb_rs_BOX2,60020,1346775882980 but is accessed by DFSClient_hb_rs_BOX1,60020,1346775719125
> {code}
> They can fight like this over many files, until the tasks finally get interrupted or finished.
> The task on the second box can be cancelled as well. In this case, the task is created again for a new box.
> The master seems to stop after 3 attempts. It can also give up splitting the files altogether. Sometimes the tasks were not cancelled on the RS side, so the split is finished despite what the master thinks and logs. In this case, the assignment starts. In the other case, it's "we've got a problem".
> {code}
> 2012-09-04 18:43:52,724 INFO org.apache.hadoop.hbase.master.SplitLogManager: Skipping resubmissions of task /hbase/splitlog/hdfs%3A%2F%2FBOX1%3A9000%2Fhbase%2F.logs%2FBOX0%2C60020%2C1346776587640-splitting%2FBOX0%252C60020%252C1346776587640.1346776587832 because threshold 3 reached
> {code}
> t + 300s: split is finished. Assignment starts.
> t + 330s: assignment is finished; regions are available again.
> There are a lot of possible subcases depending on the number of log files, of region servers, and so on.
> The issues are:
> 1) It's difficult, especially in HBase but not only there, to interrupt a task. The pattern is often:
> {code}
> void f() throws IOException {
>   try {
>     // whatever throws InterruptedException
>   } catch (InterruptedException e) {
>     throw new InterruptedIOException();
>   }
> }
>
> boolean g() {
>   int nbRetry = 0;
>   for (;;) {
>     try {
>       f();
>       return true;
>     } catch (IOException e) {
>       nbRetry++;
>       if (nbRetry > maxRetry) return false;
>     }
>   }
> }
> {code}
> This typically swallows the interrupt: InterruptedIOException is a subclass of IOException, so g() catches it like any other error and simply retries. There are other variations, but this one seems to be the standard.
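> A minimal sketch of the same loop rewritten so the interrupt is not lost; f() and maxRetry are the same illustrative names as above:
> {code}
> boolean g() {
>   int nbRetry = 0;
>   for (;;) {
>     try {
>       f();
>       return true;
>     } catch (InterruptedIOException e) {
>       // The caller asked us to stop: restore the interrupt flag
>       // and give up instead of counting this as an ordinary retry.
>       Thread.currentThread().interrupt();
>       return false;
>     } catch (IOException e) {
>       nbRetry++;
>       if (nbRetry > maxRetry) return false;
>     }
>   }
> }
> {code}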
> Even if we fix this in HBase, we need the other layers to be interruptible as well. That's not proven.
> 2) 25s is very aggressive, considering that we have a default timeout of 180s for zookeeper. In other words, we give a regionserver 180s before acting, but when it comes to the split, it's only 25s. There may be reasons for this, but it seems dangerous, as during a failure the cluster is less available than during normal operations. We could do several things around this, for example:
> => Obvious option: increase the timeout at each try, something like *2.
> => Also possible: increase the initial timeout.
> => Check for an update instead of blindly cancelling + resubmitting.
> 3) Globally, it seems that this retry mechanism duplicates the failure detection already in place with ZK. Would it not make sense to just hook into this existing detection mechanism, and resubmit a task if and only if we detect that the regionserver in charge died? During a failure scenario we should be much more gentle than during normal operation, not the opposite.
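> A rough sketch of what hooking into the existing ZK detection could look like: watch the worker's ephemeral znode and resubmit only once it actually disappears. The znode path, workerName, taskPath, and the resubmitTask() helper are illustrative, not the actual SplitLogManager code:
> {code}
> // Resubmit only when ZK confirms the worker is gone,
> // instead of after a fixed 25s timeout.
> Stat stat = zk.exists("/hbase/rs/" + workerName, new Watcher() {
>   @Override
>   public void process(WatchedEvent event) {
>     if (event.getType() == Event.EventType.NodeDeleted) {
>       // The worker's session expired, so it is really dead:
>       // it is now safe to resubmit its task.
>       resubmitTask(taskPath);
>     }
>   }
> });
> if (stat == null) {
>   // The worker was already gone when we checked.
>   resubmitTask(taskPath);
> }
> {code}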