[jira] [Updated] (HBASE-6550) Refactoring ReplicationSink to make it more responsive of cluster health
[ https://issues.apache.org/jira/browse/HBASE-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-6550:
Fix Version/s: (was: 0.95.0) 0.94.2

Fix up after a bulk move overwrote some 0.94.2 fix versions with 0.95.0 (noticed by Lars Hofhansl).

Refactoring ReplicationSink to make it more responsive of cluster health
Key: HBASE-6550
URL: https://issues.apache.org/jira/browse/HBASE-6550
Project: HBase
Issue Type: New Feature
Components: Replication
Reporter: Himanshu Vashishtha
Assignee: Himanshu Vashishtha
Fix For: 0.94.2
Attachments: 6550-havealook.txt, HBase-6550-0.94.patch, HBase-6550-0.94-v2.patch, HBase-6550-0.94-v3.patch, HBase-6550.patch, HBase-6550-v1.patch, HBase-6550-v3.patch, HBase-6550-v4.patch, HBase-6550-v5.patch, HBase-6550-v6.patch

ReplicationSink replicates the WALEdits in the local cluster. It uses the native HBase client to insert the mutations. Sometimes a batch takes a while to process (for example because of region splitting or a GC pause) and the client goes into its retry phase. This has two repercussions:
a) The regionserver handler serving the request (until now, a priority handler) is blocked for this period.
b) The caller may time out and retry anyway, while the handler serving the original ReplicationSink request is still working.

Refactoring ReplicationSink to have the following features:
a) Make it more configurable (its own retry limit, connection timeout, etc.).
b) Add fail-fast behavior so that it bails out if the caller has timed out or if any exception occurs while processing the mutation batch.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
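The description above asks for a sink-specific retry limit and timeout. As a rough illustration of that idea only, the sketch below bounds an operation by both an attempt count and a deadline. It uses plain Java with no HBase classes; the class name, method name, and the limits passed in `main` are hypothetical and are not taken from any of the attached patches.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Illustrative only: every name here is hypothetical, not part of the HBase patch.
class SinkRetrySketch {

    /** Runs op up to maxAttempts times, but never starts an attempt after the deadline. */
    static <T> T callWithLimits(Callable<T> op, int maxAttempts, long timeoutMs)
            throws IOException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        Exception last = null;
        int attempts = 0;
        while (attempts < maxAttempts && System.nanoTime() - deadline < 0) {
            attempts++;
            try {
                return op.call();
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop retrying immediately.
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while applying batch", e);
            } catch (Exception e) {
                last = e; // remember the failure and retry if the limits allow
            }
        }
        throw new IOException("giving up after " + attempts + " attempt(s)", last);
    }

    public static void main(String[] args) throws IOException {
        // Assumed limits for the example: 4 attempts within 10 seconds.
        String result = callWithLimits(() -> "batch applied", 4, 10_000);
        System.out.println(result);
    }
}
```

Keeping both limits small means a handler is released quickly when the local cluster is unhealthy, which is the concern the description raises about blocked priority handlers.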
[jira] [Updated] (HBASE-6550) Refactoring ReplicationSink to make it more responsive of cluster health
[ https://issues.apache.org/jira/browse/HBASE-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-6550:
Resolution: Fixed
Hadoop Flags: Reviewed
Status: Resolved (was: Patch Available)

Committed to 0.94 and 0.96

Fix For: 0.96.0, 0.94.2
Attachments: 6550-havealook.txt, HBase-6550-0.94.patch, HBase-6550-0.94-v2.patch, HBase-6550-0.94-v3.patch, HBase-6550.patch, HBase-6550-v1.patch, HBase-6550-v3.patch, HBase-6550-v4.patch, HBase-6550-v5.patch, HBase-6550-v6.patch
[jira] [Updated] (HBASE-6550) Refactoring ReplicationSink to make it more responsive of cluster health
[ https://issues.apache.org/jira/browse/HBASE-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-6550:
Fix Version/s: 0.94.2 0.96.0

Let's get this into 0.94 as well.

Fix For: 0.96.0, 0.94.2
Attachments: 6550-havealook.txt, HBase-6550.patch, HBase-6550-v1.patch, HBase-6550-v3.patch, HBase-6550-v4.patch
[jira] [Updated] (HBASE-6550) Refactoring ReplicationSink to make it more responsive of cluster health
[ https://issues.apache.org/jira/browse/HBASE-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Himanshu Vashishtha updated HBASE-6550:
Status: Patch Available (was: Open)

Fix For: 0.96.0, 0.94.2
Attachments: 6550-havealook.txt, HBase-6550.patch, HBase-6550-v1.patch, HBase-6550-v3.patch, HBase-6550-v4.patch
[jira] [Updated] (HBASE-6550) Refactoring ReplicationSink to make it more responsive of cluster health
[ https://issues.apache.org/jira/browse/HBASE-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Himanshu Vashishtha updated HBASE-6550:
Attachment: HBase-6550.patch

OK, I removed the bailout behavior. Attached is a patch. Replication tests pass; I also did smoke testing on a real cluster.

Attachments: 6550-havealook.txt, HBase-6550.patch, HBase-6550-v1.patch
[jira] [Updated] (HBASE-6550) Refactoring ReplicationSink to make it more responsive of cluster health
[ https://issues.apache.org/jira/browse/HBASE-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Himanshu Vashishtha updated HBASE-6550:
Attachment: HBase-6550-v3.patch

Closing the connection and the thread pool in separate try-catch blocks.

Attachments: 6550-havealook.txt, HBase-6550.patch, HBase-6550-v1.patch, HBase-6550-v3.patch
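The v3 note above describes closing the connection and the thread pool in separate try-catch blocks. The patch itself is not reproduced in this thread, so the following is only a minimal sketch of that pattern in plain Java: a generic `Closeable` stands in for the connection, and the class and method names are hypothetical. The point is that a failure while closing one resource does not prevent the other from being released.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: a generic Closeable stands in for the sink's connection.
class SinkShutdownSketch {

    /** Releases both resources; a failure in one block does not skip the other. */
    static void stop(Closeable connection, ExecutorService pool) {
        try {
            connection.close();
        } catch (IOException e) {
            System.err.println("Failed to close connection: " + e.getMessage());
        }
        try {
            pool.shutdown();
        } catch (RuntimeException e) {
            // shutdown() can throw SecurityException under a security manager
            System.err.println("Failed to shut down pool: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Simulate a connection whose close() fails.
        Closeable connection = () -> { throw new IOException("simulated close failure"); };
        stop(connection, pool);
        System.out.println("pool shut down: " + pool.isShutdown());
    }
}
```

With a single try block around both calls, the simulated close failure would leave the pool running.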
[jira] [Updated] (HBASE-6550) Refactoring ReplicationSink to make it more responsive of cluster health
[ https://issues.apache.org/jira/browse/HBASE-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Himanshu Vashishtha updated HBASE-6550:
Attachment: HBase-6550-v4.patch

Sorry, I missed the clone in the last patch. Included other comments. Thank you all for the feedback.

Attachments: 6550-havealook.txt, HBase-6550.patch, HBase-6550-v1.patch, HBase-6550-v3.patch, HBase-6550-v4.patch
[jira] [Updated] (HBASE-6550) Refactoring ReplicationSink to make it more responsive of cluster health
[ https://issues.apache.org/jira/browse/HBASE-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Himanshu Vashishtha updated HBASE-6550:
Attachment: HBase-6550-v1.patch

Attached is a patch to incorporate the suggestions mentioned in the description. Testing: jenkins is green; ran replication for a few days (intermittently running ycsb write load on master), in tandem with HBase-6165.

Attachments: HBase-6550-v1.patch
[jira] [Updated] (HBASE-6550) Refactoring ReplicationSink to make it more responsive of cluster health
[ https://issues.apache.org/jira/browse/HBASE-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-6550:
Attachment: 6550-havealook.txt

So you are guarding against the client (i.e. the ReplicationSource) going away. I see. Although I would not think that would be a common problem once the timeouts here are short enough. Just because it is easier to make a patch than to describe what I mean, I made one. I am *not* saying you should do it this way, just showing what I mean. Have a look (the Executor probably needs tweaking and the DaemonThreadFactory should go into a common class, but you get the gist).

Attachments: 6550-havealook.txt, HBase-6550-v1.patch
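The comment above mentions an Executor and a DaemonThreadFactory, but the attached 6550-havealook.txt is not reproduced in this thread. The sketch below is therefore only a guess at the general shape of such an approach, in plain Java: a thread factory that produces daemon threads, and a helper that runs a task on a pool with a short timeout. All names and the timeout value are assumptions for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: names are hypothetical and not taken from the attachment.
class DaemonExecutorSketch {

    /** Creates named daemon threads so an idle pool never keeps the JVM alive. */
    static final class DaemonThreadFactory implements ThreadFactory {
        private final String prefix;
        private final AtomicInteger counter = new AtomicInteger(1);

        DaemonThreadFactory(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public Thread newThread(Runnable r) {
            Thread t = new Thread(r, prefix + counter.getAndIncrement());
            t.setDaemon(true);
            return t;
        }
    }

    /** Runs task on the pool and cancels it if no result arrives within timeoutMs. */
    static <T> T runWithTimeout(ExecutorService pool, Callable<T> task, long timeoutMs)
            throws Exception {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so the caller is not left waiting
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool =
            Executors.newFixedThreadPool(2, new DaemonThreadFactory("sink-worker-"));
        try {
            // Assumed timeout of one second for the example.
            System.out.println(runWithTimeout(pool, () -> "batch applied", 1_000));
        } finally {
            pool.shutdown();
        }
    }
}
```

Running the work on a separate pool lets the calling handler return when the timeout expires instead of waiting out the client's full retry cycle.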