[jira] [Commented] (HBASE-9480) Regions are unexpectedly made offline in certain failure conditions
[ https://issues.apache.org/jira/browse/HBASE-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768293#comment-13768293 ]

Feng Honghua commented on HBASE-9480:

bq. (Cc'ing Feng Honghua since he expressed interest in this area too, as is Jimmy Xiang of course)

Yes. It seems the current master/zk/RS communication pattern (the RS updates a zk node, the master watches for changes to that node), together with the asynchronous and one-time nature of zk watches, results in too many corner cases for the assignment manager (and for region split). I'm preparing a proposal for a new master/zk/RS communication pattern. The main theme: the master sends a request to the RS, the RS reports its progress back to the master, and the master persists the request progress in another system table (like the meta table); the reason for not using zk here is better throughput/performance for huge tables with large numbers of regions... [~stack] / [~jxiang]

Regions are unexpectedly made offline in certain failure conditions

Key: HBASE-9480
URL: https://issues.apache.org/jira/browse/HBASE-9480
Project: HBase
Issue Type: Bug
Reporter: Devaraj Das
Assignee: Jimmy Xiang
Fix For: 0.98.0, 0.96.0
Attachments: 9480-1.txt, trunk-9480.patch, trunk-9480_v1.1.patch, trunk-9480_v1.2.patch, trunk-9480_v2.patch

Came across this issue (HBASE-9338 test):
1. Client issues a request to move a region from ServerA to ServerB.
2. ServerA is compacting that region and doesn't close the region immediately; in fact, it takes a while to complete the request.
3. The master, in the meantime, sends another close request.
4. ServerA sends it a NotServingRegionException.
5. The master handles the exception, deletes the znode, and invokes regionOffline for the said region.
6. ServerA fails to operate on ZK in the CloseRegionHandler since the node is deleted. The region is permanently offline.

There are potentially other situations where, when a RegionServer is offline and the client asks for a region move off that server, the master makes the region offline.
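The six steps above can be sketched as a toy simulation (all names here are illustrative, not the actual master/RS code): once the master's handler for the second close deletes the znode, the first, still-in-flight CloseRegionHandler on ServerA has nothing left to transition, and nobody ever reassigns the region.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy simulation of steps 3-6 of the race described above.
public class RegionMoveRace {
    public static void main(String[] args) {
        Set<String> znodes = new HashSet<>();
        Map<String, String> masterState = new HashMap<>();
        String region = "region-A";

        // master asks ServerA to close; a CLOSING znode tracks the transition
        znodes.add(region);
        masterState.put(region, "PENDING_CLOSE");

        // steps 3-5: the second close request hits NotServingRegionException;
        // the master's handler deletes the znode and marks the region offline
        znodes.remove(region);
        masterState.put(region, "OFFLINE");

        // step 6: ServerA's CloseRegionHandler finally finishes the slow close
        // and tries to transition the (now missing) znode -- it fails, and the
        // region stays OFFLINE with no one left to reassign it
        boolean transitioned = znodes.remove(region);
        System.out.println("znode transition succeeded: " + transitioned);
        System.out.println("master state: " + masterState.get(region));
    }
}
```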
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-9466) Read-only mode
[ https://issues.apache.org/jira/browse/HBASE-9466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768946#comment-13768946 ]

Feng Honghua commented on HBASE-9466:

[~stack] / [~jdcryans]: OK, I'll build a generic framework first and then implement read-only on top of it.

Read-only mode

Key: HBASE-9466
URL: https://issues.apache.org/jira/browse/HBASE-9466
Project: HBase
Issue Type: New Feature
Reporter: Feng Honghua
Priority: Minor

Can we provide a read-only mode for a table? Writes to a table in read-only mode would be rejected, but read-only mode differs from disable in that:
1. it doesn't offline the regions of the table (hence it is much more lightweight than disable)
2. it can serve read requests

Comments?
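A minimal sketch of the proposed behavior (all class and method names here are illustrative, not an actual HBase API): a read-only table rejects writes up front but keeps serving reads, and its regions are never offlined, unlike disable.

```java
import java.io.IOException;

// Hypothetical sketch of table-level read-only mode: writes fail fast,
// reads proceed as usual, regions stay online.
public class ReadOnlyTable {
    public static class TableReadOnlyException extends IOException {
        public TableReadOnlyException(String msg) { super(msg); }
    }

    private final String name;
    private volatile boolean readOnly;   // toggled by an admin operation

    public ReadOnlyTable(String name) { this.name = name; }

    public void setReadOnly(boolean readOnly) { this.readOnly = readOnly; }

    public void put(String row, String value) throws TableReadOnlyException {
        if (readOnly) {
            // rejected here, before any memstore/WAL work; the regions of the
            // table were never closed, so toggling back is cheap
            throw new TableReadOnlyException("table " + name + " is in read-only mode");
        }
        // ... normal write path ...
    }

    public String get(String row) {
        return "value-for-" + row;       // reads are served as usual
    }
}
```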
[jira] [Commented] (HBASE-8751) Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster
[ https://issues.apache.org/jira/browse/HBASE-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768963#comment-13768963 ]

Feng Honghua commented on HBASE-8751:

[~jdcryans] Thanks for the thorough code review, but the following is not true:

bq. This is in ReplicationSource.removeNonReplicableEdits() and that method is called for each HLog.Entry, which means that you'd hit ZK from all the region servers for as many write calls as they are getting. That seems excessive.

== zkHelper.getTableCFs(peerId) delegates to ReplicationPeer.getTableCFs, and ReplicationPeer maintains the current table/cf config in its tableCFs field, returning it on each getTableCFs call. ReplicationPeer also has a tableCFTracker that watches the tableCF zk node and updates the tableCFs field whenever that node is changed (by the user via the shell). This process is similar to the peer state (enable/disable) treatment. So the tableCF zk node is accessed only as many times as it is updated, not as many times as ReplicationSource.removeNonReplicableEdits() is called (once per HLog.Entry).

Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster

Key: HBASE-8751
URL: https://issues.apache.org/jira/browse/HBASE-8751
Project: HBase
Issue Type: Improvement
Components: Replication
Reporter: Feng Honghua
Attachments: HBASE-8751-0.94-V0.patch

Consider these scenarios (all cfs have replication-scope=1):
1) cluster S has 3 tables: table A has cfA,cfB; table B has cfX,cfY; table C has cf1,cf2.
2) cluster X wants to replicate table A : cfA, table B : cfX, and table C from cluster S.
3) cluster Y wants to replicate table B : cfY and table C : cf2 from cluster S.
The current replication implementation can't achieve this, since it pushes the data of all the replicatable column-families from cluster S to all its peers, X/Y in this scenario.
This improvement provides a fine-grained replication scheme which enables a peer cluster to choose the column-families/tables it really wants from the source cluster:
A). Set the table:cf-list for a peer when adding it: hbase-shell add_peer '3', zk:1100:/hbase, table1; table2:cf1,cf2; table3:cf2
B). View the table:cf-list config for a peer using show_peer_tableCFs: hbase-shell show_peer_tableCFs 1
C). Change/set the table:cf-list for a peer using set_peer_tableCFs: hbase-shell set_peer_tableCFs '2', table1:cfX; table2:cf1; table3:cf1,cf2

In this scheme, replication-scope=1 only means a column-family CAN be replicated to other clusters; the 'table:cf-list' alone determines WHICH cf/table will actually be replicated to a specific peer. For backward compatibility, an empty 'table:cf-list' replicates all replicatable cfs/tables. (This means we don't allow a peer which replicates nothing from a source cluster; we think that's reasonable: if replicating nothing, why bother adding a peer?)

This improvement addresses the exact problem raised by the first FAQ in http://hbase.apache.org/replication.html: "GLOBAL means replicate? Any provision to replicate only to cluster X and not to cluster Y? or is that for later? Yes, this is for much later."

I also noticed somebody mentioned making replication-scope an integer rather than a boolean for such fine-grained replication, but I think extending replication-scope can't achieve the same replication granularity and flexibility as the per-peer configuration above.

This improvement has been running smoothly in our production clusters (Xiaomi) for several months.
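The 'table:cf-list' string format shown in A) and C) above could be parsed roughly as follows. This is a hypothetical sketch (class and method names are mine, not from the patch), assuming ';' separates tables and ',' separates column families, with an empty cf-list meaning "all replicatable cfs of that table":

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for a per-peer "table:cf-list" config string such as
// "table1; table2:cf1,cf2; table3:cf2".
public class TableCFsParser {
    public static Map<String, List<String>> parse(String tableCFsConfig) {
        Map<String, List<String>> tableCFs = new LinkedHashMap<>();
        if (tableCFsConfig == null || tableCFsConfig.trim().isEmpty()) {
            return tableCFs; // empty config: replicate every replicatable table/cf
        }
        for (String entry : tableCFsConfig.split(";")) {
            String[] parts = entry.trim().split(":");
            String table = parts[0].trim();
            List<String> cfs = new ArrayList<>();
            if (parts.length > 1) {
                for (String cf : parts[1].split(",")) {
                    cfs.add(cf.trim()); // explicit cf list for this table
                }
            }
            // no ':' part means all replicatable cfs of this table
            tableCFs.put(table, cfs);
        }
        return tableCFs;
    }

    public static void main(String[] args) {
        System.out.println(parse("table1; table2:cf1,cf2; table3:cf2"));
        // {table1=[], table2=[cf1, cf2], table3=[cf2]}
    }
}
```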
[jira] [Updated] (HBASE-9466) Read-only mode
[ https://issues.apache.org/jira/browse/HBASE-9466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Honghua updated HBASE-9466:

Assignee: Feng Honghua
[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region
[ https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769138#comment-13769138 ]

Feng Honghua commented on HBASE-9467:

[~stack] Thanks for the effort.

write can be totally blocked temporarily by a write-heavy region

Key: HBASE-9467
URL: https://issues.apache.org/jira/browse/HBASE-9467
Project: HBase
Issue Type: Improvement
Reporter: Feng Honghua
Assignee: Feng Honghua
Attachments: HBASE-9467-trunk-v0.patch, HBASE-9467-trunk-v1.patch, HBASE-9467-trunk-v1.patch, HBASE-9467-trunk-v1.patch

Writes to a region can be blocked temporarily if the memstore of that region reaches the threshold (hbase.hregion.memstore.block.multiplier * hbase.hregion.flush.size), until the memstore of that region is flushed. For a write-heavy region, if its write requests saturate all the handler threads of that RS when write blocking for that region occurs, requests of other regions/tables on that RS also can't be served due to no available handler threads... until the pending writes of that write-heavy region are served after the flush is done. Hence during this time period, from the RS perspective, it can't serve any request from any table/region, just due to a single write-heavy region. This doesn't sound very reasonable, right? Maybe write requests for a region should only be served by a subset of the handler threads, so that write blocking of any single region can't lead to the scenario mentioned above? Comments?
[jira] [Commented] (HBASE-8751) Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster
[ https://issues.apache.org/jira/browse/HBASE-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769173#comment-13769173 ]

Feng Honghua commented on HBASE-8751:

[~jdcryans]

bq. Normally in the HBase code when you append str or string to a method name, it just means that it does the same thing but returns a String instead. You'll have to review naming.

== What about 'getTableCFsConfig' instead of getTableCFsStr?

bq. Digging down to ReplicationPeer.tableCFs, shouldn't this be at least be made a volatile? It seems you could have some inconsistencies.

== The ReplicationPeer.tableCFs map can only be updated by the (same, single) zookeeper event thread (ZooKeeperWatcher) of ReplicationPeer, and ReplicationSource calls zkHelper.getTableCFs(peerId) for each hlog entry. So it doesn't seem strictly necessary to declare it volatile?
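The caching pattern under discussion can be sketched as follows (names are illustrative, modeled on the description in the comments rather than the actual patch): the per-entry lookup reads a cached field, and only the single zk event thread replaces it when the node changes. Note that declaring the field volatile is what guarantees the replacement is promptly visible to the reader threads, which is why the reviewer's suggestion is sound even with a single writer thread.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of ReplicationPeer's table:cf caching: readers hit the
// cached field (no ZK round trip per HLog.Entry); the single ZooKeeperWatcher
// event thread replaces the field wholesale when the tableCF znode changes.
public class ReplicationPeerCache {
    // replaced atomically by the zk event thread; readers never mutate it
    private volatile Map<String, List<String>> tableCFs = Collections.emptyMap();

    // called by the single zk event thread on each tableCF znode change
    public void onTableCFsNodeChanged(Map<String, List<String>> newConfig) {
        this.tableCFs = Collections.unmodifiableMap(newConfig);
    }

    // called by ReplicationSource for each hlog entry: a plain field read
    public List<String> getCFsFor(String table) {
        return tableCFs.get(table);
    }
}
```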
[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region
[ https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768045#comment-13768045 ]

Feng Honghua commented on HBASE-9467:

[~nkeywal]

bq. I would propose to reuse RegionTooBusyException. It seems too similar to RegionOverloadedException

Agreed. I'll make a new patch accordingly.
[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region
[ https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768046#comment-13768046 ]

Feng Honghua commented on HBASE-9467:

[~tlipcon]

bq. is this a compatible change?

Thanks for the reminder. Per [~nkeywal]'s suggestion, there is no compatibility issue if we reuse RegionTooBusyException here, right?
[jira] [Updated] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region
[ https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Honghua updated HBASE-9467:

Attachment: HBASE-9467-trunk-v0.patch

patch for trunk
[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region
[ https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766263#comment-13766263 ]

Feng Honghua commented on HBASE-9467:

Changes and explanation of the patch:
1. Throw RegionOverloadedException immediately, rather than wait/retry within HRegion, when the target region is above the memstore limit. This avoids write requests on an over-limit region occupying/saturating handler threads. This change is in the HRegion.checkResources method.
2. Reuse the exception handling and retry mechanism of AsyncProcess in the client to handle RegionOverloadedException thrown from the RS. Since RegionOverloadedException is not a DoNotRetryIOException, AsyncProcess handles it the same way as other non-DoNotRetryIOExceptions thrown from the RS, and the corresponding request is retried with incremental backoff. In a more general sense, we can view RegionOverloadedException as another kind of retriable exception and reuse all the current handling for it in AsyncProcess/client, so no client-side code change is needed. And if we really want exponential backoff rather than incremental backoff for RegionOverloadedException, as Todd suggested, we can change the code in AsyncProcess accordingly.
3. We also need to check the memstore limit and throw RegionOverloadedException for 'increment' and 'append' operations, since they also insert KVs into the memstore and increase its size. (checkResources was not called for these two operations in HRegion previously; corrected here.)
4. In the UT TestHFileArchiving, RegionOverloadedException is thrown during loadRegion, and since the 'put' operations are called directly via HRegion, not via client/AsyncProcess, a similar 'catch-and-wait' handling is added there to proceed without failure.

[~nkeywal] / [~stack] / [~tlipcon]: Any feedback on the patch? Thanks in advance.
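Point 1 above, the fail-fast check, can be sketched roughly like this. This is a simplified illustration (class name and fields are mine; the real checkResources lives inside HRegion with much more context), showing why throwing immediately frees the handler thread instead of parking it until the flush completes:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical fail-fast region resource check: instead of blocking the
// handler thread until the memstore is flushed, throw immediately so the
// client can retry with backoff and the handler can serve other regions.
public class FailFastRegion {
    public static class RegionOverloadedException extends IOException {
        public RegionOverloadedException(String msg) { super(msg); }
    }

    private final AtomicLong memstoreSize = new AtomicLong();
    private final long blockingMemStoreSize; // flushSize * blockMultiplier

    public FailFastRegion(long flushSize, long blockMultiplier) {
        this.blockingMemStoreSize = flushSize * blockMultiplier;
    }

    public void checkResources() throws RegionOverloadedException {
        if (memstoreSize.get() > blockingMemStoreSize) {
            // fail fast: no wait/retry loop here, the handler thread returns
            throw new RegionOverloadedException("memstore size " + memstoreSize.get()
                + " above blocking limit " + blockingMemStoreSize);
        }
    }

    public void put(long editSizeBytes) throws RegionOverloadedException {
        checkResources();                 // same check applies to increment/append
        memstoreSize.addAndGet(editSizeBytes);
    }

    public void flush() { memstoreSize.set(0); } // flush resets the memstore
}
```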
[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region
[ https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13764082#comment-13764082 ]

Feng Honghua commented on HBASE-9467:

I like [~tlipcon]'s idea that we reject writes with RegionOverloadedException rather than blocking them. This treatment also avoids the unnecessary scenario where the RS eventually finishes the write after the memstore flush is done, but the client has already received a timeout because the flush took too long.

But a client receiving such an exception can only back off for writes with the same rowKey as the one that got the exception; it can't prevent writes with different rowKeys belonging to the same region from hitting the RS and getting RegionOverloadedException as well (considering the client is typically unaware of the region key range when writing).
[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region
[ https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13764133#comment-13764133 ]

Feng Honghua commented on HBASE-9467:

[~liochon] Maybe I used the wrong term here; by 'client' I meant code calling HTable.put() etc. to write data to HBase. Sure, the HBase client knows the key ranges of regions. I can write the patch per [~tlipcon]'s solution, if there's no objection. :)
[jira] [Updated] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region
[ https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Honghua updated HBASE-9467:

Assignee: Feng Honghua
[jira] [Created] (HBASE-9500) RpcServer can't be restarted after stop for reuse
Feng Honghua created HBASE-9500:

Summary: RpcServer can't be restarted after stop for reuse
Key: HBASE-9500
URL: https://issues.apache.org/jira/browse/HBASE-9500
Project: HBase
Issue Type: Improvement
Reporter: Feng Honghua
Priority: Minor

Currently RpcServer is designed/implemented without an interface/capability for users to restart a stopped RpcServer.
[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region
[ https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13764146#comment-13764146 ]

Feng Honghua commented on HBASE-9467:

Thanks [~dnicolas] for the reminder, I'll take these into account.
[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region
[ https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13764150#comment-13764150 ]

Feng Honghua commented on HBASE-9467:

Correcting a typo in the above comment: Thanks [~liochon]

btw: is there no way to delete/edit a submitted comment? sounds inconvenient :-(
[jira] [Updated] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region
[ https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Honghua updated HBASE-9467:
Priority: Major (was: Minor)

write can be totally blocked temporarily by a write-heavy region
Key: HBASE-9467 URL: https://issues.apache.org/jira/browse/HBASE-9467
[jira] [Created] (HBASE-9501) No throttling for replication
Feng Honghua created HBASE-9501:
Summary: No throttling for replication
Key: HBASE-9501 URL: https://issues.apache.org/jira/browse/HBASE-9501 Project: HBase Issue Type: Improvement Components: Replication Reporter: Feng Honghua

When we disable a peer for a period of time and then enable it, the ReplicationSource in the master cluster pushes the hlog entries accumulated during the disabled interval to the re-enabled peer cluster at full speed. If the bandwidth between the two clusters is shared by different applications, this full-speed replication push can consume all the bandwidth and severely affect the other applications. There are two configs, replication.source.size.capacity and replication.source.nb.capacity, to tweak the batch size of each push, but decreasing them only increases the number of pushes, and the pushes proceed continuously without pause, so they are no real help for bandwidth throttling. From a bandwidth-sharing and push-speed perspective, it's more reasonable to provide an upper bandwidth limit for each peer's push channel; within that limit, a peer can still choose a big batch size per push for bandwidth efficiency. Any opinions?
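The per-peer bandwidth limit proposed above could be enforced with a simple per-second byte budget that pauses pushes when the budget is exhausted; a hypothetical sketch (the `ReplicationThrottler` class and its parameters are illustrative, not part of HBase):

```python
import time

class ReplicationThrottler:
    """Hypothetical per-peer throttler: cap the bytes pushed per one-second cycle."""
    def __init__(self, bandwidth_limit, clock=time.monotonic, sleep=time.sleep):
        self.bandwidth_limit = bandwidth_limit  # bytes per second for this peer
        self.clock = clock
        self.sleep = sleep
        self.cycle_start = clock()
        self.pushed_in_cycle = 0

    def before_push(self, batch_size):
        now = self.clock()
        if now - self.cycle_start >= 1.0:
            # a new one-second cycle: reset the budget
            self.cycle_start, self.pushed_in_cycle = now, 0
        if self.pushed_in_cycle + batch_size > self.bandwidth_limit:
            # budget exhausted: pause until the current cycle ends, then reset
            self.sleep(max(0.0, 1.0 - (now - self.cycle_start)))
            self.cycle_start, self.pushed_in_cycle = self.clock(), 0
        self.pushed_in_cycle += batch_size
```

Within the limit a source can still use large batches, as the issue suggests; the throttler only inserts pauses between batches.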
[jira] [Updated] (HBASE-9500) RpcServer can't be restarted after stop for reuse
[ https://issues.apache.org/jira/browse/HBASE-9500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Honghua updated HBASE-9500:
Component/s: IPC/RPC

RpcServer can't be restarted after stop for reuse
Key: HBASE-9500 URL: https://issues.apache.org/jira/browse/HBASE-9500 Project: HBase Issue Type: Improvement Components: IPC/RPC Reporter: Feng Honghua Priority: Minor

Currently RpcServer is designed/implemented without an interface or capability for the user to restart a stopped RpcServer.
[jira] [Updated] (HBASE-9469) Synchronous replication
[ https://issues.apache.org/jira/browse/HBASE-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Honghua updated HBASE-9469:
Assignee: Feng Honghua

Synchronous replication
Key: HBASE-9469 URL: https://issues.apache.org/jira/browse/HBASE-9469 Project: HBase Issue Type: New Feature Reporter: Feng Honghua Assignee: Feng Honghua

Scenario: clusters A and B with master-master replication; clients write to cluster A, A pushes all writes to cluster B, and when A goes down, clients switch to writing to B. But this write switch is unsafe because replication between A and B is asynchronous: a delete sent to B that aims to remove an earlier put can fail to do so when that put was written to cluster A but not successfully pushed to B before A went down. It gets worse if the delete is collected on B (a flush and then a major compaction occur) before A comes back up and the put is eventually pushed to B: the put will never be deleted. Can we provide per-table/per-peer synchronous replication, which ships the corresponding hlog entry of a write before responding success to the client? That would guarantee the client that every write request acknowledged as successful on cluster A is already present on cluster B as well.
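The ship-before-ack idea can be sketched as a write path that acknowledges the client only after the peer confirms the hlog entry (all names here are hypothetical; a real implementation would also need timeouts and failure handling):

```python
class SyncReplicatingRegion:
    """Hypothetical sketch: ack a write only after the peer stores its hlog entry."""
    def __init__(self, push_to_peer):
        self.wal = []                      # local hlog (write-ahead log)
        self.push_to_peer = push_to_peer   # callable, returns True on peer success

    def write(self, entry):
        self.wal.append(entry)             # 1. persist locally first
        if not self.push_to_peer(entry):   # 2. ship the hlog entry synchronously
            raise IOError("peer did not confirm; write not acknowledged")
        return "success"                   # 3. only now respond success to the client
```

The guarantee follows directly: any write the client sees acknowledged has already been appended on the peer, so a later failover to the peer cannot lose it.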
[jira] [Updated] (HBASE-9469) Synchronous replication
[ https://issues.apache.org/jira/browse/HBASE-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Honghua updated HBASE-9469:
Priority: Major (was: Minor)

Synchronous replication
Key: HBASE-9469 URL: https://issues.apache.org/jira/browse/HBASE-9469
[jira] [Commented] (HBASE-8751) Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster
[ https://issues.apache.org/jira/browse/HBASE-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762885#comment-13762885 ] Feng Honghua commented on HBASE-8751:
- [~jdcryans] would you please help review this patch? Thanks.

Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster
Key: HBASE-8751 URL: https://issues.apache.org/jira/browse/HBASE-8751 Project: HBase Issue Type: Improvement Components: Replication Reporter: Feng Honghua Attachments: HBASE-8751-0.94-V0.patch

Consider these scenarios (all CFs have replication-scope=1):
1) cluster S has 3 tables: table A with cfA,cfB; table B with cfX,cfY; table C with cf1,cf2.
2) cluster X wants to replicate table A : cfA, table B : cfX and table C from cluster S.
3) cluster Y wants to replicate table B : cfY and table C : cf2 from cluster S.
The current replication implementation can't achieve this, since it pushes the data of all replicatable column families from cluster S to all of its peers (X and Y in this scenario). This improvement provides a fine-grained replication scheme that enables a peer cluster to choose the column families/tables it really wants from the source cluster:
A) Set the table:cf list for a peer when adding it: hbase-shell add_peer '3', zk:1100:/hbase, table1; table2:cf1,cf2; table3:cf2
B) View the table:cf list config for a peer using show_peer_tableCFs: hbase-shell show_peer_tableCFs 1
C) Change/set the table:cf list for a peer using set_peer_tableCFs: hbase-shell set_peer_tableCFs '2', table1:cfX; table2:cf1; table3:cf1,cf2
In this scheme, replication-scope=1 only means a column family CAN be replicated to other clusters; the table:cf list determines WHICH tables/column families are actually replicated to a specific peer. For backward compatibility, an empty table:cf list replicates all replicatable tables/column families. (This means we don't allow a peer that replicates nothing from a source cluster, which we think is reasonable: if it replicates nothing, why bother adding the peer?) This improvement addresses exactly the problem raised by the first FAQ at http://hbase.apache.org/replication.html: "GLOBAL means replicate? Any provision to replicate only to cluster X and not to cluster Y? or is that for later? Yes, this is for much later." I also noticed somebody mentioned making replication-scope an integer rather than a boolean for such fine-grained replication, but I think extending replication-scope can't achieve the same replication granularity and flexibility as the per-peer configuration above. This improvement has been running smoothly in our production clusters (Xiaomi) for several months.
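The 'table1; table2:cf1,cf2; table3:cf2' syntax shown above can be parsed into a per-table column-family map; a minimal sketch (my own parser, not the patch's code), where an empty CF list for a table means "replicate all CFs of that table", matching the back-compatibility rule described:

```python
def parse_table_cfs(spec):
    """Parse 'table1; table2:cf1,cf2' into {table: [cf, ...]}; [] means all CFs."""
    table_cfs = {}
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue
        if ":" in part:
            table, cfs = part.split(":", 1)
            table_cfs[table.strip()] = [cf.strip() for cf in cfs.split(",") if cf.strip()]
        else:
            table_cfs[part] = []  # no CF list given: replicate every CF of this table
    return table_cfs
```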
[jira] [Commented] (HBASE-9468) Previous active master can still serves RPC request when it is trying recovering expired zk session
[ https://issues.apache.org/jira/browse/HBASE-9468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763937#comment-13763937 ] Feng Honghua commented on HBASE-9468:
- [~stack] RpcServer currently can't be turned back on after it is taken down. We provide a config that determines whether to fail fast for an expired master, defaulting to false, which can be used for a small cluster without a backup master. Opinions?

Previous active master can still serves RPC request when it is trying recovering expired zk session
Key: HBASE-9468 URL: https://issues.apache.org/jira/browse/HBASE-9468 Project: HBase Issue Type: Bug Reporter: Feng Honghua

When the active master's zk session expires, it tries to recover the zk session, but without turning off its RpcServer. What if a previous backup master has already become the active master, and some client sends requests to the expired master using cached master info? Any problem here?
[jira] [Updated] (HBASE-9468) Previous active master can still serves RPC request when it is trying recovering expired zk session
[ https://issues.apache.org/jira/browse/HBASE-9468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Honghua updated HBASE-9468:
Attachment: HBASE-9468-trunk-v0.patch (patch for trunk)

Previous active master can still serves RPC request when it is trying recovering expired zk session
Key: HBASE-9468 URL: https://issues.apache.org/jira/browse/HBASE-9468
[jira] [Updated] (HBASE-9468) Previous active master can still serves RPC request when it is trying recovering expired zk session
[ https://issues.apache.org/jira/browse/HBASE-9468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Honghua updated HBASE-9468:
Assignee: Feng Honghua

Previous active master can still serves RPC request when it is trying recovering expired zk session
Key: HBASE-9468 URL: https://issues.apache.org/jira/browse/HBASE-9468
[jira] [Created] (HBASE-9464) master failure during region-move can result in the region moved to a different RS rather than the destination one user specified
Feng Honghua created HBASE-9464:
Summary: master failure during region-move can result in the region moved to a different RS rather than the destination one user specified
Key: HBASE-9464 URL: https://issues.apache.org/jira/browse/HBASE-9464 Project: HBase Issue Type: Bug Components: master Reporter: Feng Honghua Priority: Minor

1. The user issues a region-move, specifying a destination RS.
2. The master finishes offlining the region.
3. The master fails before assigning it to the specified destination RS.
4. The new master assigns the region to a random RS, since it doesn't have the destination RS info.
[jira] [Created] (HBASE-9465) HLog entries are not pushed to peer clusters serially when region-move or RS failure in master cluster
Feng Honghua created HBASE-9465:
Summary: HLog entries are not pushed to peer clusters serially when region-move or RS failure in master cluster
Key: HBASE-9465 URL: https://issues.apache.org/jira/browse/HBASE-9465 Project: HBase Issue Type: Bug Components: regionserver, Replication Reporter: Feng Honghua

When a region-move or RS failure occurs in the master cluster, the hlog entries not yet pushed at that point are pushed by the original RS (for a region move) or by another RS that takes over the remaining hlog of the dead RS (for an RS failure), while the new entries for the same region(s) are pushed by the RS that now serves those regions. These sources push the hlog entries of the same region concurrently, without coordination. This can lead to data inconsistency between the master and peer clusters:
1. A put and then a delete are written to the master cluster.
2. Due to the region-move / RS failure, they are pushed to the peer cluster by different replication-source threads.
3. If the delete is pushed to the peer cluster before the put, and a flush and major compaction occur on the peer before the put arrives, the delete is collected and the put remains on the peer.
In this scenario the put remains in the peer cluster, while in the master cluster it is masked by the delete: data inconsistency between the master and peer clusters.
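The reordering hazard above can be illustrated with a toy peer store where a delete masks a put only while the delete marker exists (a deliberately simplified model; real HBase masks by timestamp and version):

```python
class ToyPeerStore:
    """Toy model of the peer cluster's view of a single column."""
    def __init__(self):
        self.puts = set()
        self.delete_markers = set()

    def apply(self, op, key):
        (self.puts if op == "put" else self.delete_markers).add(key)

    def major_compact(self):
        # drop puts masked by delete markers, then collect the markers themselves
        self.puts -= self.delete_markers
        self.delete_markers.clear()

    def get(self, key):
        return key in self.puts and key not in self.delete_markers

# Out-of-order push, as in the scenario above: the delete (pushed by the RS now
# serving the region) arrives first, the peer compacts, then the older put
# (pushed from the old hlog) arrives.
peer = ToyPeerStore()
peer.apply("delete", "row1")
peer.major_compact()        # flush + major compact collects the delete marker
peer.apply("put", "row1")   # the put survives forever on the peer
```

On the master the put is masked by the later delete, so `get` would return nothing there; on the peer it remains visible, which is exactly the inconsistency the issue describes.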
[jira] [Created] (HBASE-9466) Read-only mode
Feng Honghua created HBASE-9466:
Summary: Read-only mode
Key: HBASE-9466 URL: https://issues.apache.org/jira/browse/HBASE-9466 Project: HBase Issue Type: New Feature Reporter: Feng Honghua Priority: Minor

Can we provide a read-only mode for a table? Writes to a table in read-only mode would be rejected, but read-only mode differs from disabling the table in that:
1. it doesn't offline the regions of the table (hence it is much more lightweight than disable);
2. the table can still serve read requests.
Comments?
[jira] [Created] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region
Feng Honghua created HBASE-9467:
Summary: write can be totally blocked temporarily by a write-heavy region
Key: HBASE-9467 URL: https://issues.apache.org/jira/browse/HBASE-9467 Project: HBase Issue Type: Improvement Reporter: Feng Honghua Priority: Minor
[jira] [Created] (HBASE-9468) Previous active master can still serves RPC request when it is trying recovering expired zk session
Feng Honghua created HBASE-9468:
Summary: Previous active master can still serves RPC request when it is trying recovering expired zk session
Key: HBASE-9468 URL: https://issues.apache.org/jira/browse/HBASE-9468 Project: HBase Issue Type: Bug Reporter: Feng Honghua
[jira] [Created] (HBASE-9469) Synchronous replication
Feng Honghua created HBASE-9469:
Summary: Synchronous replication
Key: HBASE-9469 URL: https://issues.apache.org/jira/browse/HBASE-9469 Project: HBase Issue Type: New Feature Reporter: Feng Honghua Priority: Minor
[jira] [Commented] (HBASE-9464) master failure during region-move can result in the region moved to a different RS rather than the destination one user specified
[ https://issues.apache.org/jira/browse/HBASE-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762642#comment-13762642 ] Feng Honghua commented on HBASE-9464:
- From the perspective of the user who issued the region-move request, the region is moved to a different RS from the one he specified, even though the destination RS he specified is healthy. The root cause is that the RegionPlan containing the destination RS info is kept only in the master's memory, without persistence, so a new active master doesn't know it when taking over the active-master role.

master failure during region-move can result in the region moved to a different RS rather than the destination one user specified
Key: HBASE-9464 URL: https://issues.apache.org/jira/browse/HBASE-9464
[jira] [Commented] (HBASE-9465) HLog entries are not pushed to peer clusters serially when region-move or RS failure in master cluster
[ https://issues.apache.org/jira/browse/HBASE-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762658#comment-13762658 ] Feng Honghua commented on HBASE-9465:
- [~jdcryans] "I have a draft for a new piece of documentation that we could add to the ref guide that I should probably contribute" -- where can I read this documentation? Thanks.

HLog entries are not pushed to peer clusters serially when region-move or RS failure in master cluster
Key: HBASE-9465 URL: https://issues.apache.org/jira/browse/HBASE-9465
[jira] [Commented] (HBASE-9465) HLog entries are not pushed to peer clusters serially when region-move or RS failure in master cluster
[ https://issues.apache.org/jira/browse/HBASE-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762665#comment-13762665 ] Feng Honghua commented on HBASE-9465:
- [~lhofhansl] For the RS-failure scenario, can we delay the assignment of the recovered regions until all the remaining hlog files of the failed RS have been pushed to the peer clusters (the hlog split can still run in parallel with the hlog push)? This way we maintain a (globally) serial push of a region's hlog entries even in the face of an RS failure. But for a region-move it's harder to maintain a globally serial push, since it's harder to determine that all hlog entries of a given region have been pushed to the peer clusters while the hosting RS is healthy and continuously receiving write requests.

HLog entries are not pushed to peer clusters serially when region-move or RS failure in master cluster
Key: HBASE-9465 URL: https://issues.apache.org/jira/browse/HBASE-9465
[jira] [Commented] (HBASE-9466) Read-only mode
[ https://issues.apache.org/jira/browse/HBASE-9466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762675#comment-13762675 ] Feng Honghua commented on HBASE-9466:
- [~jdcryans] Yes, I'm proposing a read-only mode (per-table or per-cluster) rather than disabling the table; the latter is pretty heavyweight in that it needs to offline all regions of the table. If we just want to temporarily disallow updates to a table or to the whole cluster, and later enable updates again, disabling a table or all tables of the cluster seems a quite heavy choice. Thanks for pointing me to the read-only interface of HTableDescriptor in trunk, but I don't see any code using it; how is this RO flag expected to work?

Read-only mode
Key: HBASE-9466 URL: https://issues.apache.org/jira/browse/HBASE-9466
[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region
[ https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762680#comment-13762680 ] Feng Honghua commented on HBASE-9467:
- [~nkeywal] Can we provide a percentage config that says how big a subset of the handler threads any region's requests may use? For any region we can hash its region name to a deterministic start index into the handler-thread array, and the percentage config, together with the total handler-thread count, determines the size of the handler-thread sub-array that serves this region's requests. This way any region at worst saturates only its own subset of handler threads, rather than all of them, which somewhat mitigates the symptom.

write can be totally blocked temporarily by a write-heavy region
Key: HBASE-9467 URL: https://issues.apache.org/jira/browse/HBASE-9467
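The subset scheme from the comment above — hash the region name to a start index, size the sub-array from a percentage config — can be sketched as follows (the function name, the CRC32 choice, and the wraparound are my assumptions, not the proposal's exact design):

```python
import zlib

def handler_subset(region_name, total_handlers, percentage):
    """Indices of the handler threads allowed to serve this region's requests."""
    count = max(1, int(total_handlers * percentage))          # subset size, at least 1
    start = zlib.crc32(region_name.encode()) % total_handlers # deterministic start index
    return [(start + i) % total_handlers for i in range(count)]  # wrap around the array
```

With, say, 10 handlers and a 30% cap, a single write-blocked region can stall at most 3 handlers, leaving the other 7 free for other regions and tables.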
[jira] [Commented] (HBASE-9468) Previous active master can still serves RPC request when it is trying recovering expired zk session
[ https://issues.apache.org/jira/browse/HBASE-9468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762683#comment-13762683 ] Feng Honghua commented on HBASE-9468:
- It sounds like failing fast for the expired master is a quick and safe fix for this issue. Any opinions?

Previous active master can still serves RPC request when it is trying recovering expired zk session
Key: HBASE-9468 URL: https://issues.apache.org/jira/browse/HBASE-9468
[jira] [Commented] (HBASE-9468) Previous active master can still serves RPC request when it is trying recovering expired zk session
[ https://issues.apache.org/jira/browse/HBASE-9468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762684#comment-13762684 ] Feng Honghua commented on HBASE-9468:
- [~enis] I agree with you :-) Not sure whether there is any further concern about the recovery logic for an expired master.

Previous active master can still serves RPC request when it is trying recovering expired zk session
Key: HBASE-9468 URL: https://issues.apache.org/jira/browse/HBASE-9468
[jira] [Commented] (HBASE-9469) Synchronous replication
[ https://issues.apache.org/jira/browse/HBASE-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762686#comment-13762686 ] Feng Honghua commented on HBASE-9469: - [~jdcryans] and [~lhofhansl]: Any plans for synchronous replication? This would be a really nice feature for applications requiring strict data safety/consistency across clusters. Synchronous replication --- Key: HBASE-9469 URL: https://issues.apache.org/jira/browse/HBASE-9469 Project: HBase Issue Type: New Feature Reporter: Feng Honghua Priority: Minor Scenario: A/B clusters with master-master replication; the client writes to cluster A, A pushes all writes to cluster B, and when cluster A goes down, the client switches to writing to cluster B. But this write switch is unsafe because the replication between A and B is asynchronous: a delete issued to cluster B which aims to delete an earlier put can fail because that put was written to cluster A but not yet pushed to B before A went down. It can be worse: if this delete is collected (a flush and then a major compaction occurs) before cluster A comes back up and the put is eventually pushed to B, the put will never be deleted. Can we provide per-table/per-peer synchronous replication which ships the corresponding hlog entry of a write before responding with write success to the client? This would guarantee that all write requests for which the client got a success response from cluster A are already in cluster B as well.
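To make the proposed ack semantics concrete, here is a minimal toy sketch of the write path described above (all class and field names are hypothetical, and the two WALs are plain lists standing in for real clusters; this is not HBase code): for a table configured as synchronous, the server ships the edit to the peer before acknowledging the client, so every acknowledged write exists on both sides.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of per-table synchronous replication (hypothetical sketch,
// not HBase code). For a "sync" table, the hlog entry is shipped to the
// peer cluster before the client gets a success response.
public class SyncReplicationSketch {
    final List<String> localWal = new ArrayList<>();
    final List<String> peerWal  = new ArrayList<>();

    // Returns success only after the edit is durable locally and, for
    // sync tables, also shipped to the peer cluster.
    boolean put(String edit, boolean syncReplication) {
        localWal.add(edit);            // local hlog append + sync
        if (syncReplication) {
            peerWal.add(edit);         // ship hlog entry before responding
        }
        return true;
    }

    public static void main(String[] args) {
        SyncReplicationSketch rs = new SyncReplicationSketch();
        rs.put("put:row1", true);      // sync table: on both WALs before ack
        rs.put("put:row2", false);     // async table: peer lags behind
        System.out.println(rs.localWal.size() + " local, "
                + rs.peerWal.size() + " peer");   // prints "2 local, 1 peer"
    }
}
```

The key point the sketch captures is the ordering: for sync tables the peer append happens strictly before the ack, which is exactly the guarantee the asynchronous model cannot give during an A-to-B failover.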
[jira] [Commented] (HBASE-9468) Previous active master can still serve RPC requests while trying to recover an expired zk session
[ https://issues.apache.org/jira/browse/HBASE-9468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762754#comment-13762754 ] Feng Honghua commented on HBASE-9468: - [~stack] OK, I'll provide a patch according to the comment.
[jira] [Commented] (HBASE-9469) Synchronous replication
[ https://issues.apache.org/jira/browse/HBASE-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762755#comment-13762755 ] Feng Honghua commented on HBASE-9469: - [~lhofhansl] Yes, the better data safety/consistency of synchronous replication comes at the cost of higher latency. Maybe it's more acceptable to make it per-peer/per-table configurable; let me try to provide a patch accordingly.
[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13761563#comment-13761563 ] Feng Honghua commented on HBASE-8755: - Thanks [~stack]. Looking forward to your test result on hdfs. A new write thread model for HLog to improve the overall HBase write throughput --- Key: HBASE-8755 URL: https://issues.apache.org/jira/browse/HBASE-8755 Project: HBase Issue Type: Improvement Components: Performance, wal Reporter: Feng Honghua Assignee: stack Priority: Critical Fix For: 0.96.1 Attachments: 8755trunkV2.txt, HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, HBASE-8755-trunk-V0.patch, HBASE-8755-trunk-V1.patch In the current write model, each write handler thread (executing put()) individually goes through a full 'append (hlog local buffer) => HLog writer append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, which incurs heavy contention on updateLock and flushLock. The only optimization, checking the current syncTillHere txid in the expectation that some other thread has already written/synced its txid to hdfs so the write/sync can be omitted, actually helps much less than expected. Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi proposed a new write thread model for writing hdfs sequence files, and the prototype implementation shows a 4X throughput improvement (from 17000 to 7+). I applied this new write thread model to HLog, and the performance test in our test cluster shows about a 3X throughput improvement (from 12150 to 31520 for 1 RS, from 22000 to 7 for 5 RS); the 1 RS write throughput (1K row-size) even beats that of BigTable (the Percolator paper published in 2011 says Bigtable's write throughput then was 31002). I can provide the detailed performance test results if anyone is interested.
The changes in the new write thread model are as below: 1. All put handler threads append their edits to HLog's local pending buffer (and notify the AsyncWriter thread that there are new edits in the local buffer); 2. All put handler threads wait in the HLog.syncer() function for the underlying threads to finish the sync that covers their txid; 3. A single AsyncWriter thread is responsible for retrieving all the buffered edits from HLog's local pending buffer and writing them to hdfs (hlog.writer.append), then notifying the AsyncFlusher thread that there are new writes to hdfs that need a sync; 4. A single AsyncFlusher thread is responsible for issuing a sync to hdfs to persist the writes made by the AsyncWriter, then notifying the AsyncNotifier thread that the sync watermark has increased; 5. A single AsyncNotifier thread is responsible for notifying all pending put handler threads waiting in the HLog.syncer() function; 6. There is no LogSyncer thread any more (the AsyncWriter/AsyncFlusher threads already do the same job it did).
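The steps above can be sketched in a self-contained toy program (hypothetical class and method names; this is a drastically simplified model, not the actual HLog patch, and the separate AsyncWriter/AsyncFlusher/AsyncNotifier roles are collapsed into one drain step): handler threads append to a shared pending buffer and block in syncer(), while a single background thread drains the buffer in batches and advances the synced-txid watermark.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the proposed batched WAL write pipeline
// (hypothetical names, not the actual HBase code).
public class AsyncWalSketch {
    private final List<Long> pendingBuffer = new ArrayList<>();
    private long nextTxid = 0;
    private long syncedTxid = -1;
    volatile boolean running = true;

    // Called by put handler threads: buffer the edit, wake the writer.
    public synchronized long append() {
        long txid = nextTxid++;
        pendingBuffer.add(txid);
        notifyAll();                // AsyncWriter notification
        return txid;
    }

    // Handler threads block here until their txid is covered by a sync.
    public synchronized void syncer(long txid) throws InterruptedException {
        while (syncedTxid < txid) {
            wait();
        }
    }

    // Writer+flusher+notifier roles collapsed: drain the whole buffer,
    // pretend hlog.writer.append + sync happened, advance the watermark.
    synchronized void drainOnce() throws InterruptedException {
        while (running && pendingBuffer.isEmpty()) {
            wait();
        }
        if (!pendingBuffer.isEmpty()) {
            long highest = pendingBuffer.get(pendingBuffer.size() - 1);
            pendingBuffer.clear();
            syncedTxid = highest;
            notifyAll();            // wake all pending handler threads
        }
    }

    public static void main(String[] args) throws Exception {
        AsyncWalSketch wal = new AsyncWalSketch();
        Thread writer = new Thread(() -> {
            try {
                while (wal.running) wal.drainOnce();
            } catch (InterruptedException ignored) { }
        });
        writer.start();

        Thread[] handlers = new Thread[4];   // simulate 4 put handlers
        for (int i = 0; i < handlers.length; i++) {
            handlers[i] = new Thread(() -> {
                try {
                    long txid = wal.append();
                    wal.syncer(txid);        // returns after the batch sync
                } catch (InterruptedException ignored) { }
            });
            handlers[i].start();
        }
        for (Thread h : handlers) h.join();
        synchronized (wal) { wal.running = false; wal.notifyAll(); }
        writer.join();
        System.out.println("synced up to txid 3");
    }
}
```

The point of the model is that one batch sync can cover many handler txids at once, so the per-write lock contention of the old append/sync cycle disappears.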
[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13721973#comment-13721973 ] Feng Honghua commented on HBASE-8755: - [~jmspaggi], thanks for your test. Some questions about it: Is it against real HDFS? How many data-nodes and RS? What's the write pressure (client number, write thread number)? What total throughput do you get? Yes, this jira aims at throughput improvement under write-intensive load, so it should be tested and verified under write-intensive load against a real cluster / HDFS environment. And as you can see, this jira only refactors the write thread model rather than tuning any write sub-phase along the write path for an individual write request, so no obvious improvement is expected under low/ordinary write pressure. If you have a real cluster environment with 4 data-nodes, it would be better to re-do the test chunhui/I did, with the similar test configuration/load listed in detail in the comments above. 1 client with 200 write threads is OK for pressing a single RS, and 4 clients each with 200 write threads for pressing 4 RS. Thanks again.
[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13718270#comment-13718270 ] Feng Honghua commented on HBASE-8755: - [~jmspaggi] HBASE-8755-0.94-V1.patch is good. Let me know if there is any problem. Thanks Jean-Marc.
[jira] [Updated] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Honghua updated HBASE-8755: Attachment: HBASE-8755-trunk-V1.patch Updated patch rebased on the latest trunk code base.
[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708473#comment-13708473 ] Feng Honghua commented on HBASE-8755: - Thanks a lot [~jmspaggi], looking forward to your result.
[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708189#comment-13708189 ] Feng Honghua commented on HBASE-8755: - [~jmspaggi], what's your result of running YCSB against the real cluster environment?
[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have the same designated timestamp
[ https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701813#comment-13701813 ] Feng Honghua commented on HBASE-8753: - [~lhofhansl] Your comment #2 is covered by this line of code: if (!hasFamilyStamp || timestamp familyStamp) { Provide new delete flag which can delete all cells under a column-family which have the same designated timestamp --- Key: HBASE-8753 URL: https://issues.apache.org/jira/browse/HBASE-8753 Project: HBase Issue Type: New Feature Components: Deletes, Scanners Affects Versions: 0.95.1 Reporter: Feng Honghua Assignee: Feng Honghua Attachments: 8753-trunk-V2.patch, HBASE-8753-0.94-V0.patch, HBASE-8753-trunk-V0.patch, HBASE-8753-trunk-V1.patch In one of our production scenarios (Xiaomi message search), multiple cells are put in batch using the same timestamp, with different column names, under a specific column-family. After some time, these cells also need to be deleted in batch given that specific timestamp. But the column names are parsed tokens which can be arbitrary words, so such a batch delete is impossible without first retrieving all KVs from that CF, building the list of columns that have a KV with the given timestamp, and then issuing an individual deleteColumn for each column in that list. Though it's possible to do such a batch delete this way, its performance is poor, and customers also find their code quite clumsy: first retrieving and populating the column list, then issuing a deleteColumn per column. This feature resolves the problem by introducing a new delete flag: DeleteFamilyVersion. 1). When you need to delete all KVs under a column-family with a given timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a DeleteFamilyVersion type KV is put to HBase (like DeleteFamily / DeleteColumn / Delete), without any read operation; 2).
Like other delete types, DeleteFamilyVersion takes effect in get/scan/flush/compact operations: the ScanDeleteTracker now parses out and uses DeleteFamilyVersion to prevent all KVs under the specific CF with the same timestamp as the DeleteFamilyVersion KV from showing up in a get/scan result (and likewise in flush/compact). Our customers find this feature efficient, clean and easy to use, since it does its work without knowing the exact list of column names that need to be deleted. This feature has been running smoothly for a couple of months in our production clusters.
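The masking rule can be illustrated with a small self-contained sketch (a hypothetical, drastically simplified stand-in for what ScanDeleteTracker does, using {qualifierId, timestamp} pairs instead of real KVs): a DeleteFamilyVersion marker with timestamp T hides every cell of the family whose timestamp is exactly T, whatever the qualifier.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of DeleteFamilyVersion masking semantics (hypothetical,
// simplified; not the actual ScanDeleteTracker code).
public class DeleteFamilyVersionSketch {
    // Each cell is {qualifierId, timestamp}; deleteTs is the marker's ts.
    static List<long[]> visibleCells(List<long[]> cells, long deleteTs) {
        List<long[]> visible = new ArrayList<>();
        for (long[] cell : cells) {
            if (cell[1] != deleteTs) {   // masked only on exact ts match
                visible.add(cell);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<long[]> cells = new ArrayList<>();
        cells.add(new long[] {1, 100});  // qualifier 1, ts 100
        cells.add(new long[] {2, 100});  // qualifier 2, ts 100
        cells.add(new long[] {3, 200});  // qualifier 3, ts 200
        // One DeleteFamilyVersion marker at ts=100 masks the first two.
        System.out.println(visibleCells(cells, 100).size());  // prints 1
    }
}
```

This is exactly the property that makes the feature useful for the batched-put scenario: one marker per (family, timestamp) replaces a read plus one deleteColumn per arbitrary qualifier.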
[jira] [Updated] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have the same designated timestamp
[ https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Honghua updated HBASE-8753: Attachment: HBASE-8753-trunk-V3.patch HBASE-8753-0.94-V1.patch
[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have the same designated timestamp
[ https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13702008#comment-13702008 ] Feng Honghua commented on HBASE-8753: - Thanks [~yuzhih...@gmail.com] and [~lhofhansl]
[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp
[ https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701465#comment-13701465 ] Feng Honghua commented on HBASE-8753: - [~lhofhansl]: Your #1/#2 comments are good. For #3, your concern is correct. Since the ScanDeleteTracker is reset/cleared (all delete markers) after a row is done, the number of family-version markers that needs to be tracked at any time during a scan is the number of family-version markers put to the current row, not across all rows, and it does not accumulate over the whole scan. Though in theory a client can put an arbitrary number of family-version markers to a specific row/CF, in practice this number is expected to be at most the number of distinct versions (timestamps) among all cells that the row/CF contains.
[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688936#comment-13688936 ] Feng Honghua commented on HBASE-8755: - [~zjushch] We ran the same tests as yours, and below are the results: 1). One YCSB client with 5/50/200 write threads respectively 2). One RS with 300 RPC handlers, 20 regions (5 data-nodes back-end HDFS running CDH 4.1.1) 3). row-size = 150 bytes
||threads ||row-count ||new-model throughput ||new-model latency ||old-model throughput ||old-model latency||
|5 |20 |3191 |1.551(ms) |3172 |1.561(ms)|
|50 |200 |23215 |2.131(ms) |7437 |6.693(ms)|
|200 |200 |35793 |5.450(ms) |10816 |18.312(ms)|
A). the difference is negligible with 5 YCSB client threads B). the new model still has a 3X+ improvement over the old model with 50/200 threads. Can anybody else help run similar tests using the same test configuration as Chunhui? A new write thread model for HLog to improve the overall HBase write throughput --- Key: HBASE-8755 URL: https://issues.apache.org/jira/browse/HBASE-8755 Project: HBase Issue Type: Improvement Components: wal Reporter: Feng Honghua Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, HBASE-8755-trunk-V0.patch In the current write model, each write handler thread (executing put()) individually goes through a full 'append (hlog local buffer) => HLog writer append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, which incurs heavy contention on updateLock and flushLock. The only optimization, checking whether the current syncTillHere already covers a txid in the hope that another thread has written/synced it so the write/sync can be omitted, actually helps much less than expected. Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi proposed a new write thread model for writing hdfs sequence files, and the prototype implementation shows a 4X throughput improvement (from 17000 to 7+).
I applied this new write thread model in HLog, and the performance test in our test cluster shows about 3X throughput improvement (from 12150 to 31520 for 1 RS, from 22000 to 7 for 5 RS); the 1 RS write throughput (1K row-size) even beats that of BigTable (the Percolator paper published in 2011 says Bigtable's write throughput then was 31002). I can provide the detailed performance test results if anyone is interested. The changes for the new write thread model are as below:
1. All put handler threads append their edits to HLog's local pending buffer (and notify the AsyncWriter thread that there are new edits in the local buffer);
2. All put handler threads wait in the HLog.syncer() function for the underlying threads to finish the sync that covers their txid;
3. A single AsyncWriter thread is responsible for retrieving all the buffered edits from HLog's local pending buffer and writing them to hdfs (hlog.writer.append); it notifies the AsyncFlusher thread that there are new writes to hdfs that need a sync;
4. A single AsyncFlusher thread is responsible for issuing a sync to hdfs to persist the writes made by AsyncWriter; it notifies the AsyncNotifier thread that the sync watermark has increased;
5. A single AsyncNotifier thread is responsible for notifying all pending put handler threads that are waiting in the HLog.syncer() function;
6. No LogSyncer thread any more (the AsyncWriter/AsyncFlusher threads always do the job it did).
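The steps above can be sketched as a toy Python model. This is not HBase's real HLog: MiniHLog and all its names are hypothetical, and the three async roles (writer/flusher/notifier) are collapsed into one drainer thread to keep the sketch short; the key property is the same, one batched append+sync instead of one per handler, with waiters woken once the sync watermark passes their txid.

```python
import threading

class MiniHLog:
    """Toy sketch of the batched write pipeline: handlers append an edit
    and wait; one background thread drains the buffer, "writes"/"syncs"
    the whole batch, and wakes every waiter covered by the sync."""

    def __init__(self):
        self.lock = threading.Condition()
        self.buffer = []          # pending (txid, edit) pairs
        self.next_txid = 0
        self.synced_txid = -1     # sync watermark
        self.log = []             # stands in for the HDFS file
        self.closed = False
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def append_and_sync(self, edit):
        with self.lock:
            txid = self.next_txid
            self.next_txid += 1
            self.buffer.append((txid, edit))
            self.lock.notify_all()          # wake the drainer thread
            while self.synced_txid < txid:  # wait in "HLog.syncer()"
                self.lock.wait()
        return txid

    def _drain(self):
        while True:
            with self.lock:
                while not self.buffer and not self.closed:
                    self.lock.wait()
                if self.closed and not self.buffer:
                    return
                batch, self.buffer = self.buffer, []
            # one writer.append + one sync per batch, outside the lock
            self.log.extend(edit for _, edit in batch)
            with self.lock:
                self.synced_txid = batch[-1][0]
                self.lock.notify_all()      # wake waiting handlers

    def close(self):
        with self.lock:
            self.closed = True
            self.lock.notify_all()
        self.worker.join()

hlog = MiniHLog()
threads = [threading.Thread(target=hlog.append_and_sync, args=(f"edit-{i}",))
           for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()
hlog.close()
print(sorted(hlog.log))  # prints all 8 edits, in txid order once sorted
```

Under concurrent load the drainer picks up many edits per pass, so several handlers share one append/sync round trip, which is where the throughput gain in the numbers above comes from.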
[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688957#comment-13688957 ] Feng Honghua commented on HBASE-8755: - [~zjushch]: We ran the same tests as yours, and below are the results: 1). One YCSB client with 5/50/200 write threads respectively 2). One RS with 300 RPC handlers, 20 regions (5 data-nodes back-end HDFS running CDH 4.1.1) 3). row-size = 150 bytes
||client-threads ||row-count ||new-model throughput ||new-model latency ||old-model throughput ||old-model latency||
|5 |20 |3191 |1.551(ms) |3172 |1.561(ms)|
|50 |200 |23215 |2.131(ms) |7437 |6.693(ms)|
|200 |200 |35793 |5.450(ms) |10816 |18.312(ms)|
A). the difference is negligible with 5 YCSB client threads B). the new model still has a 3X+ improvement over the old model with 50/200 threads. Can anybody else help run the tests using the same configuration as Chunhui? Another guess is that the HDFS used by Chunhui has much better performance on HLog's write/sync, which makes the new model in HBase have less impact. Just a guess.
[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688973#comment-13688973 ] Feng Honghua commented on HBASE-8755: - Our comparison tests differ only in the RS bits; everything else (client/HDFS/cluster/row-size...) remains the same. The client runs on a different machine from the RS; we don't run the client on the RS because almost all our applications using HBase run on their own machines, separate from the HBase cluster. Actually we have never seen as high a throughput as 18018/24691 for a single RS in our cluster. It's really weird :).
[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13689094#comment-13689094 ] Feng Honghua commented on HBASE-8755: - If possible, would anybody else help run the same comparison test as Chunhui and me? Thanks in advance. [~lhofhansl] [~yuzhih...@gmail.com] [~sershe] [~stack]
[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13689160#comment-13689160 ] Feng Honghua commented on HBASE-8721: - I list some merits of the behavior 'a delete can't mask puts that happen after the delete': 1) It avoids the inconsistency I mentioned above; with our patch the user can always read the put from step 4. It's more natural and intuitive: 1. put a kv (timestamp = T0), and flush; 2. delete that kv using a DeleteColumn type kv with timestamp T0 (or any timestamp >= T0), and flush; 3. a major compaction occurs [or not]; 4. put that kv again (timestamp = T0); 5. read that kv; === a) if a major compaction occurs at step 3, then step 5 gets the put written at step 4; b) if no major compaction occurs at step 3, then step 5 gets nothing. 2) It provides a strong guarantee for this operation: 'I don't know which or how many versions are in a cell; I just want to put a new version into it (removing all existing ones) and ensure only this new put is in the cell, regardless of the ts comparison with the old existing ones' (I think this operation/guarantee is useful in many scenarios). The current delete behavior can't provide such a guarantee. 3) 'delete latest version' (deleteColumn() without ts) can be tuned to remove the read (of the latest version's ts) during 'deleteColumn'. The current delete behavior can't be tuned to remove that read. 4) 'A new put can't be masked (made to disappear) by an old/existing delete' is itself a merit for many use cases / applications since it's more natural and intuitive. I have explained the old version/delete semantics many times to different customers, and without exception their first response is 'weird... why so?'
Per my understanding, contrary to [~lhofhansl] and [~sershe], 'timestamp' is just a long used to determine version ordering by the rule 'the bigger/later wins'; it just happens that a time-semantic timestamp is a long, and a new put with its 'current' timestamp has a bigger value, so in most cases new put versions knock out older ones. For many use cases the time semantics of 'timestamp' are enough for the real-world requirement, but by design that's not always the case, otherwise the timestamp wouldn't be exposed for the user to set explicitly. In a word, as long as the user knows 'timestamp' is only a long-typed dimension that determines version ordering by the rule 'the bigger wins', he can reason out the result of any operation sequence. In essence 'timestamp as a dimension for version ordering' is unrelated to delete semantics. -- I know my understanding is arguable for many folks, since the old delete semantics and behavior have existed for so long and everybody has already taken them for granted (I mean no offence here). At last I also list the downsides of the optional solutions proposed to me: A. KEEP_DELETED_CELLS is definitely a nice feature, but many users don't need it (to time-travel or trace back action history), and it prevents major compaction from shrinking the data set by collecting deleted cells. B. Disallowing users from explicitly setting the timestamp limits HBase's schema flexibility and prohibits many innovative designs such as Facebook's message search index, and in the end it can't guarantee unique timestamps, hence can still lead to tricky / confusing behavior.
Deletes can mask puts that happen after the delete -- Key: HBASE-8721 URL: https://issues.apache.org/jira/browse/HBASE-8721 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Feng Honghua Attachments: HBASE-8721-0.94-V0.patch This fix aims at the bug mentioned in http://hbase.apache.org/book.html 5.8.2.1: Deletes mask puts, even puts that happened after the delete was entered. Remember that a delete writes a tombstone, which only disappears after the next major compaction has run. Suppose you do a delete of everything <= T. After this you do a new put with a timestamp <= T. This put, even if it happened after the delete, will be masked by the delete tombstone. Performing the put will not fail, but when you do a get you will notice the put had no effect. It will start working again after the major compaction has run. These issues should not be a problem if you use always-increasing versions for new puts to a row. But they can occur even if you do not care about time: just do a delete and a put immediately after each other, and there is some chance they happen within the same millisecond.
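The behavior quoted above can be modeled in a few lines (a toy Python sketch of the tombstone rule, not HBase code; `visible_cells` is illustrative only):

```python
# Toy model of the tombstone rule the report quotes: a delete at
# timestamp T masks every put with timestamp <= T until a major
# compaction drops the tombstone.

def visible_cells(puts, deletes, major_compacted):
    """puts/deletes are lists of timestamps; after a major compaction the
    tombstones are gone and no longer considered."""
    if major_compacted:
        deletes = []
    return [ts for ts in puts if not any(ts <= d for d in deletes)]

T = 100
# put at T, delete <= T: the put is masked, as expected
assert visible_cells([T], deletes=[T], major_compacted=False) == []
# a *later* put that reuses timestamp T is still masked --
# the surprising behavior this issue complains about
assert visible_cells([T, T], deletes=[T], major_compacted=False) == []
# once a major compaction has removed the tombstone, the put reappears
assert visible_cells([T], deletes=[T], major_compacted=True) == [T]
```

The last two assertions show why results flip around a major compaction: the outcome depends on whether the tombstone still exists, not on the order the operations arrived.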
[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp
[ https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13689229#comment-13689229 ] Feng Honghua commented on HBASE-8753: - [~lhofhansl] For backwards compatibility: when an old RS processes a DeleteFamilyVersion type kv (either written by a new client, or in the two rolling-restart scenarios you mentioned), the DeleteFamilyVersion can enter the ScanDeleteTracker, and the only effect it has is this: when there is no DeleteColumn for the null column with the same timestamp as this DeleteFamilyVersion, the DeleteFamilyVersion can delete the KV (column=null) with the same timestamp (a bit like a Delete(DeleteVersion) with the same timestamp), and it has no other side-effect. In summary: DeleteFamilyVersion masks all the versions with a given timestamp under a CF, and when an old RS receives it (written by a new client, or in the two rolling-restart scenarios mentioned), the old RS treats it as a Delete(DeleteVersion) for the null column. Nothing else. I think this side-effect is acceptable. Your opinion?
[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13689935#comment-13689935 ] Feng Honghua commented on HBASE-8721: - [~lhofhansl] Now we're extending this to Puts as well, and are saying that a Put that hits the RegionServer later should be considered newer even if its TS is old, this opens another can of worms === Maybe you misunderstand me here; I never proposed 'a Put that hits the RegionServer later should be considered newer even if its TS is old'. The sequence 'put T3, put T2, put T1' (where T3 > T2 > T1) to a CF with max-versions = 2 will result in (T3, T2) with T3 as the first version, even though T1 is the last one to hit the RS; this is what I mean by 'timestamp is the only dimension which determines version ordering/survival, by the rule "the bigger wins"'. === What I proposed is this (possibly via a config, to give customers the option of this behavior): the delete masks (existing) puts with timestamps less than or equal to its own (unchanged), and customers can choose whether the delete also masks puts not yet written to HBase (future puts), according to their individual real-world application logic / requirements. KEEP_DELETED_CELLS would still work fine, but their main goal is to allow correct point-in-time queries, which among others is important for consistent backups === KEEP_DELETED_CELLS indeed prevents the inconsistency in the example scenario 'put - delete - (major-compact) - put - get': it provides the consistent result of 'get nothing'. But this result is also unacceptable for our customers, since they expect the later 'put' not to be masked by the earlier delete.
[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13689938#comment-13689938 ] Feng Honghua commented on HBASE-8721: - [~apurtell] / [~lhofhansl]: Maybe I am missing something here: is there a -1 for providing a config for this adjustment?
[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13689947#comment-13689947 ] Feng Honghua commented on HBASE-8721: - [~sershe] btw, HBase does support point version deletes as far as I see. So a specific version can be deleted if desired. Should we add APIs to delete latest version? We can even add an API to delete all existing versions; it won't be very efficient with many versions (scan or get + a bunch of deletes on the server side), but it will work without changing internals === Yes, I know what you mean. But what I mean is that deleteColumn (without providing a timestamp, AKA 'delete latest version') is not efficient, since it incurs a 'read' in the RS to get the timestamp of the latest version (and set it on the Delete type KV). This operation can be tuned by removing the 'read' in the RS. You can find the implementation details in one of my comments above.
[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp
[ https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689953#comment-13689953 ] Feng Honghua commented on HBASE-8753: - By design, a DeleteFamilyVersion KV's qualifier is null, hence its comparison to any qualifier must be == 0 or < 0, so execution can never enter the throw new IllegalStateException("isDelete failed: deleteBuffer=...") branch when an old RS handles a DeleteFamilyVersion. (The DeleteFamilyVersion will be mis-interpreted as a Delete type with qualifier = null.) And I'll do the backwards-compatibility test as you suggest. Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp --- Key: HBASE-8753 URL: https://issues.apache.org/jira/browse/HBASE-8753 Project: HBase Issue Type: New Feature Components: Deletes Reporter: Feng Honghua Attachments: HBASE-8753-0.94-V0.patch, HBASE-8753-trunk-V0.patch In one of our production scenarios (Xiaomi message search), multiple cells are put in batch using the same timestamp with different column names under a specific column-family. After some time these cells also need to be deleted in batch by giving a specific timestamp. But the column names are parsed tokens which can be arbitrary words, so such a batch delete is impossible without first retrieving all KVs from that CF to get the list of columns that have a KV with the given timestamp, and then issuing an individual deleteColumn for each column in that list. Though such a batch delete is possible, its performance is poor, and customers also find their code quite clumsy: first retrieving and populating the column list, then issuing a deleteColumn for each column in it. This feature resolves the problem by introducing a new delete flag: DeleteFamilyVersion. 1).
When you need to delete all KVs under a column-family with a given timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a DeleteFamilyVersion-type KV is put to HBase (like DeleteFamily / DeleteColumn / Delete), with no read operation; 2). Like other delete types, DeleteFamilyVersion takes effect in get/scan/flush/compact operations: the ScanDeleteTracker now parses out and uses DeleteFamilyVersion to prevent all KVs under the specific CF that have the same timestamp as the DeleteFamilyVersion KV from popping up as part of a get/scan result (and likewise in flush/compact). Our customers find this feature efficient, clean and easy to use, since it does its work without knowing the exact list of column names to delete. This feature has been running smoothly for a couple of months in our production clusters.
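The masking rule described in 1) and 2) above can be sketched as a small self-contained model (assumed names, not the actual ScanDeleteTracker code): within one column family, a DeleteFamilyVersion marker at timestamp T hides every cell whose timestamp equals T exactly, whatever its qualifier, with no qualifier list and no server-side read.

```java
import java.util.*;

// Self-contained sketch (assumed names, NOT the actual ScanDeleteTracker) of the
// masking rule DeleteFamilyVersion adds: within one column family, a marker at
// timestamp T hides every cell whose timestamp equals T exactly, whatever its
// qualifier -- no qualifier list and no server-side read are needed.
public class DeleteFamilyVersionSketch {
    record Cell(String qualifier, long ts, String value) {}

    final List<Cell> cells = new ArrayList<>();
    final Set<Long> familyVersionDeletes = new HashSet<>();

    void put(String qualifier, long ts, String value) { cells.add(new Cell(qualifier, ts, value)); }
    void deleteFamilyVersion(long ts) { familyVersionDeletes.add(ts); }  // one marker, no read

    List<Cell> scan() {  // what a get/scan returns after applying the markers
        return cells.stream().filter(c -> !familyVersionDeletes.contains(c.ts())).toList();
    }

    public static void main(String[] args) {
        DeleteFamilyVersionSketch cf = new DeleteFamilyVersionSketch();
        cf.put("token-a", 1000, "x");   // arbitrary, unknown qualifiers at ts = 1000...
        cf.put("token-b", 1000, "y");
        cf.put("token-a", 2000, "z");   // ...plus a version at another timestamp
        cf.deleteFamilyVersion(1000);
        System.out.println(cf.scan());  // only the ts = 2000 cell survives
    }
}
```

This is why the caller never needs to enumerate the parsed-token column names: the marker matches on timestamp alone.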
[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689975#comment-13689975 ] Feng Honghua commented on HBASE-8721: - [~sershe] : It still needs 10+ ms, at best, for checkAndPut / increment when the KV to read isn't in the RS's block-cache and needs a trip to HDFS, though doing this read in the RS does save 2 RPCs. Actually, two months ago one of Xiaomi's internal customers gave up using checkAndPut because they couldn't afford the poor performance, though they did admit they love the atomicity checkAndPut provides.
[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp
[ https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13687810#comment-13687810 ] Feng Honghua commented on HBASE-8753: - [~lhofhansl] It would be better to have it in 0.94. Thanks.
[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13688739#comment-13688739 ] Feng Honghua commented on HBASE-8755: - Thanks Chunhui for this verification test. We didn't test with small/medium write pressure; we'll run tests with small-to-medium write pressure soon and provide the numbers when done. A quick response on your test result: we never saw throughput as high as 24691 for a single RS in a cluster before we applied the new write thread model. We once did a series of stress tests for write throughput, and the maximum we ever got was about 1 using 1 YCSB client. Without patch: Write Threads: 200, Write Rows: 200, Consume Time: 80s, Avg TPS: 24691. With patch: Write Threads: 200, Write Rows: 200, Consume Time: 64s, Avg TPS: 30769. A new write thread model for HLog to improve the overall HBase write throughput --- Key: HBASE-8755 URL: https://issues.apache.org/jira/browse/HBASE-8755 Project: HBase Issue Type: Improvement Components: wal Reporter: Feng Honghua Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, HBASE-8755-trunk-V0.patch In the current write model, each write handler thread (executing put()) individually goes through a full 'append (hlog local buffer) -> HLog writer append (write to hdfs) -> HLog writer sync (sync hdfs)' cycle for each write, which incurs heavy contention on updateLock and flushLock. The only optimization, checking whether the current syncedTillHere already covers a txid in the expectation that another thread has helped write/sync it (and omitting the write/sync if so), actually helps much less than expected. Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi proposed a new write thread model for writing hdfs sequence files, and the prototype implementation shows a 4X throughput improvement (from 17000 to 7+).
I applied this new write thread model in HLog, and the performance test in our test cluster shows about a 3X throughput improvement (from 12150 to 31520 for 1 RS, from 22000 to 7 for 5 RS); the 1-RS write throughput (1K row-size) even beats that of BigTable (the Percolator paper published in 2011 says Bigtable's write throughput then was 31002). I can provide the detailed performance test results if anyone is interested. The change for the new write thread model is as below:
1. All put handler threads append their edits to HLog's local pending buffer (and notify the AsyncWriter thread that there are new edits in the local buffer).
2. All put handler threads wait in the HLog.syncer() function for the underlying threads to finish the sync that contains their txid.
3. A single AsyncWriter thread is responsible for retrieving all the buffered edits in HLog's local pending buffer and writing them to hdfs (hlog.writer.append); it notifies the AsyncFlusher thread that there are new writes to hdfs that need a sync.
4. A single AsyncFlusher thread is responsible for issuing a sync to hdfs to persist the writes made by AsyncWriter; it notifies the AsyncNotifier thread that the sync watermark has increased.
5. A single AsyncNotifier thread is responsible for notifying all pending put handler threads that are waiting in the HLog.syncer() function.
6. No LogSyncer thread any more (the AsyncWriter/AsyncFlusher threads already do the job it did).
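The steps above can be sketched with plain Java threads (a toy model, not the HLog patch itself; the writer/flusher/notifier stages are collapsed into one pipeline thread for brevity, "HDFS" is just a list, and all names are illustrative):

```java
import java.util.*;

// Toy model (plain Java threads, NOT the HLog patch itself) of the pipeline
// described above: handler threads append edits to a pending buffer and block
// until the sync watermark reaches their txid; a single pipeline thread drains
// the buffer, "persists" it, advances the watermark, and wakes the handlers.
public class WalPipelineSketch {
    final List<Long> pendingBuffer = new ArrayList<>(); // step 1: txids of buffered edits
    final List<Long> persisted = new ArrayList<>();     // stands in for HDFS
    long nextTxid = 0;                                  // guarded by bufferLock
    volatile long syncedTillHere = 0;                   // the sync watermark
    boolean done = false;                               // guarded by bufferLock
    final Object bufferLock = new Object(), syncLock = new Object();

    long append() {                                     // a put handler appends an edit (step 1)
        synchronized (bufferLock) {
            long txid = ++nextTxid;
            pendingBuffer.add(txid);
            bufferLock.notify();                        // wake the pipeline thread
            return txid;
        }
    }

    void syncer(long txid) throws InterruptedException { // step 2: block until txid is synced
        synchronized (syncLock) {
            while (syncedTillHere < txid) syncLock.wait();
        }
    }

    void pipelineLoop() {                               // steps 3-5 in one thread
        while (true) {
            List<Long> batch;
            synchronized (bufferLock) {
                while (pendingBuffer.isEmpty() && !done) {
                    try { bufferLock.wait(); } catch (InterruptedException e) { return; }
                }
                if (pendingBuffer.isEmpty()) return;    // done and fully drained
                batch = new ArrayList<>(pendingBuffer);
                pendingBuffer.clear();
            }
            persisted.addAll(batch);                    // "writer append" + "sync" in one go
            synchronized (syncLock) {                   // advance watermark, wake handlers
                syncedTillHere = Math.max(syncedTillHere, Collections.max(batch));
                syncLock.notifyAll();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        WalPipelineSketch wal = new WalPipelineSketch();
        Thread pipeline = new Thread(wal::pipelineLoop);
        pipeline.start();
        Thread[] handlers = new Thread[4];              // 4 put handler threads, 100 edits each
        for (int i = 0; i < handlers.length; i++) {
            handlers[i] = new Thread(() -> {
                for (int j = 0; j < 100; j++) {
                    try { wal.syncer(wal.append()); } catch (InterruptedException ignored) {}
                }
            });
            handlers[i].start();
        }
        for (Thread h : handlers) h.join();
        synchronized (wal.bufferLock) { wal.done = true; wal.bufferLock.notify(); }
        pipeline.join();
        System.out.println("persisted " + wal.persisted.size() + " edits"); // 400
    }
}
```

The key property is batching: many handler txids are persisted by one drain-and-sync cycle, which is where the throughput gain over per-write append/sync comes from.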
[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13688754#comment-13688754 ] Feng Honghua commented on HBASE-8755: - [~zjushch] What's the row size used in your test? You tested against a backing HDFS, not local disk, right? And would you test using a bit more data (such as 5,000,000 - 10,000,000 rows)? Thanks.
[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13686700#comment-13686700 ] Feng Honghua commented on HBASE-8755: - Thanks [~yuzhih...@gmail.com] and [~stack] for the detailed review. I made and attached an updated patch based on trunk according to your reviews. Below are answers to some of the important questions Ted/stack raised in the reviews (I have already answered some of Ted's in the comment above):
[Ted] AsyncNotifier does notification by calling syncedTillHere.notifyAll(). Can this part be folded into AsyncFlusher? === AsyncNotifier competes for syncedTillHere with all the write handler threads (which may have finished appendNoSync but not yet be pending in syncer()). Performance is better when AsyncSyncer (which just gets notified, does the 'sync', and then notifies AsyncNotifier) and AsyncNotifier (which gets notified by AsyncSyncer and wakes up all pending write handler threads) are kept separate.
[stack] Any idea of its effect on general latencies? Does HLogPerformanceEvaluation help evaluate this approach? Did you deploy this code to production? === I didn't run HLogPerformanceEvaluation for the performance comparison; instead I used 5 YCSB clients to concurrently press on a single RS backed by a 5-data-node HDFS. Everything is the same in the tests of the old/new write thread models except that the RS bits differ. We have been testing it in the test cluster for a month, but it is not deployed to production yet. Below is the detailed performance comparison for your reference.
a) 5 YCSB clients, each with 80 concurrent write threads (auto-flush = true); b) each YCSB client writes 5,000,000 rows; c) all 20 regions of the target table are moved to a single RS.
Old write thread model:
row size(bytes) latency(ms) QPS
2000 37.3 10715
1000 32.8 12149
500 30.9 12891
200 26.9 14803
10 24.5 16288
New write thread model:
row size(bytes) latency(ms) QPS
2000 17.3 23024
1000 12.6 31523
500 11.7 33893
200 11.4 34876
10 11.1 35804
[stack] Can I still (if only optionally) sync every write as it comes in? (For the paranoid.) === Not for now; I'll consider how to make it configurable later on.
[stack] Regards the above, the test is no longer valid given the indirection around sync/flush? === Yes, that test is no longer valid under the new write thread model.
[stack] To be clear, when we call doWrite, we just append the log edit to a linked list? (We call it a bufferLock but we are just appending to the linked list?) === Yes, in both the old and new write thread models what doWrite does is just append the log edit to a linked list, which plays the role of a 'local' buffer for log edits that haven't hit hdfs yet.
[stack] How does deferred log flush still work when you remove stuff like optionalFlushInterval? You say '...don't pend on HLog.syncer() waiting for its txid to be sync-ed' but that is another behavior than what we had here previously. === When I say 'still support deferred log flush' I mean that for 'deferred log flush' we can still respond success to the client without waiting/pending on syncer(txid); in this sense the AsyncWriter/AsyncSyncer threads do what the previous LogSyncer did from the point of view of the write handler threads: clients don't wait for the write to persist before getting a success response.
[jira] [Updated] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Honghua updated HBASE-8755: Attachment: HBASE-8755-trunk-V0.patch new write thread model patch based on trunk
[jira] [Updated] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Honghua updated HBASE-8755: Attachment: HBASE-8755-0.94-V1.patch an updated patch based on 0.94 according to Ted/stack's review
[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13686711#comment-13686711 ] Feng Honghua commented on HBASE-8755: - HBASE-8755-trunk-V0.patch also includes changes according to the review comments from Ted/stack. Thanks again to Ted/stack for the detailed review :-)
[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13686818#comment-13686818 ] Feng Honghua commented on HBASE-8755: - [~otis] Yes, QPS really means Writes Per Second here. A typo. btw: My name is Feng Honghua, not Feng Hua :-)
[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13686862#comment-13686862 ] Feng Honghua commented on HBASE-8721: - Another benefit of the behaviour 'a delete can't mask puts that happen after it' (in essence, mvcc also participates in delete handling): the 'delete latest version' (deleteColumn() without timestamp) can get better performance by removing the read operation in the RS that fetches the timestamp of the latest version and sets it on the delete. Below is the updated process for 'delete latest version' (under the 'a delete can't mask puts that happen after it' behaviour): 1. deleteColumn() (without timestamp) is issued by the client, with its timestamp set to an 'invalid' value (0/-1 is a good candidate) to indicate 'delete the latest version'. The RS just puts this Delete-type KV, like the other delete types, without a read operation. 2. During a Get/Scan, by timestamp = 0/-1 we know this delete is meant to delete the latest version, and we check the KVs it sees: the first KV with mvcc < the mvcc of this delete is the 'latest' version as of when the delete entered the RS. After this first KV is deleted (masked, with its mvcc checked), the 'delete latest version' marker also needs to be removed from the ScanDeleteTracker. That's all. Then why can't we achieve such a light-weight (read-free) 'delete latest version' today? The root cause is the 'a delete can mask puts that happen after it' behaviour, which doesn't use mvcc in delete handling. When issuing 'delete latest version' (deleteColumn() without timestamp), the real semantic is 'delete the latest one of all the currently EXISTING versions', where EXISTING means versions written BEFORE the delete entered the RS; BEFORE is a concept of operation ordering (indicated by mvcc), which can't be represented by timestamp. Then why can't we handle 'delete latest version' without a read, as in the above process?
Because a newer version can be put with a bigger timestamp (later, by timestamp, than the 'current' latest when the delete entered the RS), and under the behaviour 'a delete can mask puts that happen after it' (whose essence is to decide whether a kv is masked by a delete solely by comparing their timestamps), a 'delete latest version' delete can't tell whether the first version it sees is the latest version as of when it hit the RS (in fact it could use mvcc to get this information, but it doesn't). Certainly we could use mvcc only for 'delete latest version' to get the (remarkable) performance gain of removing the read operation, but that seems inconsistent in that we would handle deletes internally in different ways (one using mvcc, the others not). Deletes can mask puts that happen after the delete -- Key: HBASE-8721 URL: https://issues.apache.org/jira/browse/HBASE-8721 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Feng Honghua Attachments: HBASE-8721-0.94-V0.patch This fix aims at the bug mentioned in http://hbase.apache.org/book.html 5.8.2.1: Deletes mask puts, even puts that happened after the delete was entered. Remember that a delete writes a tombstone, which only disappears after the next major compaction has run. Suppose you do a delete of everything <= T. After this you do a new put with a timestamp <= T. This put, even if it happened after the delete, will be masked by the delete tombstone. Performing the put will not fail, but when you do a get you will notice the put had no effect. It will start working again after the major compaction has run. These issues should not be a problem if you use always-increasing versions for new puts to a row. But they can occur even if you do not care about time: just do delete and put immediately after each other, and there is some chance they happen within the same millisecond.
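The masking described in this report can be reproduced with a tiny in-memory model of one cell (a sketch with invented names, not the HBase read path): a tombstone at timestamp T hides any put with timestamp at or below T, whatever the arrival order, until a "major compaction" drops it.

```java
import java.util.ArrayList;
import java.util.List;

public class TombstoneDemo {
    private final List<long[]> puts = new ArrayList<>();  // {timestamp, value}
    private long tombstoneTs = Long.MIN_VALUE;            // delete of everything <= ts

    public void put(long ts, long value) { puts.add(new long[] { ts, value }); }

    public void delete(long ts) { tombstoneTs = Math.max(tombstoneTs, ts); }

    // Get the newest visible value, or null if everything is masked.
    public Long get() {
        Long best = null;
        long bestTs = Long.MIN_VALUE;
        for (long[] kv : puts) {
            if (kv[0] <= tombstoneTs) continue;           // masked by the tombstone
            if (kv[0] > bestTs) { bestTs = kv[0]; best = kv[1]; }
        }
        return best;
    }

    // Major compaction drops the tombstone and the cells it masked.
    public void majorCompact() {
        puts.removeIf(kv -> kv[0] <= tombstoneTs);
        tombstoneTs = Long.MIN_VALUE;
    }
}
```

Putting at timestamp T, deleting at T, then putting at T again leaves get() returning null until majorCompact() runs, which is exactly the reported behaviour.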
[jira] [Updated] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp
[ https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Honghua updated HBASE-8753: Attachment: HBASE-8753-trunk-V0.patch [~zjushch]/[~stack] patch for trunk attached, thanks in advance for the review Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp --- Key: HBASE-8753 URL: https://issues.apache.org/jira/browse/HBASE-8753 Project: HBase Issue Type: New Feature Components: Deletes Reporter: Feng Honghua Attachments: HBASE-8753-0.94-V0.patch, HBASE-8753-trunk-V0.patch In one of our production scenarios (Xiaomi message search), multiple cells are put in batch, using the same timestamp, with different column names under a specific column-family. After some time these cells also need to be deleted in batch, given a specific timestamp. But the column names are parsed tokens which can be arbitrary words, so such a batch delete is impossible without first retrieving all KVs from that CF, collecting the list of columns that have a KV with the given timestamp, and then issuing an individual deleteColumn for each column in that list. Though such a batch delete is possible, its performance is poor, and customers also find the resulting code quite clumsy. This feature resolves the problem by introducing a new delete flag: DeleteFamilyVersion. 1). When you need to delete all KVs under a column-family with a given timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a DeleteFamilyVersion type KV is put to HBase (like DeleteFamily / DeleteColumn / Delete) without a read operation; 2).
Like other delete types, DeleteFamilyVersion takes effect in get/scan/flush/compact operations: the ScanDeleteTracker now parses out DeleteFamilyVersion and uses it to prevent all KVs under the specific CF that have the same timestamp as the DeleteFamilyVersion KV from showing up in a get/scan result (and likewise in flush/compact). Our customers find this feature efficient, clean and easy to use, since it does its work without knowing the exact list of column names to be deleted. This feature has been running smoothly for a couple of months in our production clusters.
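What the DeleteFamilyVersion marker does at read time can be illustrated with a stand-in filter (invented types, not the real KeyValue/ScanDeleteTracker classes): every KV in the family whose timestamp equals the marker's timestamp is masked, with no per-column bookkeeping and no prior read.

```java
import java.util.ArrayList;
import java.util.List;

public class DeleteFamilyVersionDemo {
    static class KV {
        final String family, qualifier;
        final long ts;
        KV(String f, String q, long ts) { this.family = f; this.qualifier = q; this.ts = ts; }
    }

    // Apply one DeleteFamilyVersion(family, ts) marker to a scan result:
    // drop every KV in that family carrying exactly that timestamp.
    static List<KV> scan(List<KV> kvs, String delFamily, long delTs) {
        List<KV> visible = new ArrayList<>();
        for (KV kv : kvs) {
            if (kv.family.equals(delFamily) && kv.ts == delTs) continue; // masked
            visible.add(kv);
        }
        return visible;
    }
}
```

Note how the filter never needs the qualifier list: arbitrary token columns written at the same timestamp are all removed by the single marker, which is the whole point of the feature.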
[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp
[ https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13687506#comment-13687506 ] Feng Honghua commented on HBASE-8753: - Answers to some questions below: [chunhui] Could we reuse the tag of VERSION_DELETED (rather than introducing a new FAMILY_VERSION_DELETED)? === Per my understanding, the result tag indicates which type of delete masked the kv. Introducing a new tag here makes sense, since it distinguishes itself from VERSION_DELETED, which means 'masked by a DeleteColumn'. [chunhui] Maybe we need a better name instead of DeleteFamilyVersion === DeleteFamilyVersion is the best name I can come up with so far. We can keep it until you recommend a better one. :-) [stack] Are there changes to KV missing? === No. Let me know if you feel something is missing/wrong with KV. Thanks :-) [[~lhofhansl]]: Do you want to only delete columns of a specific version, or columns older than a specific version? === What this patch does is the former: only delete columns of a specific version (without providing their column names). In your use-case, do you ever want to keep version X but target X+1 for delete? === Yes, this is exactly the effect this patch aims for. Will this break backwards-compatibility during rolling restarts? (because of the new KV type) === Yes, old RS bits will ignore DeleteFamilyVersion type KVs written by a new client.
[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp
[ https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13685019#comment-13685019 ] Feng Honghua commented on HBASE-8753: - Thanks chunhui for the review, I'll make a patch for trunk soon. For the name, do you have a better alternative?
[jira] [Created] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
Feng Honghua created HBASE-8755: --- Summary: A new write thread model for HLog to improve the overall HBase write throughput Key: HBASE-8755 URL: https://issues.apache.org/jira/browse/HBASE-8755 Project: HBase Issue Type: Improvement Components: wal Reporter: Feng Honghua In the current write model, each write handler thread (executing put()) individually goes through a full 'append (hlog local buffer) => HLog writer append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, which incurs heavy contention on updateLock and flushLock. The only existing optimization - checking whether the current syncTillHere already covers our txid, in the hope that another thread has written/synced our edits to hdfs so the write/sync can be omitted - actually helps much less than expected. Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi proposed a new write thread model for writing hdfs sequence files, and the prototype implementation shows a 4X throughput improvement (from 17000 to 7+). I applied this new write thread model to HLog, and the performance test in our test cluster shows about a 3X throughput improvement (from 12150 to 31520 for 1 RS, from 22000 to 7 for 5 RS); the 1 RS write throughput (1K row size) even beats that of BigTable (the Percolator paper published in 2011 says Bigtable's write throughput then was 31002). I can provide the detailed performance test results if anyone is interested.
[jira] [Updated] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Honghua updated HBASE-8755: Attachment: HBASE-8755-0.94-V0.patch the patch HBASE-8755-0.94-V0.patch is based on http://svn.apache.org/repos/asf/hbase/branches/0.94
[jira] [Commented] (HBASE-7280) TableNotFoundException thrown in peer cluster will incur endless retry for shipEdits, which in turn block following normal replication
[ https://issues.apache.org/jira/browse/HBASE-7280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13685495#comment-13685495 ] Feng Honghua commented on HBASE-7280: - [~jeason] You can refer to HBASE-8751 for per-peer cf/table granularity replication TableNotFoundException thrown in peer cluster will incur endless retry for shipEdits, which in turn block following normal replication -- Key: HBASE-7280 URL: https://issues.apache.org/jira/browse/HBASE-7280 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.2 Reporter: Feng Honghua Original Estimate: 0.5h Remaining Estimate: 0.5h In cluster replication, suppose the master cluster has 2 tables with a column-family declared with replication scope = 1, and a peer cluster is added which has only 1 table (with the same name as in the master cluster). In the ReplicationSource (a thread in the master cluster) for this peer, edits (logs) for both tables will be shipped to the peer; the peer will fail to apply the edits for the missing table due to TableNotFoundException, and this exception is also returned to the original shipper (the ReplicationSource in the master cluster). The shipper then falls into an endless retry of shipping the failed edits, without proceeding to read the remaining (newer) log files and ship the following edits (which may be the normal, expected edits for the registered table). The symptom looks like the TableNotFoundException incurs endless retry and blocks normal table replication.
[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13685655#comment-13685655 ] Feng Honghua commented on HBASE-8721: - Thanks guys for your feedback: [~apurtell], [~sershe], [~stack], [~lhofhansl] I summarize the issues/proposals as below: A). We all agree this IS a bug: 1 put a kv (timestamp = T0), and flush; 2 delete that kv using a DeleteColumn type kv with timestamp T0 (or any timestamp >= T0), and flush; 3 a major compact occurs [or not]; 4 put that kv again (timestamp = T0); 5 read that kv; a) if a major compact occurs at step 3, then step 5 will get the put written at step 4; b) if no major compact occurs at step 3, then step 5 gets nothing. B). [~stack] proposes to keep all deleted cells. This can be achieved either by turning on KeepDeletedCells for ColumnFamilies or by degenerating major compactions to minor compactions (I guess you mean the former). But both options result in a bigger data size than expected. C). [~lhofhansl] suggests introducing a config for a Table/CF to disallow clients to set timestamps on put. As a config, it means a client can still create tables/CFs that allow explicitly set timestamps, and for those tables/CFs the bug of A) still exists. D). As [~lhofhansl] said, the timestamp is part of the schema; it's visible to and can be set by the client, hence it can be exploited by the client for more general usage. By 'general' I mean it's not limited to a 'time' semantic, but serves as an ordinary dimension of a cell's coordinate. Such treatment can lead to many innovative schema designs that address more complicated real-world problems. Facebook uses the msg-id as the timestamp in their message search index CF. When using the timestamp as an ordinary dimension of a cell's coordinate, a cell naturally has only one 'version' in the app context, and the CF usually sets MaxVersions (in the HBase context) to the maximum value to accommodate as many different cells as possible.
The client who uses the timestamp in such a general way takes care of all the subtlety derived from this semantic change. Facebook's design details can be found in the book 'HBase: The Definitive Guide' - Chapter 9 Advanced Usage - Search Integration (page 374) or in this blog: http://www.facebook.com/notes/facebook-engineering/inside-facebook-messages-application-server/10150162742108920. Disabling client-set timestamps, or limiting the timestamp to a 'time' semantic only, would prohibit such innovative usage. As said, a good language/platform/product encourages and enables innovative extensions/usages beyond the original designer's imagination. We do expect HBase to be such a platform/product, right? E). [~apurtell] said: This section of the book describes expected behavior. This is not a bug. I disagree. That section's title explicitly says it's 'current limitations' and explains in detail why. It is by nature not an acceptable behaviour; it runs counter to common sense and intuition. It only seems 'expected behaviour' because it has existed from the very beginning. F). [~lhofhansl] said: HBase allows you to set the timestamps to influence the logical order in which things (are declared to have) happened. If you do not want strange behavior do not date Deletes into the future and Puts into the past. Period. As in the bug of A), strange behaviour occurs even when the Delete and the Put are dated with the same timestamp (one effectively 'the future' and the other 'the past'). (We allow setting the timestamp, and we do set it.) We get strange (buggy) behaviour when we put - delete - put - get that very same KV with that same timestamp. Isn't it weird? G). [~lhofhansl] said: If we did not have that, as-of-time queries would be broken and we would break the idempotent nature of operations in HBase. For the idempotent nature of operations in HBase, my understanding is that a series of Puts (or Deletes) for the same cell (exactly the same coordinate:value) will result in an eventually identical result.
But it's expected to be broken if Puts are interleaved by Deletes (or Deletes by Puts). Such a break of the idempotent nature is acceptable in my opinion. Even if we don't change the behaviour 'Deletes can mask puts that happen after the delete', the scenario in A) still breaks the idempotent nature: we put the very same cell multiple times, but the results can turn out to be different when interleaved by Deletes (together with the effect of a major compact). H). Since HBase is modeled after BigTable, it makes sense to align the Delete behaviour here with BigTable, right? I). At last, I think we need to have an open mind on this issue, not just suggest a workaround at the cost of HBase's inherent flexibility.
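The mvcc-based 'delete latest version' handling discussed earlier in this thread can be modeled in a few lines (all names invented; a sketch of the idea, not HBase internals): scanning newest-timestamp-first, the marker masks exactly the first KV whose mvcc is below the marker's own mvcc, then spends itself, so puts that arrive after the delete stay visible regardless of their timestamps.

```java
import java.util.ArrayList;
import java.util.List;

public class MvccDeleteLatestDemo {
    static class KV {
        final long ts, mvcc, value;
        KV(long ts, long mvcc, long value) { this.ts = ts; this.mvcc = mvcc; this.value = value; }
    }

    // Return visible values, newest timestamp first, applying one
    // delete-latest marker written at deleteMvcc (pass -1 for no delete).
    static List<Long> read(List<KV> kvs, long deleteMvcc) {
        List<KV> sorted = new ArrayList<>(kvs);
        sorted.sort((a, b) -> Long.compare(b.ts, a.ts));  // newest first
        List<Long> out = new ArrayList<>();
        boolean deleteSpent = false;
        for (KV kv : sorted) {
            if (!deleteSpent && kv.mvcc < deleteMvcc) {
                deleteSpent = true;   // mask only the latest KV that existed
                continue;             // when the delete entered the RS
            }
            out.add(kv.value);
        }
        return out;
    }
}
```

With puts at mvcc 1 and 2, a delete-latest at mvcc 3, and a later put at mvcc 4, only the mvcc-2 value is masked; the mvcc-4 put survives even though the delete was issued without knowing any timestamps, which is the read-free semantics argued for above.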
[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput
[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13685671#comment-13685671 ] Feng Honghua commented on HBASE-8755: - Thanks [~yuzhih...@gmail.com] for the detailed code review; answers to some important questions below: 1) Deferred log sync is still supported, in which case the write handler thread in Region doesn't pend in HLog.syncer() waiting for its txid to be synced. This behaviour stays intact. 2) Good catch: moving the 'if (txid <= this.pendingTxid) return;' in setPendingTxid() into the synchronized (this.writeLock) {...} block below is OK. bufferLock is used only to guarantee that accesses to HLog's local pending buffer and unflushedEntries can't be interleaved by multiple threads. 3) 'failedTxid is still assigned to this.lastWrittenTxid. Is that safe?' Yes: all writes pending in syncer() with txid <= failedTxid will get a failure. lastWrittenTxid can safely proceed without incorrect behaviour, and it needs to proceed in order to eventually wake up the pending write handler threads. I'll update the patch per your review tomorrow.
[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13685699#comment-13685699 ] Feng Honghua commented on HBASE-8721: - [~apurtell] Sorry if I offended, and thanks again for your patient comment. All we need here is to clarify the issue and make HBase better. I don't think that is a correct statement. == You mean the behaviour of A) is correct and acceptable? Can you say a bit more about what exactly your clients are doing with timestamps? == We have a usage of timestamps similar to Facebook's; I provided the link/reference describing how Facebook uses timestamps for their message search index. In short, the msg-id is used as the timestamp to provide a reverse link from each term/token appearing in a msg back to its original msg. When a msg is deleted, all the kvs of its related terms/tokens are deleted as well, and when the user restores the deleted msg from the deleted-folder, the term/token kvs are inserted again. But with the current delete behaviour, the re-inserting of term/token kvs for a restored msg can be inconsistent - the same scenario as A). For [~sershe]'s explanation: If you are setting explicit timestamps, you are explicitly telling HBase that it should withhold judgement about versions because you know what happens logically before and after in your system. If you are using timestamp otherwise for some convenience, you are misusing it. == We set explicit timestamps and we don't want judgement about versions (refer to the description of our scenario above), but the behaviour 'Deletes mask puts that happen after the delete' puts us in a difficult situation. Actually, if we set an explicit timestamp, the timestamp can't be the 'current' time when the put hits the RS, so this timestamp can seldom carry a 'time' semantic in this sense, since it's inaccurate for time ordering.
So 'If you are using timestamp otherwise for some convenience, you are misusing it' almost equals 'setting explicit timestamps is misusing it'? If this version semantic is removed, timestamp becomes simply a long tucked onto a KeyValue and should be removed; after all, we don't have a string or a boolean also added to KeyValue so that people could use them for their purposes. HBase already has columns and column families to do that. Timestamp has very explicit semantics and purpose right now. If you want time-based behavior then don't set timestamps and HBase will use time-based behavior. == Another 'long' tucked onto a KeyValue is not unnecessary, even though HBase already has columns and column-families. In the Facebook message search index scenario, using the msg-id as the timestamp is an innovative way to build the reverse lookup index atomically, by leveraging the row transaction. Otherwise the reverse lookup index can't be built atomically, since the msg and the msg-search-index of a given user can span multiple rows.
[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp
[ https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13685702#comment-13685702 ] Feng Honghua commented on HBASE-8753: - OK. thanks [~yuzhih...@gmail.com] for the code review. Was deleteFamilyStamp the previous name for this feature ? == No. can't come up with a better name for this log then. deleteFamilyVersion has no 'timestamp' meaning, and deleteFamilyVersionStamp is too long, so just remove the 'Version' and remain 'Stamp' to use it to indicate 'timestamp'... Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp --- Key: HBASE-8753 URL: https://issues.apache.org/jira/browse/HBASE-8753 Project: HBase Issue Type: New Feature Components: Deletes Reporter: Feng Honghua Attachments: HBASE-8753-0.94-V0.patch In one of our production scenario (Xiaomi message search), multiple cells will be put in batch using a same timestamp with different column names under a specific column-family. And after some time these cells also need to be deleted in batch by given a specific timestamp. But the column names are parsed tokens which can be arbitrary words , so such batch delete is impossible without first retrieving all KVs from that CF and get the column name list which has KV with that given timestamp, and then issuing individual deleteColumn for each column in that column-list. Though it's possible to do such batch delete, its performance is poor, and customers also find their code is quite clumsy by first retrieving and populating the column list and then issuing a deleteColumn for each column in that column-list. This feature resolves this problem by introducing a new delete flag: DeleteFamilyVersion. 1). 
When you need to delete all KVs under a column family with a given timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a DeleteFamilyVersion-type KV is put to HBase (like DeleteFamily / DeleteColumn / Delete), with no read operation; 2). Like the other delete types, DeleteFamilyVersion takes effect in get/scan/flush/compact operations: ScanDeleteTracker now parses out DeleteFamilyVersion and uses it to prevent any KV under the specific CF that has the same timestamp as the DeleteFamilyVersion KV from appearing in a get/scan result (and likewise in flush/compact). Our customers find this feature efficient, clean, and easy to use, since it does its work without knowing the exact list of column names to be deleted. This feature has been running smoothly for a couple of months in our production clusters.
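The filtering rule described above can be sketched as a small, self-contained Python model. This is purely illustrative (the real logic lives in HBase's ScanDeleteTracker, in Java); the names and tuple layout here are hypothetical:

```python
# Simplified model of DeleteFamilyVersion semantics: a single marker
# (cf, ts) hides every cell in that column family whose timestamp
# equals ts, regardless of column name. Not the actual HBase code.

def visible_cells(cells, family_version_deletes):
    """cells: list of (cf, column, ts, value) tuples.
    family_version_deletes: set of (cf, ts) DeleteFamilyVersion markers."""
    return [c for c in cells if (c[0], c[2]) not in family_version_deletes]

cells = [
    ("cf1", "tokenA", 100, "v1"),
    ("cf1", "tokenB", 100, "v2"),
    ("cf1", "tokenA", 200, "v3"),
    ("cf2", "tokenA", 100, "v4"),
]

# One marker on cf1 at ts=100 masks the first two cells (whatever their
# column names) but leaves the ts=200 cell and the cf2 cell visible.
remaining = visible_cells(cells, {("cf1", 100)})
```

This captures why the feature helps: the caller never needs the column-name list, only the CF and the timestamp.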
[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13686302#comment-13686302 ] Feng Honghua commented on HBASE-8721: - Some further drawbacks of the config that disallows clients from setting timestamps explicitly, from [~lhofhansl]: 1). No easy way to delete a specific version other than the latest: the client needs to read all versions out, find the timestamp of the version it wants to delete, and then issue deleteColumn(). The reason is that the client doesn't know the exact timestamp of each version at Put time. 2). Performance is poor for deleting a single version (rather than all versions of a cell): every version delete needs to read the timestamp before deleting, and even deleteColumn() without a timestamp (deleting the latest version) needs to read the latest timestamp in the RS, though that is transparent to the client. 3). If puts don't set a timestamp, multiple puts of the same KV (the same from the client's perspective) get different timestamps when they hit the RS, and so are different KVs from HBase's perspective; they occupy multiple versions and knock out earlier 'real' versions. Repeated puts of a 'same' KV (without timestamp) can thus yield different version lists for that cell in HBase. This is not idempotent in the strict sense. 4). Even with explicit timestamps disallowed, strange behavior can still arise from clock skew or timestamp granularity (puts/deletes can share the same millisecond timestamp); HBASE-2256 is an example of the latter. Deletes can mask puts that happen after the delete -- Key: HBASE-8721 URL: https://issues.apache.org/jira/browse/HBASE-8721 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Feng Honghua Attachments: HBASE-8721-0.94-V0.patch this fix aims for the bug described in http://hbase.apache.org/book.html 5.8.2.1: Deletes mask puts, even puts that happened after the delete was entered.
Remember that a delete writes a tombstone, which only disappears after the next major compaction has run. Suppose you do a delete of everything <= T. After this you do a new put with a timestamp <= T. This put, even if it happened after the delete, will be masked by the delete tombstone. Performing the put will not fail, but when you do a get you will notice the put had no effect. It will start working again after the major compaction has run. These issues should not be a problem if you use always-increasing versions for new puts to a row. But they can occur even if you do not care about time: just do a delete and a put immediately after each other, and there is some chance they happen within the same millisecond.
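The tombstone behavior quoted above can be modeled in a few lines of Python. This is a minimal sketch of the semantics, not HBase code: a delete-everything marker at timestamp T masks any put with ts <= T, even one written after the delete, until a major compaction drops the marker:

```python
# Toy model: reads see a put only if its timestamp is greater than the
# tombstone's timestamp, or if a major compaction has already dropped
# the tombstone. Purely illustrative.

def get(puts, tombstone_ts, compacted):
    """puts: list of (ts, value) pairs; returns newest visible put or None."""
    survivors = [p for p in puts if compacted or p[0] > tombstone_ts]
    return max(survivors, key=lambda p: p[0], default=None)

T = 100
puts = [(100, "new-value")]  # put issued AFTER the delete, but with ts <= T

masked_read = get(puts, tombstone_ts=T, compacted=False)      # masked
after_compaction = get(puts, tombstone_ts=T, compacted=True)  # visible again
```

The point of the model is the asymmetry: the put "succeeds", yet whether a subsequent get sees it depends only on whether the tombstone has been compacted away.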
[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13686327#comment-13686327 ] Feng Honghua commented on HBASE-8721: - On the contrary, I personally think allowing the client to set timestamps explicitly gives the client the ability to EXACTLY control the version semantics, free from the impact of clock skew or HBASE-2256. By explicitly setting the timestamp of each KV it puts, the client knows exactly, at any point in time, which versions will survive, without worrying about exceptional cases such as clock skew or HBASE-2256. Below are the delete/version behaviors I consider acceptable: 1. A version is determined by timestamp only (same as the current semantics); HBase decides which version survives (in Scan/Compact etc.) only by timestamp. 2. A delete can only mask puts that happened before it ('before' here is measured by the vector clock, mvcc in HBase, not by timestamp). All puts that happened before a delete are candidates to be masked by it, but whether a candidate put is actually masked further depends on whether its timestamp is smaller than or equal to the delete's timestamp. So the delete semantics are: delete an existing exact version (deleteColumn) or all existing smaller versions (deleteColumns / deleteFamily). These two version/delete semantics do not conflict. Deletes can mask puts that happen after the delete -- Key: HBASE-8721 URL: https://issues.apache.org/jira/browse/HBASE-8721 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Feng Honghua Attachments: HBASE-8721-0.94-V0.patch this fix aims for the bug described in http://hbase.apache.org/book.html 5.8.2.1: Deletes mask puts, even puts that happened after the delete was entered.
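The two-condition masking rule proposed in the comment above (mvcc ordering first, then timestamp comparison) can be written down as a tiny predicate. A minimal sketch, assuming simplified integer mvcc numbers standing in for HBase's internal write numbers:

```python
# Proposed semantics: a delete masks a put only if the put happened
# BEFORE the delete (put_mvcc < del_mvcc), and among those candidates
# only if the put's timestamp is <= the delete's timestamp
# (the deleteColumns / deleteFamily case). Illustration only.

def masked(put_ts, put_mvcc, del_ts, del_mvcc):
    return put_mvcc < del_mvcc and put_ts <= del_ts

# delete at ts=200, mvcc=2:
before = masked(put_ts=100, put_mvcc=1, del_ts=200, del_mvcc=2)  # masked
after  = masked(put_ts=100, put_mvcc=3, del_ts=200, del_mvcc=2)  # survives
```

Under this rule, re-issuing the identical KV after the delete is never masked, which is exactly the adjustment the comment argues for, while the timestamp-based version semantics among surviving puts are untouched.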
[jira] [Created] (HBASE-8751) Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster
Feng Honghua created HBASE-8751: --- Summary: Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster Key: HBASE-8751 URL: https://issues.apache.org/jira/browse/HBASE-8751 Project: HBase Issue Type: Improvement Components: Replication Reporter: Feng Honghua Consider these scenarios (all CFs have replication-scope=1): 1) cluster S has 3 tables: table A with cfA,cfB, table B with cfX,cfY, table C with cf1,cf2. 2) cluster X wants to replicate table A : cfA, table B : cfX and table C from cluster S. 3) cluster Y wants to replicate table B : cfY and table C : cf2 from cluster S. The current replication implementation can't achieve this, since it pushes the data of all replicatable column families from cluster S to all its peers, X and Y in this scenario. This improvement provides a fine-grained replication scheme which enables a peer cluster to choose the column families/tables it really wants from the source cluster: A). Set the table:cf-list for a peer when adding it: hbase-shell add_peer '3', zk:1100:/hbase, table1; table2:cf1,cf2; table3:cf2 B). View the table:cf-list config for a peer using show_peer_tableCFs: hbase-shell show_peer_tableCFs 1 C). Change/set the table:cf-list for a peer using set_peer_tableCFs: hbase-shell set_peer_tableCFs '2', table1:cfX; table2:cf1; table3:cf1,cf2 In this scheme, replication-scope=1 only means a column family CAN be replicated to other clusters; the table:cf-list alone determines WHICH cf/table will actually be replicated to a specific peer. For backward compatibility, an empty table:cf-list replicates all replicatable cfs/tables. (This means we don't allow a peer that replicates nothing from a source cluster, which we think is reasonable: if it replicates nothing, why add the peer at all?) This improvement addresses exactly the problem raised by the first FAQ at http://hbase.apache.org/replication.html: GLOBAL means replicate?
Any provision to replicate only to cluster X and not to cluster Y? Or is that for later? Yes, this is for much later. I also noticed somebody suggested making replication-scope an integer rather than a boolean for such fine-grained replication, but I think extending replication-scope can't achieve the same flexibility in replication granularity as the per-peer configuration above. This improvement has been running smoothly in our production clusters (Xiaomi) for several months.
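The decision rule described above (replication-scope gates whether a CF CAN replicate; the per-peer table:cf-list decides whether it actually DOES) can be sketched in Python. The function name and data layout are hypothetical illustrations, not the patch's actual code:

```python
# Model of the per-peer filter: a cell replicates to a peer iff its CF
# has replication-scope=1 AND the peer's table:cf map either is empty
# (back-compat: replicate everything replicatable), or names this table
# with either no CF restriction or this CF listed.

def replicates(table, cf, scope, peer_table_cfs):
    if scope != 1:
        return False           # scope gates replicability outright
    if not peer_table_cfs:
        return True            # empty config = all replicatable cfs/tables
    cfs = peer_table_cfs.get(table)
    if cfs is None:
        return False           # table not chosen by this peer
    return not cfs or cf in cfs  # empty set = whole table

# Cluster X from the scenario: tableA:cfA, tableB:cfX, all of tableC.
peer_x = {"tableA": {"cfA"}, "tableB": {"cfX"}, "tableC": set()}
```

For instance, `replicates("tableA", "cfB", 1, peer_x)` is false even though cfB is scope-1, because peer X only asked for cfA of tableA.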
[jira] [Updated] (HBASE-8751) Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster
[ https://issues.apache.org/jira/browse/HBASE-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Honghua updated HBASE-8751: Attachment: HBASE-8751-0.94-V0.patch this patch is based on code from http://svn.apache.org/repos/asf/hbase/branches/0.94 Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster Key: HBASE-8751 URL: https://issues.apache.org/jira/browse/HBASE-8751 Project: HBase Issue Type: Improvement Components: Replication Reporter: Feng Honghua Attachments: HBASE-8751-0.94-V0.patch
[jira] [Created] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp
Feng Honghua created HBASE-8753: --- Summary: Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp Key: HBASE-8753 URL: https://issues.apache.org/jira/browse/HBASE-8753 Project: HBase Issue Type: New Feature Components: Deletes Reporter: Feng Honghua
[jira] [Updated] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp
[ https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Honghua updated HBASE-8753: Attachment: HBASE-8753-0.94-V0.patch patch HBASE-8753-0.94-V0.patch is based on http://svn.apache.org/repos/asf/hbase/branches/0.94 Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp --- Key: HBASE-8753 URL: https://issues.apache.org/jira/browse/HBASE-8753 Project: HBase Issue Type: New Feature Components: Deletes Reporter: Feng Honghua Attachments: HBASE-8753-0.94-V0.patch
[jira] [Commented] (HBASE-8751) Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster
[ https://issues.apache.org/jira/browse/HBASE-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13684933#comment-13684933 ] Feng Honghua commented on HBASE-8751: - Showing up in the table:cf-list doesn't guarantee a CF will be replicated: only CFs with replication-scope=1 in the table:cf-list will be replicated. Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster Key: HBASE-8751 URL: https://issues.apache.org/jira/browse/HBASE-8751 Project: HBase Issue Type: Improvement Components: Replication Reporter: Feng Honghua Attachments: HBASE-8751-0.94-V0.patch
[jira] [Commented] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13681200#comment-13681200 ] Feng Honghua commented on HBASE-8721: - [~stack] 1. I agree that keeping deleted cells forever (even through major compaction) can fix the inconsistency in the scenario I mentioned before. But that treatment makes HFile sizes grow continuously, never shrinking by collecting deleted cells and delete markers; this is an obvious drawback that many users won't want. 2. The behavior that a delete can mask puts that happened after it is unacceptable to many users. When a user puts a KV to HBase, the intention is to ADD that KV, and the user definitely wants to be able to read it back with a Get/Scan, regardless of whether a delete ever occurred. The current behavior is unacceptable for two reasons: a) when a user puts a KV, receives a success response, and then fails to read it back, it is confusing, and it is hard to realize that the cause is a delete someone (possibly the user themselves) wrote earlier; b) if a delete can mask puts that happen after it, then once a delete is written to HBase (until it is collected by a major compaction), it blocks that KV from ever being added back semantically, even though the 'put' operation itself succeeds syntactically. 3. Yes, my fix adjusts the behavior from 'a delete can mask puts that happened after it' to 'a delete can only mask puts that happened before (or at the same point as) it'. With this adjustment, the inconsistency caused by major compaction no longer appears. 4. My fix uses mvcc together with timestamp to determine whether a delete can mask a put. This doesn't break the original delete semantics defined by timestamp alone; it reinforces them with mvcc, which defines the real ordering of the time points at which operations occur (timestamp can't). Why I don't use sequence-id: a) once flushed/compacted to an HFile, the sequence-id no longer accompanies the KV, but mvcc does; if we used seq-id, we couldn't handle KVs in HFiles for this purpose; b) yes, seq-id defines the persistence ordering of KVs and mvcc defines the visibility ordering, and the two can interleave for a pair of KVs, but that doesn't hurt the correctness of the adjusted behavior: when the seq-id advances, the user can't see (read) the KV until its mvcc advances, so visibility occurs after persistence (mvcc is after seq-id). seq-id is a background implementation detail the user isn't aware of, while mvcc affects data visibility and the user is aware of it. fix for bug that delete can mask puts that happened after the delete was entered Key: HBASE-8721 URL: https://issues.apache.org/jira/browse/HBASE-8721 Project: HBase Issue Type: Bug Components: regionserver Reporter: Feng Honghua Attachments: HBASE-8721-0.94-V0.patch
[jira] [Commented] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13681204#comment-13681204 ] Feng Honghua commented on HBASE-8721: - [~sershe] This fix is not about timestamp's role in controlling versions, but about whether the behavior 'a delete can mask puts that happened after it' is reasonable/acceptable. I totally agree with you that timestamp is the only criterion that defines/controls the version semantics, and my fix doesn't break that. You can see what I mean from my comment to Stack above, or by reviewing the patch. Thanks a lot. fix for bug that delete can mask puts that happened after the delete was entered Key: HBASE-8721 URL: https://issues.apache.org/jira/browse/HBASE-8721 Project: HBase Issue Type: Bug Components: regionserver Reporter: Feng Honghua Attachments: HBASE-8721-0.94-V0.patch
[jira] [Commented] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13680323#comment-13680323 ] Feng Honghua commented on HBASE-8721: - [~sershe] If we want to keep the behavior that a delete can mask puts that happened after it, then the only way to fix the inconsistency caused by major compaction is to keep the delete markers forever, as you said. But I think the root cause of the inconsistency is that arguable behavior itself. A more intuitive and reasonable behavior is that a delete can only mask puts that happened before it, and has no impact on puts that happen after it. (This behavior has nothing to do with the separate behavior that timestamp determines which KV survives under the version semantics.) If we choose this adjusted behavior, we can fix the inconsistency with the help of mvcc alone, and still collect the delete markers during major compaction as before (no need to keep them forever). An obvious, even ridiculous, drawback of the current behavior is that an end user can put a KV, get a success response, and yet be unable to read that KV back, just because someone (maybe the user themselves, without realizing it) once issued a delete that masks it. This is really uncanny and weird. Turning back to the scenarios where timestamp is used as another ordinary dimension without time semantics: in those cases we declare max(int) versions, so the timestamp isn't used to control the version count but as an ordinary dimension to locate a cell, and each cell has a single version, so there is no problem. I agree we can introduce a config knob to enable the new behavior.
fix for bug that delete can mask puts that happened after the delete was entered Key: HBASE-8721 URL: https://issues.apache.org/jira/browse/HBASE-8721 Project: HBase Issue Type: Bug Components: regionserver Reporter: Feng Honghua Attachments: HBASE-8721-0.94-V0.patch
[jira] [Commented] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13679339#comment-13679339 ] Feng Honghua commented on HBASE-8721: - [~andrew.purt...@gmail.com] Consider this scenario: first put a KV with timestamp T0, then delete it with timestamp T1 (T1 > T0), and then put that KV again with timestamp T0. Whether a get/scan can now read the KV depends on whether a major compaction occurred between the delete and the second put: if one did, the delete was collected by it, so the second put survives and can be read by a get/scan; if not, both puts are masked by the delete and can't be read. This means that under the current delete handling, data visibility sometimes depends on whether a major compaction occurs during a tricky time frame. That behavior is weird and unacceptable, since major compaction should be transparent to end users and should by no means affect data visibility. The behavior that a later put can be masked by an earlier delete is itself somewhat weird and can confuse end users in many scenarios. Some HBase users in our company complain about this behavior and say it is unacceptable from their end users' viewpoint. I also noticed the same discussion in HBASE-2256. fix for bug that delete can mask puts that happened after the delete was entered Key: HBASE-8721 URL: https://issues.apache.org/jira/browse/HBASE-8721 Project: HBase Issue Type: Bug Components: regionserver Reporter: Feng Honghua Attachments: HBASE-8721-0.94-V0.patch this fix aims for the bug described in http://hbase.apache.org/book.html 5.8.2.1: Deletes mask puts, even puts that happened after the delete was entered.
Remember that a delete writes a tombstone, which only disappears after the next major compaction has run. Suppose you do a delete of everything <= T. After this you do a new put with a timestamp <= T. This put, even if it happened after the delete, will be masked by the delete tombstone. Performing the put will not fail, but when you do a get you will notice the put had no effect. It will start working again after the major compaction has run. These issues should not be a problem if you use always-increasing versions for new puts to a row. But they can occur even if you do not care about time: just do a delete and a put immediately after each other, and there is some chance they happen within the same millisecond. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
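The visibility flip described in this comment can be illustrated with a toy simulation (plain Python, not HBase code; the `Store` class and all names here are invented for illustration). A tombstone at T1 masks any put with timestamp <= T1, regardless of which write arrived first, and a major compaction between the delete and the second put changes what a reader sees:

```python
# Toy model of HBase timestamp-based delete masking (illustrative only,
# not the actual HBase implementation).

class Store:
    def __init__(self):
        self.puts = []        # list of (timestamp, value)
        self.tombstones = []  # list of delete timestamps

    def put(self, ts, value):
        self.puts.append((ts, value))

    def delete(self, ts):
        self.tombstones.append(ts)

    def get(self):
        # A put is visible unless some tombstone has ts >= the put's ts.
        visible = [(ts, v) for ts, v in self.puts
                   if not any(dts >= ts for dts in self.tombstones)]
        return max(visible)[1] if visible else None

    def major_compact(self):
        # Drop masked puts, then collect the tombstones themselves.
        self.puts = [(ts, v) for ts, v in self.puts
                     if not any(dts >= ts for dts in self.tombstones)]
        self.tombstones = []

# Scenario from the comment: put at T0, delete at T1 (> T0), put at T0 again.
s = Store()
s.put(100, "first")    # T0
s.delete(200)          # T1 > T0
s.put(100, "second")   # second put, same T0, issued AFTER the delete
print(s.get())         # None: both puts masked by the tombstone

s2 = Store()
s2.put(100, "first")
s2.delete(200)
s2.major_compact()     # tombstone (and masked put) collected
s2.put(100, "second")
print(s2.get())        # "second": visibility depends on compaction timing
```

The same writes yield different reads purely because of when the major compaction ran, which is the behaviour the comment objects to.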
[jira] [Commented] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13680126#comment-13680126 ] Feng Honghua commented on HBASE-8721: - [~sershe]/[~apurtell] The scenario, refined:
1. put a kv with timestamp T0, and flush;
2. delete that kv with timestamp T1, and flush;
3. a major compaction occurs [or not];
4. put that kv again with timestamp T0;
5. read that kv;
a) if a major compaction occurs at step 3, then step 5 gets the put written at step 4;
b) if no major compaction occurs at step 3, then step 5 gets nothing.
I think this is a BUG. And I also DON'T think the behaviour where a delete masks later puts with the same ts is expected. In some real-world scenarios the timestamp is used not with time semantics but as an ordinary extra dimension of the kv's coordinate: a user puts a kv, deletes it, and some time later puts that kv again, only to find that the write succeeds but the value can't be read back. The current delete behaviour rules out such extended usage of the timestamp dimension. Do we accept this incorrect/buggy behaviour as-is just because it has gone unfixed for so long?
[jira] [Created] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered
Feng Honghua created HBASE-8721: --- Summary: fix for bug that delete can mask puts that happened after the delete was entered Key: HBASE-8721 URL: https://issues.apache.org/jira/browse/HBASE-8721 Project: HBase Issue Type: Bug Components: regionserver Reporter: Feng Honghua
[jira] [Updated] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Honghua updated HBASE-8721: Attachment: HBASE-8721-0.94-V0.patch
[jira] [Commented] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13678994#comment-13678994 ] Feng Honghua commented on HBASE-8721: - This fix uses mvcc together with the timestamp to determine whether a put is masked by a delete, so a put that happens after a delete can't be masked by that delete, even though by timestamp alone it would be (as before this fix). In this scheme mvcc is kept intact unconditionally (not reset to 0) during flush/compact; furthermore, mvcc also needs to be serialized to (and later deserialized from) the HLog for possible replay, and the max mvcc in a recovered split HLog needs to be reflected in the recovered HRegion during its initialization phase.
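The mvcc-plus-timestamp rule sketched in the comment above could look roughly like this (a hypothetical sketch, not the actual patch; in real HBase the comparison lives in the server-side read path, and `masked` is an invented helper). The key idea: a delete masks a put only if the delete's timestamp covers the put AND the delete was written later, as recorded by mvcc:

```python
# Sketch of mvcc-aware delete masking (illustrative only). Each cell carries
# a user timestamp plus an mvcc write number that increases with each write.

def masked(put, delete):
    """put and delete are (timestamp, mvcc) tuples."""
    put_ts, put_mvcc = put
    del_ts, del_mvcc = delete
    # Pre-fix rule: a delete masks any put with ts <= del_ts.
    # mvcc-aware rule: additionally require that the delete was WRITTEN
    # after the put, i.e. del_mvcc > put_mvcc.
    return put_ts <= del_ts and put_mvcc < del_mvcc

first_put  = (100, 1)   # ts = T0, written first
delete     = (200, 2)   # ts = T1 > T0, written second
second_put = (100, 3)   # ts = T0 again, but written AFTER the delete

print(masked(first_put, delete))   # True: delete came after this put
print(masked(second_put, delete))  # False: put was written after the delete
```

With timestamps alone both puts would be masked; the mvcc ordering lets the second put survive, which is exactly the behaviour the patch aims for.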
[jira] [Commented] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered
[ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13678995#comment-13678995 ] Feng Honghua commented on HBASE-8721: - This patch is based on code downloaded from http://svn.apache.org/repos/asf/hbase/branches/0.94, and all unit tests pass in my local environment.
[jira] [Commented] (HBASE-8357) current region server failover mechanism for replication can lead to stale region server whose left hlogs can't be replicated by other region server
[ https://issues.apache.org/jira/browse/HBASE-8357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634718#comment-13634718 ] Feng Honghua commented on HBASE-8357: - OK, I'm using HBase 0.94.3 and came across this issue when reading the replication RS failover code. Very glad it's fixed already. Lars, could you please close it as duplicate or fixed? Thanks.
current region server failover mechanism for replication can lead to stale region server whose left hlogs can't be replicated by other region server
Key: HBASE-8357
URL: https://issues.apache.org/jira/browse/HBASE-8357
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 0.94.3
Reporter: Feng Honghua
Consider this scenario: region servers A/B/C; A dies, and B and C race to lock A's znode in order to replicate A's remaining unreplicated hlogs. B wins and successfully creates the lock under A's znode, but before B copies A's hlog queues to its own znode, B also dies. C then successfully creates the lock under B's znode and replicates B's own remaining hlogs. But A's remaining hlogs can never be replicated by any other RS, since B left a lock under A's znode and never transferred A's hlog queues to its own znode before dying.
[jira] [Created] (HBASE-8357) current region server failover mechanism for replication can lead to stale region server whose left hlogs can't be replicated by other region server
Feng Honghua created HBASE-8357: --- Summary: current region server failover mechanism for replication can lead to stale region server whose left hlogs can't be replicated by other region server Key: HBASE-8357 URL: https://issues.apache.org/jira/browse/HBASE-8357 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.3 Reporter: Feng Honghua
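The race described in this issue can be reproduced in miniature with a dict standing in for the replication znode tree (illustrative only; the paths and the `try_lock` helper are invented, not HBase's real layout):

```python
# Toy reproduction of the HBASE-8357 race: a dead RS's lock node survives
# its holder, stranding the queue it guards.

znodes = {
    "rs/A/hlogs": ["a1", "a2"],  # A's unreplicated hlog queue
    "rs/B/hlogs": ["b1"],
    "rs/C/hlogs": [],
}

def try_lock(rs, dead_rs):
    """Claim a dead RS's queue by creating a lock node; True if we won."""
    lock = "rs/%s/lock" % dead_rs
    if lock in znodes:
        return False
    znodes[lock] = rs
    return True

# A dies; B wins the race to lock A's queue...
assert try_lock("B", "A")
# ...but B dies BEFORE copying rs/A/hlogs under its own znode.

# C handles B's death: it locks B's znode and takes over B's OWN queue.
assert try_lock("C", "B")
znodes["rs/C/hlogs"] += znodes.pop("rs/B/hlogs")

# A's queue is stranded: still present, but locked by the dead B, so no
# surviving RS will ever pick up a1/a2.
print(znodes["rs/A/hlogs"], znodes["rs/A/lock"])
```

The root cause is that "create lock" and "transfer queues" are two separate steps with no cleanup if the locker dies in between; the later fix in HBase moved queue transfer to an atomic ZooKeeper multi-operation.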
[jira] [Commented] (HBASE-7632) fail to create ReplicationSource if ReplicationPeer.startStateTracker checkExists(peerStateNode) and find not exist but fails in createAndWatch due to client/shell is done creating it then, now throws exception and results in addPeer fail
[ https://issues.apache.org/jira/browse/HBASE-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13558511#comment-13558511 ] Feng Honghua commented on HBASE-7632: - Attaching the call stack:
2013-01-10 09:22:03,084 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/sdtst-miliao/replication/peers/34/peer-state already exists and this is not a retry
2013-01-10 09:22:03,084 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Error while adding a new peer
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /hbase/sdtst-miliao/replication/peers/34/peer-state
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:778)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:420)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:402)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndWatch(ZKUtil.java:852)
	at org.apache.hadoop.hbase.replication.ReplicationPeer.startStateTracker(ReplicationPeer.java:82)
	at org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:344)
	at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:307)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$PeersWatcher.nodeChildrenChanged(ReplicationSourceManager.java:511)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:316)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
fail to create ReplicationSource if ReplicationPeer.startStateTracker checkExists(peerStateNode) and find not exist but fails in createAndWatch due to client/shell is done creating it then, now throws exception and results in addPeer fail
Key: HBASE-7632
URL: https://issues.apache.org/jira/browse/HBASE-7632
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 0.94.2, 0.94.3, 0.94.4
Reporter: Feng Honghua
Original Estimate: 48h
Remaining Estimate: 48h
ReplicationSource creation fails when ReplicationPeer.startStateTracker's checkExists(peerStateNode) finds the node absent but createAndWatch then fails because a client/shell has created it in the meantime; the exception propagates and addPeer fails.
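This is a classic check-then-create race: checkExists reports the node absent, but another client creates it before createAndWatch runs. A common remedy is to attempt the create unconditionally and treat NodeExists as success. A toy sketch (not the actual HBase fix; the in-memory "zk" dict and helper names are invented):

```python
# Sketch of tolerating a create race, mirroring the shape of
# ZKUtil.createAndWatch but against a toy in-memory store.

class NodeExistsError(Exception):
    pass

zk = {}

def create(path, data):
    """Stand-in for zk.create(): fails if the node already exists."""
    if path in zk:
        raise NodeExistsError(path)
    zk[path] = data

def create_or_accept(path, data):
    """checkExists-then-create is racy; instead, try the create and treat
    a concurrent NodeExists as success (the other client won the race)."""
    try:
        create(path, data)
        return "created"
    except NodeExistsError:
        return "already-exists"  # proceed to read the node / set a watch

print(create_or_accept("/peers/34/peer-state", b"ENABLED"))  # created
print(create_or_accept("/peers/34/peer-state", b"ENABLED"))  # already-exists
```

Either outcome leaves the node in place, so startStateTracker could continue instead of aborting addPeer.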