[jira] [Commented] (HBASE-9480) Regions are unexpectedly made offline in certain failure conditions

2013-09-16 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768293#comment-13768293
 ] 

Feng Honghua commented on HBASE-9480:
-

bq. (Cc'ing Feng Honghua since he expressed interest in this area too as is 
Jimmy Xiang of course)
Yes. It seems the current master/zk/RS communication pattern (RS updates a zk 
node, master watches for changes to that zk node), together with the 
asynchronous and 'one-time' nature of zk watches, results in too many corner 
cases for the assignment manager (and region split). I'm drafting a proposal 
for a new master/zk/RS communication pattern. The main theme: master sends a 
request to the RS, the RS reports progress back to the master, and the master 
persists the request progress in another system table (like the meta table); 
zk is avoided here for better throughput/performance on huge tables with large 
numbers of regions... [~stack] / [~jxiang]

 Regions are unexpectedly made offline in certain failure conditions
 ---

 Key: HBASE-9480
 URL: https://issues.apache.org/jira/browse/HBASE-9480
 Project: HBase
  Issue Type: Bug
Reporter: Devaraj Das
Assignee: Jimmy Xiang
 Fix For: 0.98.0, 0.96.0

 Attachments: 9480-1.txt, trunk-9480.patch, trunk-9480_v1.1.patch, 
 trunk-9480_v1.2.patch, trunk-9480_v2.patch


 Came across this issue (HBASE-9338 test):
 1. Client issues a request to move a region from ServerA to ServerB.
 2. ServerA is compacting that region and doesn't close the region 
 immediately; in fact, it takes a while to complete the request.
 3. The master, in the meantime, sends another close request.
 4. ServerA responds with a NotServingRegionException.
 5. The master handles the exception, deletes the znode, and invokes 
 regionOffline for the said region.
 6. ServerA fails to operate on ZK in the CloseRegionHandler since the node is 
 deleted.
 The region is permanently offline.
 There are potentially other situations where, when a RegionServer is offline 
 and the client asks for a region to be moved off that server, the master 
 makes the region offline.



[jira] [Commented] (HBASE-9466) Read-only mode

2013-09-16 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768946#comment-13768946
 ] 

Feng Honghua commented on HBASE-9466:
-

[~stack] / [~jdcryans] : OK, I'll build a generic framework first and then 
implement read-only mode based on it.

 Read-only mode
 --

 Key: HBASE-9466
 URL: https://issues.apache.org/jira/browse/HBASE-9466
 Project: HBase
  Issue Type: New Feature
Reporter: Feng Honghua
Priority: Minor

 Can we provide a read-only mode for a table? Writes to a table in read-only 
 mode would be rejected, but read-only mode differs from disable in that:
 1. it doesn't offline the regions of the table (hence it is much more 
 lightweight than disable)
 2. it can still serve read requests
 Comments?
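
A minimal sketch of the server-side guard this suggests: mutations are 
rejected while a flag is set, reads are untouched, and regions stay online. 
The class and wiring below are illustrative assumptions, not an existing 
HBase API:

{code}
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: a per-table write gate consulted on the mutation path only.
// Reads bypass the gate, and no region is ever offlined.
public class ReadOnlyGate {
  private final AtomicBoolean readOnly = new AtomicBoolean(false);

  public void setReadOnly(boolean value) {
    readOnly.set(value);
  }

  // Called for put/delete/increment/append before applying the mutation.
  public void checkWritable(String table) throws IOException {
    if (readOnly.get()) {
      throw new IOException("Table " + table + " is in read-only mode");
    }
  }
}
{code}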



[jira] [Commented] (HBASE-8751) Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster

2013-09-16 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768963#comment-13768963
 ] 

Feng Honghua commented on HBASE-8751:
-

[~jdcryans]

Thanks for the thorough code review, but the following is not true:

bq. This is in ReplicationSource.removeNonReplicableEdits() and that method is 
called for each HLog.Entry, which means that you'd hit ZK from all the region 
servers for as many write calls as they are getting. That seems excessive.

==> zkHelper.getTableCFs(peerId) delegates to ReplicationPeer.getTableCFs, and 
ReplicationPeer maintains the current table/CF config in its tableCFs field, 
returning it on each getTableCFs call. ReplicationPeer also has a 
tableCFTracker watching the tableCF zk node, which updates the tableCFs field 
whenever the node is changed (by the user via the shell). This process is 
similar to the treatment of the peer state (enable/disable).
So the tableCF zk node will be accessed as many times as it is updated, not as 
many times as ReplicationSource.removeNonReplicableEdits() is called (once per 
HLog.Entry)
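
For illustration, a minimal sketch of the caching pattern described above, 
using the raw ZooKeeper API: a tracker re-reads the zk node only when a watch 
fires, so the hot path never touches ZooKeeper. The class and method names 
(NodeCache, refresh) are hypothetical stand-ins, not the actual 
ReplicationPeer/tableCFTracker code:

{code}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical illustration: cache a zk node's data locally and refresh it
// from ZooKeeper only when a watch fires, instead of reading zk per call.
public class NodeCache implements Watcher {
  private final ZooKeeper zk;
  private final String path;
  private volatile byte[] data;   // cached copy served to readers

  public NodeCache(ZooKeeper zk, String path) throws Exception {
    this.zk = zk;
    this.path = path;
    refresh();                    // initial read also sets the watch
  }

  // Hot path: called per edit/entry, never touches ZooKeeper.
  public byte[] get() {
    return data;
  }

  @Override
  public void process(WatchedEvent event) {
    // zk watches are one-time: refresh re-reads the node and re-arms them.
    if (event.getType() == Event.EventType.NodeDataChanged) {
      try {
        refresh();
      } catch (Exception e) {
        // in real code: log and schedule a retry
      }
    }
  }

  private void refresh() throws Exception {
    data = zk.getData(path, this, null); // passing a Watcher re-arms the watch
  }
}
{code}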

 Enable peer cluster to choose/change the ColumnFamilies/Tables it really want 
 to replicate from a source cluster
 

 Key: HBASE-8751
 URL: https://issues.apache.org/jira/browse/HBASE-8751
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Reporter: Feng Honghua
 Attachments: HBASE-8751-0.94-V0.patch


 Consider the following scenario (all CFs have replication-scope=1):
 1) cluster S has 3 tables: table A has cfA,cfB; table B has cfX,cfY; table C 
 has cf1,cf2.
 2) cluster X wants to replicate table A : cfA, table B : cfX, and table C 
 from cluster S.
 3) cluster Y wants to replicate table B : cfY and table C : cf2 from cluster 
 S.
 The current replication implementation can't achieve this, since it pushes 
 the data of all the replicable column-families from cluster S to all its 
 peers, X/Y in this scenario.
 This improvement provides a fine-grained replication scheme which enables a 
 peer cluster to choose the column-families/tables it really wants from the 
 source cluster:
 A). Set the table:cf-list for a peer when addPeer:
   hbase-shell add_peer '3', zk:1100:/hbase, table1; table2:cf1,cf2; 
 table3:cf2
 B). View the table:cf-list config for a peer using show_peer_tableCFs:
   hbase-shell show_peer_tableCFs 1
 C). Change/set the table:cf-list for a peer using set_peer_tableCFs:
   hbase-shell set_peer_tableCFs '2', table1:cfX; table2:cf1; table3:cf1,cf2
 In this scheme, replication-scope=1 only means a column-family CAN be 
 replicated to other clusters; the 'table:cf-list' alone determines WHICH 
 cf/table will actually be replicated to a specific peer.
 For backward compatibility, an empty 'table:cf-list' will replicate all 
 replicable cf/table. (This means we don't allow a peer which replicates 
 nothing from a source cluster; we think that's reasonable: if replicating 
 nothing, why bother adding a peer?)
 This improvement addresses the exact problem raised by the first FAQ in 
 http://hbase.apache.org/replication.html:
   GLOBAL means replicate? Any provision to replicate only to cluster X and 
 not to cluster Y? or is that for later?
   Yes, this is for much later.
 I also noticed somebody mentioned that replication-scope being an integer 
 rather than a boolean was intended for such fine-grained replication, but I 
 think extending replication-scope can't achieve the same replication 
 granularity and flexibility as the per-peer replication configuration above.
 This improvement has been running smoothly in our production clusters 
 (Xiaomi) for several months.
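
As a rough illustration of the per-peer filtering this proposes, a sketch of 
filtering edits against a per-peer table-to-CF map; the types and layout 
(TableCfFilter, a null list meaning "all CFs of that table") are simplified 
assumptions, not the patch's actual code:

{code}
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-peer table:cf filtering for replication.
// tableCFs: table name -> list of CFs to replicate; a null list means
// "replicate every CF of this table"; an empty map means "replicate all".
public class TableCfFilter {
  private final Map<String, List<String>> tableCFs;

  public TableCfFilter(Map<String, List<String>> tableCFs) {
    this.tableCFs = tableCFs;
  }

  public boolean shouldReplicate(String table, String family) {
    if (tableCFs.isEmpty()) {
      return true;                      // back-compat: empty config = all
    }
    if (!tableCFs.containsKey(table)) {
      return false;                     // table not chosen by this peer
    }
    List<String> cfs = tableCFs.get(table);
    return cfs == null || cfs.contains(family);
  }
}
{code}

For the scenario above, peer X's map would be {A=[cfA], B=[cfX], C=null} and 
peer Y's would be {B=[cfY], C=[cf2]}.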



[jira] [Updated] (HBASE-9466) Read-only mode

2013-09-16 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-9466:


Assignee: Feng Honghua



[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region

2013-09-16 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769138#comment-13769138
 ] 

Feng Honghua commented on HBASE-9467:
-

[~stack] Thanks for the effort

 write can be totally blocked temporarily by a write-heavy region
 

 Key: HBASE-9467
 URL: https://issues.apache.org/jira/browse/HBASE-9467
 Project: HBase
  Issue Type: Improvement
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: HBASE-9467-trunk-v0.patch, HBASE-9467-trunk-v1.patch, 
 HBASE-9467-trunk-v1.patch, HBASE-9467-trunk-v1.patch


 Writes to a region can be blocked temporarily if the memstore of that region 
 reaches the threshold (hbase.hregion.memstore.block.multiplier * 
 hbase.hregion.flush.size), until the memstore of that region is flushed.
 For a write-heavy region, if its write requests saturate all the handler 
 threads of that RS when write blocking for that region occurs, requests for 
 other regions/tables on that RS also can't be served due to no available 
 handler threads...until the pending writes of that write-heavy region are 
 served after the flush is done. Hence during this time period, from the RS 
 perspective, it can't serve any request for any table/region, just due to a 
 single write-heavy region.
 This doesn't sound very reasonable, right? Maybe write requests for a region 
 should only be served by a subset of the handler threads, so that write 
 blocking on any single region can't lead to the scenario mentioned above?
 Comment?
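
For reference, a minimal sketch of how the blocking threshold described above 
is derived from the two configuration values; the config keys are taken as 
named in the description, and the defaults shown are assumptions for 
illustration:

{code}
import org.apache.hadoop.conf.Configuration;

// Sketch: compute the per-region memstore blocking threshold from config.
// Writes to the region are blocked/rejected once the memstore exceeds this.
public class BlockingThreshold {
  public static long blockingMemStoreSize(Configuration conf) {
    // config keys as named in the issue description
    long flushSize = conf.getLong("hbase.hregion.flush.size",
        128L * 1024 * 1024);                          // assumed default
    long multiplier =
        conf.getLong("hbase.hregion.memstore.block.multiplier", 2);
    return flushSize * multiplier;                    // e.g. 256 MB here
  }
}
{code}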



[jira] [Commented] (HBASE-8751) Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster

2013-09-16 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769173#comment-13769173
 ] 

Feng Honghua commented on HBASE-8751:
-

[~jdcryans]
bq.Normally in the HBase code when you append str or string to a method 
name, it just means that it does the same thing but returns a String instead. 
You'll have to review naming.
==> What about 'getTableCFsConfig' for getTableCFsStr?

bq.Digging down to ReplicationPeer.tableCFs, shouldn't this be at least be made 
a volatile? It seems you could have some inconsistencies.
==> The ReplicationPeer.tableCFs map can only be updated by the (same, single) 
zookeeper event thread (ZooKeeperWatcher) of ReplicationPeer, and 
ReplicationSource calls zkHelper.getTableCFs(peerId) for each hlog entry. So 
it seems declaring it volatile is not a must?
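
For context, a minimal illustration of the visibility question under 
discussion: a field written by one thread and read by others, where a 
volatile reference guarantees readers see the most recently swapped-in map. 
This is a generic Java memory-model sketch, not the HBase code:

{code}
import java.util.Collections;
import java.util.Map;

// Generic illustration of publishing a config map from a watcher thread
// to reader threads. The volatile reference guarantees readers observe
// the most recently swapped-in (immutable) map.
public class ConfigHolder {
  private volatile Map<String, String> config = Collections.emptyMap();

  // Called by the single zk event thread when the node changes.
  public void update(Map<String, String> fresh) {
    config = Collections.unmodifiableMap(fresh); // swap whole immutable map
  }

  // Called by replication source threads on their hot path.
  public Map<String, String> get() {
    return config;
  }
}
{code}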



[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region

2013-09-15 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768045#comment-13768045
 ] 

Feng Honghua commented on HBASE-9467:
-

[~nkeywal]
bq. I would propose to reuse RegionTooBusyException. It's seems too similar to 
RegionOverloadedException
Agreed. I'll make a new patch accordingly.



[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region

2013-09-15 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768046#comment-13768046
 ] 

Feng Honghua commented on HBASE-9467:
-

[~tlipcon]
bq. is this a compatible change?
Thanks for the reminder. Per [~nkeywal]'s suggestion, there is no 
compatibility issue if we reuse RegionTooBusyException here, right?



[jira] [Updated] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region

2013-09-13 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-9467:


Attachment: HBASE-9467-trunk-v0.patch

patch for trunk



[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region

2013-09-13 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766263#comment-13766263
 ] 

Feng Honghua commented on HBASE-9467:
-

Changes and explanation of the patch:

1. Throw RegionOverloadedException immediately rather than wait/retry within 
HRegion when the target region is above the memstore limit; this avoids write 
requests on a region above the memstore limit occupying/saturating handler 
threads. This change is in the HRegion.checkResources method.

2. Reuse the exception handling and retry mechanism of AsyncProcess in the 
client to handle RegionOverloadedException thrown from the RS. Since 
RegionOverloadedException is not a DoNotRetryIOException, AsyncProcess handles 
it the same way as any other non-DoNotRetryIOException thrown from the RS, and 
the corresponding request is retried using incremental backoff.
   In a more general sense, we can view RegionOverloadedException as another 
kind of retriable exception and reuse all the current handling for it in 
AsyncProcess/the client, so no client-side code change is needed. And if we 
really want to use exponential backoff rather than incremental backoff for 
RegionOverloadedException, as Todd suggested, we can change the code in 
AsyncProcess accordingly.

3. We also need to check the memstore limit and throw 
RegionOverloadedException for 'increment' and 'append' operations, since they 
also insert KVs into the memstore and increase its size. (checkResources was 
previously not called for these two operations in HRegion; corrected here.)

4. In the unit test TestHFileArchiving, RegionOverloadedException is thrown 
during loadRegion, and since the 'put' operations are invoked directly on 
HRegion, not via the client/AsyncProcess, a similar 'catch-and-wait' handling 
is added there so the test proceeds without failure.

[~nkeywal] / [~stack] / [~tlipcon] : Any feedback on the patch? Thanks in 
advance.
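
A minimal sketch of the fail-fast idea in point 1: reject instead of blocking 
when the region is over its memstore limit. The names below (RegionWriteGate, 
memstoreSize, blockingMemStoreSize, requestFlush) are stand-ins for the real 
HRegion internals, and the exception is a local stand-in, not the actual 
HBase class:

{code}
import java.util.concurrent.atomic.AtomicLong;

// Sketch: fail fast instead of parking the handler thread when a region's
// memstore is above its blocking threshold.
public class RegionWriteGate {
  private final AtomicLong memstoreSize = new AtomicLong();
  private final long blockingMemStoreSize;

  public RegionWriteGate(long blockingMemStoreSize) {
    this.blockingMemStoreSize = blockingMemStoreSize;
  }

  // Called at the start of put/increment/append instead of blocking.
  void checkResources(String regionName) throws RegionOverloadedException {
    if (memstoreSize.get() > blockingMemStoreSize) {
      requestFlush();  // still ask for a flush so the region drains
      throw new RegionOverloadedException(  // retriable: client backs off
          "Above memstore limit, region=" + regionName);
    }
  }

  private void requestFlush() { /* queue a flush request */ }

  // Stand-in for the retriable exception discussed in this issue.
  static class RegionOverloadedException extends java.io.IOException {
    RegionOverloadedException(String msg) { super(msg); }
  }
}
{code}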



[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region

2013-09-11 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13764082#comment-13764082
 ] 

Feng Honghua commented on HBASE-9467:
-

I like [~tlipcon]'s idea that we reject writes with RegionOverloadedException 
rather than blocking them. This treatment also avoids the unfortunate scenario 
where the RS eventually finishes the write after the memstore flush is done, 
but the client gets a timeout response because the flush took too long.

But a client receiving such an exception can only back off for writes with the 
same rowKey as the one that got the exception; it can't prevent writes with 
different rowKeys belonging to the same region from hitting the RS and getting 
RegionOverloadedException as well (considering a client is typically unaware 
of the region key range when writing).



[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region

2013-09-11 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13764133#comment-13764133
 ] 

Feng Honghua commented on HBASE-9467:
-

[~liochon] maybe I used the wrong term here; by 'client' I meant code calling 
HTable.put() etc. to write data to HBase. Sure, the HBase client knows the key 
range of a region.

I can write the patch per [~tlipcon]'s solution, if there is no objection. :)



[jira] [Updated] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region

2013-09-11 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-9467:


Assignee: Feng Honghua



[jira] [Created] (HBASE-9500) RpcServer can't be restarted after stop for reuse

2013-09-11 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-9500:
---

 Summary: RpcServer can't be restarted after stop for reuse
 Key: HBASE-9500
 URL: https://issues.apache.org/jira/browse/HBASE-9500
 Project: HBase
  Issue Type: Improvement
Reporter: Feng Honghua
Priority: Minor


Currently RpcServer is designed/implemented without an interface/capability 
for the user to restart a stopped RpcServer.



[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region

2013-09-11 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13764146#comment-13764146
 ] 

Feng Honghua commented on HBASE-9467:
-

Thanks [~dnicolas] for the reminder, I'll take these into account.



[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region

2013-09-11 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13764150#comment-13764150
 ] 

Feng Honghua commented on HBASE-9467:
-

Correcting a typo in the above comment: thanks [~liochon]

btw: is there no way to delete/edit a submitted comment? Sounds inconvenient :-(



[jira] [Updated] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region

2013-09-11 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-9467:


Priority: Major  (was: Minor)



[jira] [Created] (HBASE-9501) No throttling for replication

2013-09-11 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-9501:
---

 Summary: No throttling for replication
 Key: HBASE-9501
 URL: https://issues.apache.org/jira/browse/HBASE-9501
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Reporter: Feng Honghua


When we disable a peer for a period of time and then enable it, the 
ReplicationSource in the master cluster will push the hlog entries accumulated 
during the disabled interval to the re-enabled peer cluster at full speed.

If the bandwidth between the two clusters is shared by different applications, 
pushing at full speed for replication can consume all the bandwidth and 
severely affect the other applications.

Though there are two configs, replication.source.size.capacity and 
replication.source.nb.capacity, to tweak the batch size each push delivers, 
decreasing them only increases the number of pushes, and all these pushes 
proceed continuously without pause, so they are of no obvious help for 
bandwidth throttling.

From a bandwidth-sharing and push-speed perspective, it's more reasonable to 
provide an upper bandwidth limit for each peer's push channel; within that 
limit, the peer can choose a big batch size for each push for bandwidth 
efficiency.

Any opinion?
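
The per-peer bandwidth cap proposed here is essentially a rate limiter on the 
push channel. A minimal sketch, with names and structure that are illustrative 
assumptions rather than the eventual implementation:

{code}
// Sketch: cap a replication push channel's bandwidth. Before each batch
// push, wait until the current one-second window has budget for it.
public class PushThrottler {
  private final long bytesPerSecond;   // per-peer upper limit
  private long cycleStart = System.currentTimeMillis();
  private long bytesInCycle = 0;

  public PushThrottler(long bytesPerSecond) {
    this.bytesPerSecond = bytesPerSecond;
  }

  // Call before shipping a batch of hlog entries of the given size.
  public synchronized void acquire(long batchBytes)
      throws InterruptedException {
    long now = System.currentTimeMillis();
    if (now - cycleStart >= 1000) {    // new one-second accounting window
      cycleStart = now;
      bytesInCycle = 0;
    }
    if (bytesInCycle + batchBytes > bytesPerSecond) {
      Thread.sleep(1000 - (now - cycleStart)); // wait for the next window
      cycleStart = System.currentTimeMillis();
      bytesInCycle = 0;
    }
    bytesInCycle += batchBytes;
  }
}
{code}

Within the cap, a peer can still use a big batch size per push, since the 
throttle bounds the average rate rather than the batch size.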



[jira] [Updated] (HBASE-9500) RpcServer can't be restarted after stop for reuse

2013-09-11 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-9500:


Component/s: IPC/RPC



[jira] [Updated] (HBASE-9469) Synchronous replication

2013-09-11 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-9469:


Assignee: Feng Honghua

 Synchronous replication
 ---

 Key: HBASE-9469
 URL: https://issues.apache.org/jira/browse/HBASE-9469
 Project: HBase
  Issue Type: New Feature
Reporter: Feng Honghua
Assignee: Feng Honghua

 Scenario: 
 A/B clusters with master-master replication: the client writes to cluster A, 
 A pushes all writes to cluster B, and when cluster A is down, the client 
 switches to writing to cluster B.
 But the client's write switch is unsafe because the replication between A/B 
 is asynchronous: a delete sent to cluster B to remove a put written earlier 
 can fail because that put was written to cluster A and wasn't successfully 
 pushed to B before A went down. It gets worse if this delete is collected (a 
 flush and then a major compaction occur) before cluster A comes back up and 
 the put is eventually pushed to B: the put will never be deleted.
 Can we provide per-table/per-peer synchronous replication, which ships the 
 corresponding hlog entry of a write before responding success to the client? 
 This way we can guarantee to the client that every write request for which it 
 got a success response from cluster A is already in cluster B as well.
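
As a rough sketch of the proposed write path, under the assumption that the 
RS ships the wal entry synchronously before acknowledging the client; all 
types below (PeerShipper, SyncReplicatingWriter) are invented for 
illustration, not a real HBase API:

{code}
import java.io.IOException;

// Sketch of a per-table/per-peer synchronous write path: the RS only
// acknowledges the client once the wal entry has reached the peer.
interface PeerShipper {
  void ship(byte[] walEntry) throws IOException; // blocks until peer has it
}

class SyncReplicatingWriter {
  private final PeerShipper peer;

  SyncReplicatingWriter(PeerShipper peer) { this.peer = peer; }

  void write(byte[] walEntry) throws IOException {
    appendToLocalWal(walEntry); // normal durability on cluster A
    peer.ship(walEntry);        // synchronous: push before acking
    // Only now return success to the client; a success response
    // therefore implies the entry is in cluster B as well.
  }

  private void appendToLocalWal(byte[] entry) { /* local WAL append */ }
}
{code}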



[jira] [Updated] (HBASE-9469) Synchronous replication

2013-09-11 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-9469:


Priority: Major  (was: Minor)



[jira] [Commented] (HBASE-8751) Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster

2013-09-10 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762885#comment-13762885
 ] 

Feng Honghua commented on HBASE-8751:
-

[~jdcryans] Would you please help review this patch? Thanks.



[jira] [Commented] (HBASE-9468) Previous active master can still serves RPC request when it is trying recovering expired zk session

2013-09-10 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763937#comment-13763937
 ] 

Feng Honghua commented on HBASE-9468:
-

[~stack] RpcServer currently can't be turned back on after being taken down. 
We provide a config to determine whether to fail fast for an expired master, 
defaulting to false, which can be used for a small cluster without a backup 
master. Opinion?

 Previous active master can still serves RPC request when it is trying 
 recovering expired zk session
 ---

 Key: HBASE-9468
 URL: https://issues.apache.org/jira/browse/HBASE-9468
 Project: HBase
  Issue Type: Bug
Reporter: Feng Honghua

 When the active master's zk session expires, it tries to recover the zk 
 session, but without turning off its RpcServer. What if a previous backup 
 master has already become the now active master, and some client sends 
 requests to this expired master using cached master info? Any problem here?



[jira] [Updated] (HBASE-9468) Previous active master can still serves RPC request when it is trying recovering expired zk session

2013-09-10 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-9468:


Attachment: HBASE-9468-trunk-v0.patch

patch for trunk



[jira] [Updated] (HBASE-9468) Previous active master can still serves RPC request when it is trying recovering expired zk session

2013-09-10 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-9468:


Assignee: Feng Honghua



[jira] [Created] (HBASE-9464) master failure during region-move can result in the region moved to a different RS rather than the destination one user specified

2013-09-09 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-9464:
---

 Summary: master failure during region-move can result in the 
region moved to a different RS rather than the destination one user specified
 Key: HBASE-9464
 URL: https://issues.apache.org/jira/browse/HBASE-9464
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: Feng Honghua
Priority: Minor


1. user issues a region move, specifying a destination RS
2. master finishes offlining the region
3. master fails before assigning it to the specified destination RS
4. the new master assigns the region to a random RS, since it doesn't have the 
destination RS info



[jira] [Created] (HBASE-9465) HLog entries are not pushed to peer clusters serially when region-move or RS failure in master cluster

2013-09-09 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-9465:
---

 Summary: HLog entries are not pushed to peer clusters serially 
when region-move or RS failure in master cluster
 Key: HBASE-9465
 URL: https://issues.apache.org/jira/browse/HBASE-9465
 Project: HBase
  Issue Type: Bug
  Components: regionserver, Replication
Reporter: Feng Honghua


When a region move or RS failure occurs in the master cluster, the hlog 
entries not yet pushed before the region move or RS failure will be pushed by 
the original RS (for a region move) or by another RS that takes over the 
remaining hlog of the dead RS (for an RS failure), while the new entries for 
the same region(s) will be pushed by the RS which now serves the region(s); 
but they push the hlog entries of the same region concurrently, without 
coordination.

This treatment can possibly lead to data inconsistency between master and peer 
clusters:
1. a put and then a delete are written to the master cluster
2. due to the region move / RS failure, they are pushed by different 
replication-source threads to the peer cluster
3. if the delete is pushed to the peer cluster before the put, and a flush and 
major compaction occur in the peer cluster before the put arrives, the delete 
is collected while the put remains in the peer cluster

In this scenario the put remains in the peer cluster, but in the master 
cluster the put is masked by the delete; hence data inconsistency between the 
master and peer clusters.



[jira] [Created] (HBASE-9466) Read-only mode

2013-09-09 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-9466:
---

 Summary: Read-only mode
 Key: HBASE-9466
 URL: https://issues.apache.org/jira/browse/HBASE-9466
 Project: HBase
  Issue Type: New Feature
Reporter: Feng Honghua
Priority: Minor




[jira] [Created] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region

2013-09-09 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-9467:
---

 Summary: write can be totally blocked temporarily by a write-heavy 
region
 Key: HBASE-9467
 URL: https://issues.apache.org/jira/browse/HBASE-9467
 Project: HBase
  Issue Type: Improvement
Reporter: Feng Honghua
Priority: Minor




[jira] [Created] (HBASE-9468) Previous active master can still serves RPC request when it is trying recovering expired zk session

2013-09-09 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-9468:
---

 Summary: Previous active master can still serves RPC request when 
it is trying recovering expired zk session
 Key: HBASE-9468
 URL: https://issues.apache.org/jira/browse/HBASE-9468
 Project: HBase
  Issue Type: Bug
Reporter: Feng Honghua




[jira] [Created] (HBASE-9469) Synchronous replication

2013-09-09 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-9469:
---

 Summary: Synchronous replication
 Key: HBASE-9469
 URL: https://issues.apache.org/jira/browse/HBASE-9469
 Project: HBase
  Issue Type: New Feature
Reporter: Feng Honghua
Priority: Minor




[jira] [Commented] (HBASE-9464) master failure during region-move can result in the region moved to a different RS rather than the destination one user specified

2013-09-09 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762642#comment-13762642
 ] 

Feng Honghua commented on HBASE-9464:
-

From the perspective of the user who issues the region-move request, the 
region is moved to an RS different from the one he specified, even though the 
destination RS he specified is healthy.

The root cause is that the RegionPlan containing the destination RS info is 
kept only in the master's memory, without persistence, so the new active 
master doesn't know this info when taking over the active master role.



[jira] [Commented] (HBASE-9465) HLog entries are not pushed to peer clusters serially when region-move or RS failure in master cluster

2013-09-09 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762658#comment-13762658
 ] 

Feng Honghua commented on HBASE-9465:
-

bq. I have a draft for a new piece of documentation that we could add to the 
ref guide that I should probably contribute
[~jdcryans] Where can I read this documentation? Thanks.

 HLog entries are not pushed to peer clusters serially when region-move or RS 
 failure in master cluster
 --

 Key: HBASE-9465
 URL: https://issues.apache.org/jira/browse/HBASE-9465
 Project: HBase
  Issue Type: Bug
  Components: regionserver, Replication
Reporter: Feng Honghua

 When a region-move or RS failure occurs in the master cluster, the hlog 
 entries that were not pushed before the region-move or RS failure will be 
 pushed by the original RS (for region move) or by another RS which takes 
 over the remaining hlogs of the dead RS (for RS failure), while the new 
 entries for the same region(s) will be pushed by the RS which now serves 
 the region(s); but these push the hlog entries of the same region 
 concurrently, without coordination.
 This can lead to data inconsistency between the master and peer clusters:
 1. a put and then a delete are written to the master cluster
 2. due to region-move / RS-failure, they are pushed by different 
 replication-source threads to the peer cluster
 3. if the delete is pushed to the peer cluster before the put, and a flush 
 and major-compact occur in the peer cluster before the put arrives, the 
 delete is collected while the put remains
 In this scenario, the put remains in the peer cluster, but in the master 
 cluster the put is masked by the delete, hence data inconsistency between 
 the master and peer clusters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-9465) HLog entries are not pushed to peer clusters serially when region-move or RS failure in master cluster

2013-09-09 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762665#comment-13762665
 ] 

Feng Honghua commented on HBASE-9465:
-

[~lhofhansl]

For the RS failure scenario, can we delay the assignment of the recovered 
regions until all the remaining hlog files of the failed RS are pushed to the 
peer clusters (the hlog split can run in parallel with the hlog push, though)? 
This way we can maintain the (global) serial push of a region's hlog entries 
even in the face of RS failure.

But for region-move it's harder to maintain a global serial push, since it's 
hard to determine that all the hlog entries of a given region have been pushed 
to the peer clusters while the hosting RS is healthy and continuously 
receiving write requests.

 HLog entries are not pushed to peer clusters serially when region-move or RS 
 failure in master cluster
 --

 Key: HBASE-9465
 URL: https://issues.apache.org/jira/browse/HBASE-9465
 Project: HBase
  Issue Type: Bug
  Components: regionserver, Replication
Reporter: Feng Honghua

 When a region-move or RS failure occurs in the master cluster, the hlog 
 entries that were not pushed before the region-move or RS failure will be 
 pushed by the original RS (for region move) or by another RS which takes 
 over the remaining hlogs of the dead RS (for RS failure), while the new 
 entries for the same region(s) will be pushed by the RS which now serves 
 the region(s); but these push the hlog entries of the same region 
 concurrently, without coordination.
 This can lead to data inconsistency between the master and peer clusters:
 1. a put and then a delete are written to the master cluster
 2. due to region-move / RS-failure, they are pushed by different 
 replication-source threads to the peer cluster
 3. if the delete is pushed to the peer cluster before the put, and a flush 
 and major-compact occur in the peer cluster before the put arrives, the 
 delete is collected while the put remains
 In this scenario, the put remains in the peer cluster, but in the master 
 cluster the put is masked by the delete, hence data inconsistency between 
 the master and peer clusters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-9466) Read-only mode

2013-09-09 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762675#comment-13762675
 ] 

Feng Honghua commented on HBASE-9466:
-

[~jdcryans] Yes, I'm proposing a 'read-only mode' (which can be per-table or 
per-cluster) rather than 'disabling the table'; the latter is pretty 
heavyweight in that it needs to offline all regions of the given table.
If we just want to temporarily disable updates to a table or to the whole 
cluster, and later want to enable updates again, disabling a table or all 
tables of the cluster seems quite a heavy choice.

Thanks for pointing me to the read-only interface of HTableDescriptor in 
trunk, but I don't see any code using it. How is this RO flag expected to work?
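
For reference, a minimal sketch of flipping that flag through the existing 
descriptor API (the table name is made up; as noted above, nothing appears to 
enforce the flag yet):

  // HTableDescriptor already carries a read-only flag in trunk
  Configuration conf = HBaseConfiguration.create();
  HBaseAdmin admin = new HBaseAdmin(conf);
  HTableDescriptor htd = admin.getTableDescriptor(Bytes.toBytes("my_table"));
  htd.setReadOnly(true);
  admin.modifyTable(Bytes.toBytes("my_table"), htd);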

 Read-only mode
 --

 Key: HBASE-9466
 URL: https://issues.apache.org/jira/browse/HBASE-9466
 Project: HBase
  Issue Type: New Feature
Reporter: Feng Honghua
Priority: Minor

 Can we provide a read-only mode for a table? Writes to a table in read-only 
 mode will be rejected, but read-only mode differs from disable in that:
 1. it doesn't offline the regions of the table (hence it is much more 
 lightweight than disable)
 2. it can still serve read requests
 Comments?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-9467) write can be totally blocked temporarily by a write-heavy region

2013-09-09 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762680#comment-13762680
 ] 

Feng Honghua commented on HBASE-9467:
-

[~nkeywal] Can we provide a percentage config that says how big a sub-set of 
the handler threads any single region's requests may use? For any region we 
can hash its region name to a deterministic start index into the handler 
thread array, and the percentage config, together with the total handler 
thread count, determines the size of the sub-array of handler threads serving 
that region's requests. This way any region can at worst saturate only its own 
sub-set of handler threads instead of all of them, which somewhat mitigates 
the symptom; a minimal sketch follows below.
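
A minimal sketch of the hashing idea, assuming a flat array of handler threads 
(the config name and all identifiers are illustrative, not an existing API):

  // illustrative helper: pick the handler index for the i-th request of a
  // region; "share" would come from a hypothetical config such as
  // "hbase.regionserver.handler.region.share"
  static int handlerFor(String regionName, int i, int handlerCount,
      double share) {
    int subsetSize = Math.max(1, (int) (handlerCount * share));
    // deterministic start index derived from the region name
    int start = (regionName.hashCode() & Integer.MAX_VALUE) % handlerCount;
    // a region's requests stay inside its own sub-array, so a blocked
    // region can saturate at most subsetSize handlers
    return (start + (i % subsetSize)) % handlerCount;
  }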

 write can be totally blocked temporarily by a write-heavy region
 

 Key: HBASE-9467
 URL: https://issues.apache.org/jira/browse/HBASE-9467
 Project: HBase
  Issue Type: Improvement
Reporter: Feng Honghua
Priority: Minor

 Writes to a region can be blocked temporarily if the memstore of that region 
 reaches the threshold (hbase.hregion.memstore.block.multiplier * 
 hbase.hregion.memstore.flush.size), until the memstore of that region is 
 flushed.
 For a write-heavy region, if its write requests saturate all the handler 
 threads of the RS when write blocking for that region kicks in, requests for 
 other regions/tables on that RS can't be served either, due to there being 
 no available handler threads... until the pending writes of that write-heavy 
 region are served after the flush is done. Hence during this time period, 
 from the RS perspective, it can't serve any request for any table/region, 
 just because of a single write-heavy region.
 This doesn't sound very reasonable, right? Maybe write requests from a 
 region should only be served by a sub-set of the handler threads, so that 
 write blocking of any single region can't lead to the scenario mentioned 
 above?
 Comments?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-9468) Previous active master can still serves RPC request when it is trying recovering expired zk session

2013-09-09 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762683#comment-13762683
 ] 

Feng Honghua commented on HBASE-9468:
-

It sounds like just failing fast for the expired master is a quick and safe 
fix for this issue; a rough sketch is below. Any opinions?
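
A minimal sketch of the fail-fast idea (the hook method below is an 
assumption about where the expiry is handled; HMaster does implement 
Abortable, so abort() is available):

  // on ZK session expiry, abort instead of trying to recover the session,
  // so a backup master can take over cleanly
  void onSessionExpired(KeeperException cause) {
    master.abort("ZK session expired; failing fast so a backup master "
        + "can take over", cause);
  }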

 Previous active master can still serves RPC request when it is trying 
 recovering expired zk session
 ---

 Key: HBASE-9468
 URL: https://issues.apache.org/jira/browse/HBASE-9468
 Project: HBase
  Issue Type: Bug
Reporter: Feng Honghua

 When the active master's zk session expires, it'll try to recover the zk 
 session, but without turning off its RpcServer. What if a previous backup 
 master has already become the new active master, and some client tries to 
 send requests to this expired master using the cached master info? Any 
 problem here?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-9468) Previous active master can still serves RPC request when it is trying recovering expired zk session

2013-09-09 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762684#comment-13762684
 ] 

Feng Honghua commented on HBASE-9468:
-

[~enis] I agree with you :-). Not sure whether there are further concerns 
about the recovery logic for an expired master.

 Previous active master can still serves RPC request when it is trying 
 recovering expired zk session
 ---

 Key: HBASE-9468
 URL: https://issues.apache.org/jira/browse/HBASE-9468
 Project: HBase
  Issue Type: Bug
Reporter: Feng Honghua

 When the active master's zk session expires, it'll try to recover the zk 
 session, but without turning off its RpcServer. What if a previous backup 
 master has already become the new active master, and some client tries to 
 send requests to this expired master using the cached master info? Any 
 problem here?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-9469) Synchronous replication

2013-09-09 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762686#comment-13762686
 ] 

Feng Honghua commented on HBASE-9469:
-

[~jdcryans] and [~lhofhansl]: Is there any plan for synchronous replication? 
It would be a really nice feature for applications requiring strict data 
safety/consistency across clusters.

 Synchronous replication
 ---

 Key: HBASE-9469
 URL: https://issues.apache.org/jira/browse/HBASE-9469
 Project: HBase
  Issue Type: New Feature
Reporter: Feng Honghua
Priority: Minor

 Scenario: 
 A/B clusters with master-master replication: the client writes to cluster A, 
 A pushes all writes to cluster B, and when cluster A is down, the client 
 switches to writing to cluster B.
 But the client's write switch is unsafe because the replication between A 
 and B is asynchronous: a delete sent to cluster B that aims to remove a put 
 written earlier can fail because that put was written to cluster A and 
 wasn't successfully pushed to B before A went down. It gets worse if this 
 delete is collected (a flush and then a major compact occur) before cluster 
 A comes back up and the put is eventually pushed to B: that put will never 
 be deleted.
 Can we provide per-table/per-peer synchronous replication, which ships the 
 corresponding hlog entry of a write before responding success to the client? 
 With this we can guarantee to the client that all write requests for which 
 it got a success response when writing to cluster A are already in cluster B 
 as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-9468) Previous active master can still serves RPC request when it is trying recovering expired zk session

2013-09-09 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762754#comment-13762754
 ] 

Feng Honghua commented on HBASE-9468:
-

[~stack] OK, I'll provide a patch according to the comment

 Previous active master can still serves RPC request when it is trying 
 recovering expired zk session
 ---

 Key: HBASE-9468
 URL: https://issues.apache.org/jira/browse/HBASE-9468
 Project: HBase
  Issue Type: Bug
Reporter: Feng Honghua

 When the active master's zk session expires, it'll try to recover the zk 
 session, but without turning off its RpcServer. What if a previous backup 
 master has already become the new active master, and some client tries to 
 send requests to this expired master using the cached master info? Any 
 problem here?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-9469) Synchronous replication

2013-09-09 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762755#comment-13762755
 ] 

Feng Honghua commented on HBASE-9469:
-

[~lhofhansl] Yes, the better data safety/consistency of synchronous 
replication comes at the cost of higher latency. Maybe it's more acceptable to 
make it per-peer/per-table configurable; let me try to provide a patch 
accordingly.

 Synchronous replication
 ---

 Key: HBASE-9469
 URL: https://issues.apache.org/jira/browse/HBASE-9469
 Project: HBase
  Issue Type: New Feature
Reporter: Feng Honghua
Priority: Minor

 Scenario: 
 A/B clusters with master-master replication: the client writes to cluster A, 
 A pushes all writes to cluster B, and when cluster A is down, the client 
 switches to writing to cluster B.
 But the client's write switch is unsafe because the replication between A 
 and B is asynchronous: a delete sent to cluster B that aims to remove a put 
 written earlier can fail because that put was written to cluster A and 
 wasn't successfully pushed to B before A went down. It gets worse if this 
 delete is collected (a flush and then a major compact occur) before cluster 
 A comes back up and the put is eventually pushed to B: that put will never 
 be deleted.
 Can we provide per-table/per-peer synchronous replication, which ships the 
 corresponding hlog entry of a write before responding success to the client? 
 With this we can guarantee to the client that all write requests for which 
 it got a success response when writing to cluster A are already in cluster B 
 as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-09-08 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13761563#comment-13761563
 ] 

Feng Honghua commented on HBASE-8755:
-

Thanks [~stack]. Looking forward to your test results on HDFS.

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: Performance, wal
Reporter: Feng Honghua
Assignee: stack
Priority: Critical
 Fix For: 0.96.1

 Attachments: 8755trunkV2.txt, HBASE-8755-0.94-V0.patch, 
 HBASE-8755-0.94-V1.patch, HBASE-8755-trunk-V0.patch, HBASE-8755-trunk-V1.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each 
 write, which incurs heavy race conditions on updateLock and flushLock.
 The only existing optimization (checking whether the current syncTillHere > 
 txid, in the hope that another thread has already written/synced this txid 
 to hdfs so the write/sync can be skipped) actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 7+). 
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 7 for 5 RS); the 1 RS write throughput (1K 
 row-size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes in the new write thread model are as below:
  1. All put handler threads append their edits to HLog's local pending 
 buffer, and notify the AsyncWriter thread that there are new edits in the 
 buffer;
  2. All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3. A single AsyncWriter thread retrieves all the buffered edits from HLog's 
 local pending buffer and writes them to hdfs (hlog.writer.append), then 
 notifies the AsyncFlusher thread that there are new writes to hdfs that need 
 a sync;
  4. A single AsyncFlusher thread issues a sync to hdfs to persist the writes 
 made by AsyncWriter, then notifies the AsyncNotifier thread that the sync 
 watermark has increased;
  5. A single AsyncNotifier thread notifies all pending put handler threads 
 waiting in the HLog.syncer() function;
  6. There is no LogSyncer thread any more (the AsyncWriter/AsyncFlusher 
 threads now do its job).
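
A toy, single-monitor condensation of that handoff (the actual patch uses 
separate locks and dedicated threads; all identifiers here are made up):

  import java.util.ArrayList;
  import java.util.List;

  public class PipelinedWalSketch {
    private final List<byte[]> pendingBuffer = new ArrayList<byte[]>();
    private long lastTxid = 0;        // txid handed to the latest append
    private long syncedTillHere = 0;  // highest txid known to be persisted

    // put-handler side: append an edit, get a txid back
    public synchronized long append(byte[] edit) {
      pendingBuffer.add(edit);
      notifyAll();                    // wake the writer thread
      return ++lastTxid;
    }

    // put-handler side: block until our txid is covered by a sync
    public synchronized void syncer(long txid) throws InterruptedException {
      while (syncedTillHere < txid) {
        wait();                       // woken when the watermark advances
      }
    }

    // writer + flusher side, one round (normally two dedicated threads):
    public synchronized void drainAndSync() {
      long upTo = lastTxid;
      pendingBuffer.clear();          // stand-in for hlog.writer.append(...)
      // ... hlog.writer.sync() would happen here ...
      syncedTillHere = upTo;          // advance the sync watermark
      notifyAll();                    // AsyncNotifier role: release waiters
    }
  }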

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-07-28 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13721973#comment-13721973
 ] 

Feng Honghua commented on HBASE-8755:
-

[~jmspaggi], thanks for your test. Some questions about it: Is it against real 
HDFS? How many data-nodes and RS? What's the write pressure (client count, 
write thread count)? What's the total throughput you get?

Yes, this jira aims at throughput improvement under write-intensive load, so 
it should be tested and verified under write-intensive load against a real 
cluster / HDFS environment. And as you can see, this jira only refactors the 
write thread model rather than tuning any write sub-phase along the write path 
of an individual request, so no obvious improvement is expected under 
low/ordinary write pressure.

If you have a real cluster environment with 4 data-nodes, it would be better 
to re-do the test Chunhui and I did, with a similar test configuration/load as 
listed in detail in the comments above. One client with 200 write threads is 
OK for pressing a single RS, and 4 clients each with 200 write threads for 
pressing 4 RS.

Thanks again.

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, 
 HBASE-8755-trunk-V0.patch, HBASE-8755-trunk-V1.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each 
 write, which incurs heavy race conditions on updateLock and flushLock.
 The only existing optimization (checking whether the current syncTillHere > 
 txid, in the hope that another thread has already written/synced this txid 
 to hdfs so the write/sync can be skipped) actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 7+). 
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 7 for 5 RS); the 1 RS write throughput (1K 
 row-size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes in the new write thread model are as below:
  1. All put handler threads append their edits to HLog's local pending 
 buffer, and notify the AsyncWriter thread that there are new edits in the 
 buffer;
  2. All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3. A single AsyncWriter thread retrieves all the buffered edits from HLog's 
 local pending buffer and writes them to hdfs (hlog.writer.append), then 
 notifies the AsyncFlusher thread that there are new writes to hdfs that need 
 a sync;
  4. A single AsyncFlusher thread issues a sync to hdfs to persist the writes 
 made by AsyncWriter, then notifies the AsyncNotifier thread that the sync 
 watermark has increased;
  5. A single AsyncNotifier thread notifies all pending put handler threads 
 waiting in the HLog.syncer() function;
  6. There is no LogSyncer thread any more (the AsyncWriter/AsyncFlusher 
 threads now do its job).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-07-24 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13718270#comment-13718270
 ] 

Feng Honghua commented on HBASE-8755:
-

[~jmspaggi] HBASE-8755-0.94-V1.patch is good. Let me know if there is any 
problem. Thanks, Jean-Marc.

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, 
 HBASE-8755-trunk-V0.patch, HBASE-8755-trunk-V1.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each 
 write, which incurs heavy race conditions on updateLock and flushLock.
 The only existing optimization (checking whether the current syncTillHere > 
 txid, in the hope that another thread has already written/synced this txid 
 to hdfs so the write/sync can be skipped) actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 7+). 
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 7 for 5 RS); the 1 RS write throughput (1K 
 row-size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes in the new write thread model are as below:
  1. All put handler threads append their edits to HLog's local pending 
 buffer, and notify the AsyncWriter thread that there are new edits in the 
 buffer;
  2. All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3. A single AsyncWriter thread retrieves all the buffered edits from HLog's 
 local pending buffer and writes them to hdfs (hlog.writer.append), then 
 notifies the AsyncFlusher thread that there are new writes to hdfs that need 
 a sync;
  4. A single AsyncFlusher thread issues a sync to hdfs to persist the writes 
 made by AsyncWriter, then notifies the AsyncNotifier thread that the sync 
 watermark has increased;
  5. A single AsyncNotifier thread notifies all pending put handler threads 
 waiting in the HLog.syncer() function;
  6. There is no LogSyncer thread any more (the AsyncWriter/AsyncFlusher 
 threads now do its job).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-07-23 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-8755:


Attachment: HBASE-8755-trunk-V1.patch

Updated patch rebased on the latest trunk code base.

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, 
 HBASE-8755-trunk-V0.patch, HBASE-8755-trunk-V1.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each 
 write, which incurs heavy race conditions on updateLock and flushLock.
 The only existing optimization (checking whether the current syncTillHere > 
 txid, in the hope that another thread has already written/synced this txid 
 to hdfs so the write/sync can be skipped) actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 7+). 
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 7 for 5 RS); the 1 RS write throughput (1K 
 row-size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes in the new write thread model are as below:
  1. All put handler threads append their edits to HLog's local pending 
 buffer, and notify the AsyncWriter thread that there are new edits in the 
 buffer;
  2. All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3. A single AsyncWriter thread retrieves all the buffered edits from HLog's 
 local pending buffer and writes them to hdfs (hlog.writer.append), then 
 notifies the AsyncFlusher thread that there are new writes to hdfs that need 
 a sync;
  4. A single AsyncFlusher thread issues a sync to hdfs to persist the writes 
 made by AsyncWriter, then notifies the AsyncNotifier thread that the sync 
 watermark has increased;
  5. A single AsyncNotifier thread notifies all pending put handler threads 
 waiting in the HLog.syncer() function;
  6. There is no LogSyncer thread any more (the AsyncWriter/AsyncFlusher 
 threads now do its job).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-07-15 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708473#comment-13708473
 ] 

Feng Honghua commented on HBASE-8755:
-

Thanks a lot [~jmspaggi], looking forward to your results.

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, 
 HBASE-8755-trunk-V0.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each 
 write, which incurs heavy race conditions on updateLock and flushLock.
 The only existing optimization (checking whether the current syncTillHere > 
 txid, in the hope that another thread has already written/synced this txid 
 to hdfs so the write/sync can be skipped) actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 7+). 
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 7 for 5 RS); the 1 RS write throughput (1K 
 row-size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes in the new write thread model are as below:
  1. All put handler threads append their edits to HLog's local pending 
 buffer, and notify the AsyncWriter thread that there are new edits in the 
 buffer;
  2. All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3. A single AsyncWriter thread retrieves all the buffered edits from HLog's 
 local pending buffer and writes them to hdfs (hlog.writer.append), then 
 notifies the AsyncFlusher thread that there are new writes to hdfs that need 
 a sync;
  4. A single AsyncFlusher thread issues a sync to hdfs to persist the writes 
 made by AsyncWriter, then notifies the AsyncNotifier thread that the sync 
 watermark has increased;
  5. A single AsyncNotifier thread notifies all pending put handler threads 
 waiting in the HLog.syncer() function;
  6. There is no LogSyncer thread any more (the AsyncWriter/AsyncFlusher 
 threads now do its job).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-07-14 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708189#comment-13708189
 ] 

Feng Honghua commented on HBASE-8755:
-

[~jmspaggi], what's your result from running YCSB against a real cluster 
environment?

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, 
 HBASE-8755-trunk-V0.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each 
 write, which incurs heavy race conditions on updateLock and flushLock.
 The only existing optimization (checking whether the current syncTillHere > 
 txid, in the hope that another thread has already written/synced this txid 
 to hdfs so the write/sync can be skipped) actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 7+). 
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 7 for 5 RS); the 1 RS write throughput (1K 
 row-size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes in the new write thread model are as below:
  1. All put handler threads append their edits to HLog's local pending 
 buffer, and notify the AsyncWriter thread that there are new edits in the 
 buffer;
  2. All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3. A single AsyncWriter thread retrieves all the buffered edits from HLog's 
 local pending buffer and writes them to hdfs (hlog.writer.append), then 
 notifies the AsyncFlusher thread that there are new writes to hdfs that need 
 a sync;
  4. A single AsyncFlusher thread issues a sync to hdfs to persist the writes 
 made by AsyncWriter, then notifies the AsyncNotifier thread that the sync 
 watermark has increased;
  5. A single AsyncNotifier thread notifies all pending put handler threads 
 waiting in the HLog.syncer() function;
  6. There is no LogSyncer thread any more (the AsyncWriter/AsyncFlusher 
 threads now do its job).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp

2013-07-08 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701813#comment-13701813
 ] 

Feng Honghua commented on HBASE-8753:
-

[~lhofhansl] Your comment #2 is covered by this line of code:
  
  if (!hasFamilyStamp || timestamp > familyStamp) {

 Provide new delete flag which can delete all cells under a column-family 
 which have a same designated timestamp
 ---

 Key: HBASE-8753
 URL: https://issues.apache.org/jira/browse/HBASE-8753
 Project: HBase
  Issue Type: New Feature
  Components: Deletes, Scanners
Affects Versions: 0.95.1
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: 8753-trunk-V2.patch, HBASE-8753-0.94-V0.patch, 
 HBASE-8753-trunk-V0.patch, HBASE-8753-trunk-V1.patch


 In one of our production scenarios (Xiaomi message search), multiple cells 
 are put in batch using the same timestamp with different column names under 
 a specific column-family. 
 After some time these cells also need to be deleted in batch, given a 
 specific timestamp. But the column names are parsed tokens which can be 
 arbitrary words, so such a batch delete is impossible without first 
 retrieving all KVs from that CF to build the list of columns that have a KV 
 with the given timestamp, and then issuing an individual deleteColumn for 
 each column in that list.
 Though it's possible to do such a batch delete this way, its performance is 
 poor, and customers also find their code quite clumsy: first retrieving and 
 populating the column list, then issuing a deleteColumn for each column 
 in it.
 This feature resolves the problem by introducing a new delete flag: 
 DeleteFamilyVersion. 
   1). When you need to delete all KVs under a column-family with a given 
 timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a 
 DeleteFamilyVersion type KV is put to HBase (like DeleteFamily / 
 DeleteColumn / Delete), without any read operation;
   2). Like other delete types, DeleteFamilyVersion takes effect in 
 get/scan/flush/compact operations: the ScanDeleteTracker now parses out and 
 uses DeleteFamilyVersion to prevent all KVs under the specific CF that have 
 the same timestamp as the DeleteFamilyVersion KV from popping up as part of 
 a get/scan result (also in flush/compact).
 Our customers find this feature efficient, clean and easy to use, since it 
 does its work without knowing the exact list of column names that need to be 
 deleted. 
 This feature has been running smoothly for a couple of months in our 
 production clusters.
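
A short usage sketch of the new flag (table/row/CF names and the timestamp 
below are made up):

  Configuration conf = HBaseConfiguration.create();
  HTable table = new HTable(conf, "t");
  long ts = 1234567890L;                           // the shared timestamp
  Delete d = new Delete(Bytes.toBytes("row1"));
  d.deleteFamilyVersion(Bytes.toBytes("cf"), ts);  // the new delete flag
  table.delete(d);                                 // no prior read needed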

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp

2013-07-08 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-8753:


Attachment: HBASE-8753-trunk-V3.patch
HBASE-8753-0.94-V1.patch

 Provide new delete flag which can delete all cells under a column-family 
 which have a same designated timestamp
 ---

 Key: HBASE-8753
 URL: https://issues.apache.org/jira/browse/HBASE-8753
 Project: HBase
  Issue Type: New Feature
  Components: Deletes, Scanners
Affects Versions: 0.95.1
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: 8753-trunk-V2.patch, HBASE-8753-0.94-V0.patch, 
 HBASE-8753-0.94-V1.patch, HBASE-8753-trunk-V0.patch, 
 HBASE-8753-trunk-V1.patch, HBASE-8753-trunk-V3.patch


 In one of our production scenarios (Xiaomi message search), multiple cells 
 are put in batch using the same timestamp with different column names under 
 a specific column-family. 
 After some time these cells also need to be deleted in batch, given a 
 specific timestamp. But the column names are parsed tokens which can be 
 arbitrary words, so such a batch delete is impossible without first 
 retrieving all KVs from that CF to build the list of columns that have a KV 
 with the given timestamp, and then issuing an individual deleteColumn for 
 each column in that list.
 Though it's possible to do such a batch delete this way, its performance is 
 poor, and customers also find their code quite clumsy: first retrieving and 
 populating the column list, then issuing a deleteColumn for each column 
 in it.
 This feature resolves the problem by introducing a new delete flag: 
 DeleteFamilyVersion. 
   1). When you need to delete all KVs under a column-family with a given 
 timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a 
 DeleteFamilyVersion type KV is put to HBase (like DeleteFamily / 
 DeleteColumn / Delete), without any read operation;
   2). Like other delete types, DeleteFamilyVersion takes effect in 
 get/scan/flush/compact operations: the ScanDeleteTracker now parses out and 
 uses DeleteFamilyVersion to prevent all KVs under the specific CF that have 
 the same timestamp as the DeleteFamilyVersion KV from popping up as part of 
 a get/scan result (also in flush/compact).
 Our customers find this feature efficient, clean and easy to use, since it 
 does its work without knowing the exact list of column names that need to be 
 deleted. 
 This feature has been running smoothly for a couple of months in our 
 production clusters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp

2013-07-08 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13702008#comment-13702008
 ] 

Feng Honghua commented on HBASE-8753:
-

Thanks [~yuzhih...@gmail.com] and [~lhofhansl]

 Provide new delete flag which can delete all cells under a column-family 
 which have a same designated timestamp
 ---

 Key: HBASE-8753
 URL: https://issues.apache.org/jira/browse/HBASE-8753
 Project: HBase
  Issue Type: New Feature
  Components: Deletes, Scanners
Affects Versions: 0.95.1
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: 8753-trunk-V2.patch, 8753-trunk-v4.txt, 
 HBASE-8753-0.94-V0.patch, HBASE-8753-0.94-V1.patch, 
 HBASE-8753-trunk-V0.patch, HBASE-8753-trunk-V1.patch, 
 HBASE-8753-trunk-V3.patch


 In one of our production scenarios (Xiaomi message search), multiple cells 
 are put in batch using the same timestamp with different column names under 
 a specific column-family. 
 After some time these cells also need to be deleted in batch, given a 
 specific timestamp. But the column names are parsed tokens which can be 
 arbitrary words, so such a batch delete is impossible without first 
 retrieving all KVs from that CF to build the list of columns that have a KV 
 with the given timestamp, and then issuing an individual deleteColumn for 
 each column in that list.
 Though it's possible to do such a batch delete this way, its performance is 
 poor, and customers also find their code quite clumsy: first retrieving and 
 populating the column list, then issuing a deleteColumn for each column 
 in it.
 This feature resolves the problem by introducing a new delete flag: 
 DeleteFamilyVersion. 
   1). When you need to delete all KVs under a column-family with a given 
 timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a 
 DeleteFamilyVersion type KV is put to HBase (like DeleteFamily / 
 DeleteColumn / Delete), without any read operation;
   2). Like other delete types, DeleteFamilyVersion takes effect in 
 get/scan/flush/compact operations: the ScanDeleteTracker now parses out and 
 uses DeleteFamilyVersion to prevent all KVs under the specific CF that have 
 the same timestamp as the DeleteFamilyVersion KV from popping up as part of 
 a get/scan result (also in flush/compact).
 Our customers find this feature efficient, clean and easy to use, since it 
 does its work without knowing the exact list of column names that need to be 
 deleted. 
 This feature has been running smoothly for a couple of months in our 
 production clusters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp

2013-07-06 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701465#comment-13701465
 ] 

Feng Honghua commented on HBASE-8753:
-

[~lhofhansl]: Your #1/#2 comments are good.

For #3, your concern is correct. But since the ScanDeleteTracker is 
reset/cleared (all delete markers) after a row is done, the number of family 
version markers that needs to be tracked at any time during a scan is only 
the number put to the current row/cf, not the number across all rows, and it 
does not accumulate throughout the whole scan. Though in theory a client can 
put an arbitrary number of family version markers to a specific row/cf, in 
practice this number is expected to be at most the number of distinct 
versions (timestamps) among all the cells that the row/cf contains.

 Provide new delete flag which can delete all cells under a column-family 
 which have a same designated timestamp
 ---

 Key: HBASE-8753
 URL: https://issues.apache.org/jira/browse/HBASE-8753
 Project: HBase
  Issue Type: New Feature
  Components: Deletes, Scanners
Affects Versions: 0.95.1
Reporter: Feng Honghua
Assignee: Feng Honghua
 Attachments: 8753-trunk-V2.patch, HBASE-8753-0.94-V0.patch, 
 HBASE-8753-trunk-V0.patch, HBASE-8753-trunk-V1.patch


 In one of our production scenarios (Xiaomi message search), multiple cells 
 are put in batch using the same timestamp with different column names under 
 a specific column-family. 
 After some time these cells also need to be deleted in batch, given a 
 specific timestamp. But the column names are parsed tokens which can be 
 arbitrary words, so such a batch delete is impossible without first 
 retrieving all KVs from that CF to build the list of columns that have a KV 
 with the given timestamp, and then issuing an individual deleteColumn for 
 each column in that list.
 Though it's possible to do such a batch delete this way, its performance is 
 poor, and customers also find their code quite clumsy: first retrieving and 
 populating the column list, then issuing a deleteColumn for each column 
 in it.
 This feature resolves the problem by introducing a new delete flag: 
 DeleteFamilyVersion. 
   1). When you need to delete all KVs under a column-family with a given 
 timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a 
 DeleteFamilyVersion type KV is put to HBase (like DeleteFamily / 
 DeleteColumn / Delete), without any read operation;
   2). Like other delete types, DeleteFamilyVersion takes effect in 
 get/scan/flush/compact operations: the ScanDeleteTracker now parses out and 
 uses DeleteFamilyVersion to prevent all KVs under the specific CF that have 
 the same timestamp as the DeleteFamilyVersion KV from popping up as part of 
 a get/scan result (also in flush/compact).
 Our customers find this feature efficient, clean and easy to use, since it 
 does its work without knowing the exact list of column names that need to be 
 deleted. 
 This feature has been running smoothly for a couple of months in our 
 production clusters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-06-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13688936#comment-13688936
 ] 

Feng Honghua commented on HBASE-8755:
-

[~zjushch] We ran the same tests as yours, and below are the results:

  1). One YCSB client with 5/50/200 write threads respectively
  2). One RS with 300 RPC handlers, 20 regions (5 data-nodes back-end HDFS 
running CDH 4.1.1)
  3). row-size = 150 bytes

threads  row-count  new-throughput  new-latency  old-throughput  old-latency
---
5        20         3191            1.551(ms)    3172            1.561(ms)
50       200        23215           2.131(ms)    7437            6.693(ms)
200      200        35793           5.450(ms)    10816           18.312(ms)
---

A). the difference is negligible with 5 YCSB client threads
B). the new model still shows a 3X+ improvement over the old model with 
50/200 threads

Can anybody else help do similar tests using the same test configuration as 
Chunhui?


 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, 
 HBASE-8755-trunk-V0.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each 
 write, which incurs heavy race conditions on updateLock and flushLock.
 The only existing optimization (checking whether the current syncTillHere > 
 txid, in the hope that another thread has already written/synced this txid 
 to hdfs so the write/sync can be skipped) actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 7+). 
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 7 for 5 RS); the 1 RS write throughput (1K 
 row-size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes in the new write thread model are as below:
  1. All put handler threads append their edits to HLog's local pending 
 buffer, and notify the AsyncWriter thread that there are new edits in the 
 buffer;
  2. All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3. A single AsyncWriter thread retrieves all the buffered edits from HLog's 
 local pending buffer and writes them to hdfs (hlog.writer.append), then 
 notifies the AsyncFlusher thread that there are new writes to hdfs that need 
 a sync;
  4. A single AsyncFlusher thread issues a sync to hdfs to persist the writes 
 made by AsyncWriter, then notifies the AsyncNotifier thread that the sync 
 watermark has increased;
  5. A single AsyncNotifier thread notifies all pending put handler threads 
 waiting in the HLog.syncer() function;
  6. There is no LogSyncer thread any more (the AsyncWriter/AsyncFlusher 
 threads now do its job).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-06-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13688957#comment-13688957
 ] 

Feng Honghua commented on HBASE-8755:
-

[~zjushch]: We ran the same tests as yours, and below are the results:
1). One YCSB client with 5/50/200 write threads respectively
2). One RS with 300 RPC handlers, 20 regions (5 data-nodes back-end HDFS 
running CDH 4.1.1)
3). row-size = 150 bytes
||client-threads ||row-count ||new-model throughput ||new-model latency 
||old-model throughput ||old-model latency||
|5 |20 |3191 |1.551(ms) |3172 |1.561(ms)|
|50 |200 |23215 |2.131(ms) |7437 |6.693(ms)|
|200 |200 |35793 |5.450(ms) |10816 |18.312(ms)|
A). the difference is negligible with 5 YCSB client threads; this is because 
with so few writers there is little contention on the write path, so the old 
model is not yet the bottleneck
B). the new model still shows a 3X+ improvement over the old model with 
50/200 threads
Can anybody else help do the tests using the same configurations as Chunhui?

Another guess is that the HDFS used by Chunhui has much better performance on 
HLog's write/sync, which would make the new model in HBase have less impact. 
Just a guess.

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, 
 HBASE-8755-trunk-V0.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each 
 write, which incurs heavy race conditions on updateLock and flushLock.
 The only existing optimization (checking whether the current syncTillHere > 
 txid, in the hope that another thread has already written/synced this txid 
 to hdfs so the write/sync can be skipped) actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 7+). 
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 7 for 5 RS); the 1 RS write throughput (1K 
 row-size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes in the new write thread model are as below:
  1. All put handler threads append their edits to HLog's local pending 
 buffer, and notify the AsyncWriter thread that there are new edits in the 
 buffer;
  2. All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3. A single AsyncWriter thread retrieves all the buffered edits from HLog's 
 local pending buffer and writes them to hdfs (hlog.writer.append), then 
 notifies the AsyncFlusher thread that there are new writes to hdfs that need 
 a sync;
  4. A single AsyncFlusher thread issues a sync to hdfs to persist the writes 
 made by AsyncWriter, then notifies the AsyncNotifier thread that the sync 
 watermark has increased;
  5. A single AsyncNotifier thread notifies all pending put handler threads 
 waiting in the HLog.syncer() function;
  6. There is no LogSyncer thread any more (the AsyncWriter/AsyncFlusher 
 threads now do its job).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-06-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13688973#comment-13688973
 ] 

Feng Honghua commented on HBASE-8755:
-

Our comparison tests differ only in the RS bits; everything else 
(client/HDFS/cluster/row-size...) remains the same. 

The client runs on a different machine from the RS; we don't run the client 
on the RS because almost all of our applications that use HBase run on 
machines separate from the HBase cluster.

Actually we have never seen throughput as high as 18018/24691 for a single RS 
in our cluster. It's really weird :).

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, 
 HBASE-8755-trunk-V0.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, 
 which incurs heavy race conditions on updateLock and flushLock.
 The only optimization, checking whether the current syncTillHere > txid in 
 the expectation that another thread has already helped write/sync this txid 
 to hdfs (so the write/sync can be omitted), actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 70000+).
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput 
 (1K row size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes for the new write thread model are as below:
  1 All put handler threads append their edits to HLog's local pending buffer 
 (and notify the AsyncWriter thread that there are new edits in the local 
 buffer);
  2 All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3 A single AsyncWriter thread is responsible for retrieving all the 
 buffered edits from HLog's local pending buffer and writing them to hdfs 
 (hlog.writer.append), notifying the AsyncFlusher thread that there are new 
 writes to hdfs that need a sync;
  4 A single AsyncFlusher thread is responsible for issuing a sync to hdfs to 
 persist the writes made by AsyncWriter, notifying the AsyncNotifier thread 
 that the sync watermark has increased;
  5 A single AsyncNotifier thread is responsible for notifying all pending 
 put handler threads which are waiting in the HLog.syncer() function;
  6 No LogSyncer thread any more (the always-running AsyncWriter/AsyncFlusher 
 threads do the same job it did).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-06-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689094#comment-13689094
 ] 

Feng Honghua commented on HBASE-8755:
-

If possible, would anybody else run the same comparison test as Chunhui and 
me? Thanks in advance. [~lhofhansl] [~yuzhih...@gmail.com] [~sershe] [~stack]

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, 
 HBASE-8755-trunk-V0.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, 
 which incurs heavy race conditions on updateLock and flushLock.
 The only optimization, checking whether the current syncTillHere > txid in 
 the expectation that another thread has already helped write/sync this txid 
 to hdfs (so the write/sync can be omitted), actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 70000+).
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput 
 (1K row size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes for the new write thread model are as below:
  1 All put handler threads append their edits to HLog's local pending buffer 
 (and notify the AsyncWriter thread that there are new edits in the local 
 buffer);
  2 All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3 A single AsyncWriter thread is responsible for retrieving all the 
 buffered edits from HLog's local pending buffer and writing them to hdfs 
 (hlog.writer.append), notifying the AsyncFlusher thread that there are new 
 writes to hdfs that need a sync;
  4 A single AsyncFlusher thread is responsible for issuing a sync to hdfs to 
 persist the writes made by AsyncWriter, notifying the AsyncNotifier thread 
 that the sync watermark has increased;
  5 A single AsyncNotifier thread is responsible for notifying all pending 
 put handler threads which are waiting in the HLog.syncer() function;
  6 No LogSyncer thread any more (the always-running AsyncWriter/AsyncFlusher 
 threads do the same job it did).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete

2013-06-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689160#comment-13689160
 ] 

Feng Honghua commented on HBASE-8721:
-

Let me list some merits of the behavior 'a delete can't mask puts that happen 
after the delete':

1) It avoids the inconsistency I mentioned above; with our patch, the user 
can always read the put written at step 4, which is more natural and 
intuitive (see the sketch after this list):

  1 put a kv (timestamp = T0), and flush;
  2 delete that kv using a DeleteColumn type kv with timestamp T0 (or any 
timestamp >= T0), and flush;
  3 a major compaction occurs [or not];
  4 put that kv again (timestamp = T0);
  5 read that kv;
  ===>
  a) if a major compaction occurred at step 3, then step 5 gets the put 
written at step 4;
  b) if no major compaction occurred at step 3, then step 5 gets nothing.

2) It can provide a strong guarantee for this operation: 'I don't know 
which/how many versions are in a cell; I just want to (after removing all 
existing versions) put a new version into it and ensure only this new put is 
in the cell, regardless of how its ts compares with the old existing ones.' 
(I think this operation/guarantee is useful in many scenarios.) The current 
delete behavior can't provide such a guarantee.

3) 'delete latest version' (deleteColumn() without a ts) can be tuned to drop 
the read (of the latest version, to obtain its ts) during 'deleteColumn'. The 
current delete behavior can't be tuned to drop that read operation.

4) 'A new put can't be masked (made to disappear) by an old/existing delete' 
is itself a merit for many use cases/applications, since it's more natural 
and intuitive. I have explained the old version/delete semantics many times 
to different customers, and without exception their first response is 
'weird... why so?'
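
Here is the sketch referenced in merit 1): a hedged illustration of the 
five-step sequence on the 0.94-era client API. The table/family/qualifier 
names are made up, the table is assumed to already exist with family 'f', and 
flush()/majorCompact() are asynchronous administrative calls in practice, so 
this illustrates the ordering rather than being a reliable test.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteMasksFuturePut {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "demo");
    HBaseAdmin admin = new HBaseAdmin(conf);
    byte[] row = Bytes.toBytes("r1");
    byte[] cf = Bytes.toBytes("f");
    byte[] q = Bytes.toBytes("q");
    long t0 = 1000L;

    Put p1 = new Put(row);
    p1.add(cf, q, t0, Bytes.toBytes("v1"));
    table.put(p1);                      // step 1: put a kv @ T0 ...
    admin.flush("demo");                // ... and flush

    Delete d = new Delete(row);
    d.deleteColumns(cf, q, t0);         // step 2: DeleteColumn tombstone @ T0 ...
    table.delete(d);
    admin.flush("demo");                // ... and flush

    admin.majorCompact("demo");         // step 3: major compaction occurs [or not]

    Put p2 = new Put(row);
    p2.add(cf, q, t0, Bytes.toBytes("v2"));
    table.put(p2);                      // step 4: put that kv again @ T0

    Result r = table.get(new Get(row)); // step 5: read
    // a) tombstone collected by the compaction: returns "v2"
    // b) compaction hasn't run: tombstone still masks the new put, empty result
    System.out.println(r.isEmpty() ? "nothing" : Bytes.toString(r.value()));
  }
}
{code}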

Per my understanding, contrary to [~lhofhansl] and [~sershe], 'timestamp' is 
just a long used to determine version ordering by the rule 'the bigger/later 
wins'. It just happens that the time-semantic timestamp is a long, that a new 
put with its 'current' timestamp carries a bigger value, and that in most 
cases new put versions therefore knock out older ones. For many use cases the 
time semantic of 'timestamp' is enough for the real-world requirement, but by 
design it's not always the case; otherwise the timestamp wouldn't be exposed 
for users to set explicitly.

In a word, as long as the user knows 'timestamp' is only a long-typed 
dimension that determines version ordering by the rule 'the bigger wins', he 
can reason out the result of any sequence of operations. In essence, 
'timestamp as a dimension for version ordering' is not related to delete 
semantics.

-- I know my understanding is arguable for many folks, since the old delete 
semantics and behavior have existed for so long that everybody has taken them 
for granted (I mean no offence here).


At last, let me also list the downsides of the alternative solutions proposed 
to me (a sketch of option A's flag follows below):

A. 'KEEP_DELETED_CELLS' is definitely a nice feature, but many users don't 
need it (to time-travel or trace back action history), and it prevents major 
compaction from shrinking the data set by collecting deleted cells.

B. Disallowing users from explicitly setting the timestamp limits HBase's 
schema flexibility and prohibits many innovative designs, such as Facebook's 
message search index; and in the end it can't guarantee unique timestamps, so 
it can still lead to tricky/confusing behavior.
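
For reference, option A is an existing per-family flag in 0.94; below is a 
minimal sketch of enabling it (the table and family names are illustrative).

{code:java}
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;

public class KeepDeletedCellsTable {
  public static HTableDescriptor build() {
    HColumnDescriptor cf = new HColumnDescriptor("f");
    // deleted cells survive (for point-in-time queries) until TTL/VERSIONS
    // limits collect them, so major compaction shrinks the data set less
    cf.setKeepDeletedCells(true);
    HTableDescriptor htd = new HTableDescriptor("demo");
    htd.addFamily(cf);
    return htd;
  }
}
{code}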

 Deletes can mask puts that happen after the delete
 --

 Key: HBASE-8721
 URL: https://issues.apache.org/jira/browse/HBASE-8721
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: Feng Honghua
 Attachments: HBASE-8721-0.94-V0.patch


 this fix aims at the bug mentioned in http://hbase.apache.org/book.html 5.8.2.1:
 "Deletes mask puts, even puts that happened after the delete was entered. 
 Remember that a delete writes a tombstone, which only disappears after the 
 next major compaction has run. Suppose you do a delete of everything <= T. 
 After this you do a new put with a timestamp <= T. This put, even if it 
 happened after the delete, will be masked by the delete tombstone. Performing 
 the put will not fail, but when you do a get you will notice the put had no 
 effect. It will start working again after the major compaction has run. 
 These issues should not be a problem if you use always-increasing versions 
 for new puts to a row. But they can occur even if you do not care about time: 
 just do a delete and a put immediately after each other, and there is some 
 chance they happen within the same millisecond."

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp

2013-06-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689229#comment-13689229
 ] 

Feng Honghua commented on HBASE-8753:
-

[~lhofhansl] Regarding backwards compatibility: when an old RS processes a 
DeleteFamilyVersion type kv (either written from a new client, or in the two 
scenarios you mentioned regarding rolling restart), the DeleteFamilyVersion 
can enter the ScanDeleteTracker, and the only effect it has is that, when 
there is no DeleteColumn for the null column with the same timestamp as this 
DeleteFamilyVersion, it can delete the KV (column=null) with the same 
timestamp (a bit like a Delete(DeleteVersion) with the same timestamp); there 
is no other side effect.

In summary: DeleteFamilyVersion masks all the versions with a given timestamp 
under a CF, and when an old RS receives it (written from a new client, or in 
the two scenarios mentioned regarding rolling restart), the old RS treats it 
like a Delete(DeleteVersion) for the null column. Nothing else.

I think this side effect is acceptable. Your opinion?

 Provide new delete flag which can delete all cells under a column-family 
 which have a same designated timestamp
 ---

 Key: HBASE-8753
 URL: https://issues.apache.org/jira/browse/HBASE-8753
 Project: HBase
  Issue Type: New Feature
  Components: Deletes
Reporter: Feng Honghua
 Attachments: HBASE-8753-0.94-V0.patch, HBASE-8753-trunk-V0.patch


 In one of our production scenarios (Xiaomi message search), multiple cells 
 are put in batch using the same timestamp, with different column names, under 
 a specific column-family. 
 After some time these cells also need to be deleted in batch for a given 
 timestamp. But the column names are parsed tokens which can be arbitrary 
 words, so such a batch delete is impossible without first retrieving all KVs 
 from that CF, collecting the list of columns that have a KV with that given 
 timestamp, and then issuing an individual deleteColumn for each column in 
 that column list.
 Though it's possible to do such a batch delete this way, its performance is 
 poor, and customers also find their code quite clumsy: first retrieving and 
 populating the column list and then issuing a deleteColumn for each column 
 in that column list.
 This feature resolves the problem by introducing a new delete flag: 
 DeleteFamilyVersion. 
   1). When you need to delete all KVs under a column-family with a given 
 timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a 
 DeleteFamilyVersion type KV is put to HBase (like DeleteFamily / DeleteColumn 
 / Delete), with no read operation;
   2). Like other delete types, DeleteFamilyVersion takes effect in 
 get/scan/flush/compact operations: the ScanDeleteTracker now parses out and 
 uses DeleteFamilyVersion to prevent all KVs under the specific CF that have 
 the same timestamp as the DeleteFamilyVersion KV from popping up as part of 
 a get/scan result (likewise in flush/compact).
 Our customers find this feature efficient, clean and easy to use, since it 
 does its work without knowing the exact list of column names that need to be 
 deleted. 
 This feature has been running smoothly for a couple of months in our 
 production clusters.
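
As a usage sketch of the proposed API (Delete.deleteFamilyVersion is only 
available with the HBASE-8753 patch applied; the table, row and family names 
below are made up):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteFamilyVersionExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "msg_index");
    long batchTs = 1371600000000L;  // the timestamp the batch of cells was written with

    // One tombstone masks every cell in CF 'tokens' carrying exactly batchTs,
    // whatever the (arbitrary) column names are; no read-before-delete needed.
    Delete d = new Delete(Bytes.toBytes("user42"));
    d.deleteFamilyVersion(Bytes.toBytes("tokens"), batchTs);
    table.delete(d);
    table.close();
  }
}
{code}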

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete

2013-06-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689935#comment-13689935
 ] 

Feng Honghua commented on HBASE-8721:
-


bq. [~lhofhansl] Now we're extending this to Puts as well, and are saying 
that a Put that hits the RegionServer later should be considered newer even 
if its TS is old, this opens another can of worms

  ===> Maybe you misunderstand me here; I never proposed 'a Put that hits the 
RegionServer later should be considered newer even if its TS is old'. The 
sequence 'put T3, put T2, put T1' (where T3 > T2 > T1) against a CF with 
max-versions = 2 will result in (T3, T2), with T3 as the first version, even 
though T1 is the last one to hit the RS. This is what I mean by 'timestamp is 
the only dimension that determines version ordering/survival, by the rule 
"the bigger wins"'. A hedged illustration follows below.
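
This snippet illustrates the point on the 0.94 client API; the names are made 
up, and the CF 'f' is assumed to have been created with VERSIONS => 2.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionOrdering {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable t = new HTable(conf, "demo");   // CF 'f' created with VERSIONS => 2
    byte[] row = Bytes.toBytes("r");
    byte[] f = Bytes.toBytes("f");
    byte[] q = Bytes.toBytes("q");
    long T1 = 100, T2 = 200, T3 = 300;
    for (long ts : new long[] { T3, T2, T1 }) {  // arrival order: T3, T2, T1
      Put p = new Put(row);
      p.add(f, q, ts, Bytes.toBytes("v" + ts));
      t.put(p);
    }
    Get g = new Get(row);
    g.setMaxVersions();                    // ask for all retained versions
    // surviving versions are (T3, T2): the two biggest timestamps win,
    // even though the T1 put arrived last
    System.out.println(t.get(g).list());
  }
}
{code}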

  ===> What I proposed is this (possibly behind a config, to give customers 
the option of this behavior): the delete masks (existing) puts with 
timestamps less than or equal to its own (unchanged), and customers can 
choose whether the delete can also mask puts not yet written to HBase (future 
puts), according to their individual real-world application 
logic/requirements.


bq. KEEP_DELETED_CELLS would still work fine, but their main goal is to allow 
correct point-in-time queries, which among others is important for consistent 
backups

  ===> KEEP_DELETED_CELLS can indeed prevent the inconsistency in the example 
scenario 'put -> delete -> (major-compact) -> put -> get', and it provides 
the consistent result of 'get nothing'. But this result is also unacceptable 
for our customers, since they expect the later 'put' not to be masked by the 
earlier delete.

 Deletes can mask puts that happen after the delete
 --

 Key: HBASE-8721
 URL: https://issues.apache.org/jira/browse/HBASE-8721
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: Feng Honghua
 Attachments: HBASE-8721-0.94-V0.patch


 this fix aims at the bug mentioned in http://hbase.apache.org/book.html 5.8.2.1:
 "Deletes mask puts, even puts that happened after the delete was entered. 
 Remember that a delete writes a tombstone, which only disappears after the 
 next major compaction has run. Suppose you do a delete of everything <= T. 
 After this you do a new put with a timestamp <= T. This put, even if it 
 happened after the delete, will be masked by the delete tombstone. Performing 
 the put will not fail, but when you do a get you will notice the put had no 
 effect. It will start working again after the major compaction has run. 
 These issues should not be a problem if you use always-increasing versions 
 for new puts to a row. But they can occur even if you do not care about time: 
 just do a delete and a put immediately after each other, and there is some 
 chance they happen within the same millisecond."

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete

2013-06-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689938#comment-13689938
 ] 

Feng Honghua commented on HBASE-8721:
-


[~apurtell] / [~lhofhansl]: Maybe I'm missing something here, but is there a 
-1 on providing a config for this adjustment?

 Deletes can mask puts that happen after the delete
 --

 Key: HBASE-8721
 URL: https://issues.apache.org/jira/browse/HBASE-8721
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: Feng Honghua
 Attachments: HBASE-8721-0.94-V0.patch


 this fix aims at the bug mentioned in http://hbase.apache.org/book.html 5.8.2.1:
 "Deletes mask puts, even puts that happened after the delete was entered. 
 Remember that a delete writes a tombstone, which only disappears after the 
 next major compaction has run. Suppose you do a delete of everything <= T. 
 After this you do a new put with a timestamp <= T. This put, even if it 
 happened after the delete, will be masked by the delete tombstone. Performing 
 the put will not fail, but when you do a get you will notice the put had no 
 effect. It will start working again after the major compaction has run. 
 These issues should not be a problem if you use always-increasing versions 
 for new puts to a row. But they can occur even if you do not care about time: 
 just do a delete and a put immediately after each other, and there is some 
 chance they happen within the same millisecond."

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete

2013-06-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689947#comment-13689947
 ] 

Feng Honghua commented on HBASE-8721:
-


bq. [~sershe] btw, HBase does support point version deletes as far as I see. 
So a specific version can be deleted if desired. Should we add APIs to delete 
the latest version? We can even add an API to delete all existing versions; 
it won't be very efficient with many versions (scan or get + a bunch of 
deletes on the server side), but it will work without changing internals.

  ===> Yes, I know what you mean. But my point is that deleteColumn (without 
providing a timestamp, AKA 'delete latest version') is not efficient, since 
it incurs a 'read' in the RS to get the timestamp of the latest version (and 
set it on the Delete type kv). This operation can be tuned by removing that 
'read' in the RS. You can find the implementation details in one of my 
comments above.

 Deletes can mask puts that happen after the delete
 --

 Key: HBASE-8721
 URL: https://issues.apache.org/jira/browse/HBASE-8721
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: Feng Honghua
 Attachments: HBASE-8721-0.94-V0.patch


 this fix aims at the bug mentioned in http://hbase.apache.org/book.html 5.8.2.1:
 "Deletes mask puts, even puts that happened after the delete was entered. 
 Remember that a delete writes a tombstone, which only disappears after the 
 next major compaction has run. Suppose you do a delete of everything <= T. 
 After this you do a new put with a timestamp <= T. This put, even if it 
 happened after the delete, will be masked by the delete tombstone. Performing 
 the put will not fail, but when you do a get you will notice the put had no 
 effect. It will start working again after the major compaction has run. 
 These issues should not be a problem if you use always-increasing versions 
 for new puts to a row. But they can occur even if you do not care about time: 
 just do a delete and a put immediately after each other, and there is some 
 chance they happen within the same millisecond."

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp

2013-06-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689953#comment-13689953
 ] 

Feng Honghua commented on HBASE-8753:
-

By design, a DeleteFamilyVersion KV's qualifier is null, hence its comparison 
to any qualifier must be == 0 or < 0, so execution can never enter the branch 
of throw new IllegalStateException("isDelete failed: deleteBuffer=" ...) when 
an old RS handles a DeleteFamilyVersion. (The DeleteFamilyVersion will be 
mis-interpreted as a Delete type with qualifier = null; a quick check of the 
comparison claim follows below.)

And I'll do the backwards-compatibility test as you suggest.
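
An illustrative check of the comparison claim (Bytes.compareTo is the real 
0.94 utility; the sample qualifier is made up): an empty/null qualifier never 
compares greater than any other qualifier under lexicographic byte order.

{code:java}
import org.apache.hadoop.hbase.util.Bytes;

public class EmptyQualifierOrder {
  public static void main(String[] args) {
    byte[] empty = new byte[0];
    // an empty qualifier against any non-empty one: always negative
    System.out.println(Bytes.compareTo(empty, Bytes.toBytes("anything"))); // < 0
    // an empty qualifier against another empty one: equal
    System.out.println(Bytes.compareTo(empty, empty));                     // == 0
  }
}
{code}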

 Provide new delete flag which can delete all cells under a column-family 
 which have a same designated timestamp
 ---

 Key: HBASE-8753
 URL: https://issues.apache.org/jira/browse/HBASE-8753
 Project: HBase
  Issue Type: New Feature
  Components: Deletes
Reporter: Feng Honghua
 Attachments: HBASE-8753-0.94-V0.patch, HBASE-8753-trunk-V0.patch


 In one of our production scenarios (Xiaomi message search), multiple cells 
 are put in batch using the same timestamp, with different column names, under 
 a specific column-family. 
 After some time these cells also need to be deleted in batch for a given 
 timestamp. But the column names are parsed tokens which can be arbitrary 
 words, so such a batch delete is impossible without first retrieving all KVs 
 from that CF, collecting the list of columns that have a KV with that given 
 timestamp, and then issuing an individual deleteColumn for each column in 
 that column list.
 Though it's possible to do such a batch delete this way, its performance is 
 poor, and customers also find their code quite clumsy: first retrieving and 
 populating the column list and then issuing a deleteColumn for each column 
 in that column list.
 This feature resolves the problem by introducing a new delete flag: 
 DeleteFamilyVersion. 
   1). When you need to delete all KVs under a column-family with a given 
 timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a 
 DeleteFamilyVersion type KV is put to HBase (like DeleteFamily / DeleteColumn 
 / Delete), with no read operation;
   2). Like other delete types, DeleteFamilyVersion takes effect in 
 get/scan/flush/compact operations: the ScanDeleteTracker now parses out and 
 uses DeleteFamilyVersion to prevent all KVs under the specific CF that have 
 the same timestamp as the DeleteFamilyVersion KV from popping up as part of 
 a get/scan result (likewise in flush/compact).
 Our customers find this feature efficient, clean and easy to use, since it 
 does its work without knowing the exact list of column names that need to be 
 deleted. 
 This feature has been running smoothly for a couple of months in our 
 production clusters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete

2013-06-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689975#comment-13689975
 ] 

Feng Honghua commented on HBASE-8721:
-

[~sershe] :

It still needs 10+ ms, at best, for checkAndPut / increment when the KV to 
read isn't in the RS's block cache and requires access to HDFS, even though 
this server-side read saves 2 RPCs.

Actually, two months ago one of Xiaomi's internal customers gave up on using 
checkAndPut because they couldn't afford the poor performance, though they 
did admit they love the atomicity checkAndPut provides.
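
For context, this is the operation under discussion, on the real 0.94 client 
API (the table and column names below are illustrative):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "demo");
    byte[] row = Bytes.toBytes("r");
    byte[] f = Bytes.toBytes("f");
    byte[] q = Bytes.toBytes("q");
    Put p = new Put(row);
    p.add(f, q, Bytes.toBytes("next"));
    // The server first reads the current value of f:q (the read that costs
    // 10+ ms when the KV isn't in the block cache), then applies the put
    // atomically only if the value matches.
    boolean applied = table.checkAndPut(row, f, q, Bytes.toBytes("expected"), p);
    System.out.println(applied);
  }
}
{code}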

 Deletes can mask puts that happen after the delete
 --

 Key: HBASE-8721
 URL: https://issues.apache.org/jira/browse/HBASE-8721
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: Feng Honghua
 Attachments: HBASE-8721-0.94-V0.patch


 this fix aims at the bug mentioned in http://hbase.apache.org/book.html 5.8.2.1:
 "Deletes mask puts, even puts that happened after the delete was entered. 
 Remember that a delete writes a tombstone, which only disappears after the 
 next major compaction has run. Suppose you do a delete of everything <= T. 
 After this you do a new put with a timestamp <= T. This put, even if it 
 happened after the delete, will be masked by the delete tombstone. Performing 
 the put will not fail, but when you do a get you will notice the put had no 
 effect. It will start working again after the major compaction has run. 
 These issues should not be a problem if you use always-increasing versions 
 for new puts to a row. But they can occur even if you do not care about time: 
 just do a delete and a put immediately after each other, and there is some 
 chance they happen within the same millisecond."

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp

2013-06-19 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13687810#comment-13687810
 ] 

Feng Honghua commented on HBASE-8753:
-

[~lhofhansl] It would be better to have it in 0.94. Thanks.

 Provide new delete flag which can delete all cells under a column-family 
 which have a same designated timestamp
 ---

 Key: HBASE-8753
 URL: https://issues.apache.org/jira/browse/HBASE-8753
 Project: HBase
  Issue Type: New Feature
  Components: Deletes
Reporter: Feng Honghua
 Attachments: HBASE-8753-0.94-V0.patch, HBASE-8753-trunk-V0.patch


 In one of our production scenarios (Xiaomi message search), multiple cells 
 are put in batch using the same timestamp, with different column names, under 
 a specific column-family. 
 After some time these cells also need to be deleted in batch for a given 
 timestamp. But the column names are parsed tokens which can be arbitrary 
 words, so such a batch delete is impossible without first retrieving all KVs 
 from that CF, collecting the list of columns that have a KV with that given 
 timestamp, and then issuing an individual deleteColumn for each column in 
 that column list.
 Though it's possible to do such a batch delete this way, its performance is 
 poor, and customers also find their code quite clumsy: first retrieving and 
 populating the column list and then issuing a deleteColumn for each column 
 in that column list.
 This feature resolves the problem by introducing a new delete flag: 
 DeleteFamilyVersion. 
   1). When you need to delete all KVs under a column-family with a given 
 timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a 
 DeleteFamilyVersion type KV is put to HBase (like DeleteFamily / DeleteColumn 
 / Delete), with no read operation;
   2). Like other delete types, DeleteFamilyVersion takes effect in 
 get/scan/flush/compact operations: the ScanDeleteTracker now parses out and 
 uses DeleteFamilyVersion to prevent all KVs under the specific CF that have 
 the same timestamp as the DeleteFamilyVersion KV from popping up as part of 
 a get/scan result (likewise in flush/compact).
 Our customers find this feature efficient, clean and easy to use, since it 
 does its work without knowing the exact list of column names that need to be 
 deleted. 
 This feature has been running smoothly for a couple of months in our 
 production clusters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-06-19 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13688739#comment-13688739
 ] 

Feng Honghua commented on HBASE-8755:
-

Thanks Chunhui for this verification test. We didn't test with small/medium 
write pressure; we'll run tests with small-to-medium write pressure soon and 
provide the numbers when done.

A quick response to your test result: we never saw throughput as high as 
24691 for a single RS in our cluster before we applied the new write thread 
model. We did a series of stress tests for write throughput, and the maximum 
we ever got was about 10000 using 1 YCSB client.

Without patch:
Write Threads: 200; Write Rows: 2000000; Consume Time: 80s
Avg TPS: 24691
With patch:
Write Threads: 200; Write Rows: 2000000; Consume Time: 64s
Avg TPS: 30769
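
(As a sanity check, assuming the 2,000,000-row count restored above, these 
figures are self-consistent: 2,000,000 rows / 80 s ≈ 25,000 and 
2,000,000 rows / 64 s ≈ 31,250, in line with the reported average TPS.)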

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, 
 HBASE-8755-trunk-V0.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, 
 which incurs heavy race conditions on updateLock and flushLock.
 The only optimization, checking whether the current syncTillHere > txid in 
 the expectation that another thread has already helped write/sync this txid 
 to hdfs (so the write/sync can be omitted), actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 70000+).
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput 
 (1K row size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes for the new write thread model are as below:
  1 All put handler threads append their edits to HLog's local pending buffer 
 (and notify the AsyncWriter thread that there are new edits in the local 
 buffer);
  2 All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3 A single AsyncWriter thread is responsible for retrieving all the 
 buffered edits from HLog's local pending buffer and writing them to hdfs 
 (hlog.writer.append), notifying the AsyncFlusher thread that there are new 
 writes to hdfs that need a sync;
  4 A single AsyncFlusher thread is responsible for issuing a sync to hdfs to 
 persist the writes made by AsyncWriter, notifying the AsyncNotifier thread 
 that the sync watermark has increased;
  5 A single AsyncNotifier thread is responsible for notifying all pending 
 put handler threads which are waiting in the HLog.syncer() function;
  6 No LogSyncer thread any more (the always-running AsyncWriter/AsyncFlusher 
 threads do the same job it did).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-06-19 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13688754#comment-13688754
 ] 

Feng Honghua commented on HBASE-8755:
-

[~zjushch]

  What's the row size used in your test? You tested against a backing HDFS, 
not a local disk, right? 

  And would you test using a bit more data (such as 5,000,000 - 10,000,000 
rows)? Thanks. 

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, 
 HBASE-8755-trunk-V0.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, 
 which incurs heavy race conditions on updateLock and flushLock.
 The only optimization, checking whether the current syncTillHere > txid in 
 the expectation that another thread has already helped write/sync this txid 
 to hdfs (so the write/sync can be omitted), actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 70000+).
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput 
 (1K row size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes for the new write thread model are as below:
  1 All put handler threads append their edits to HLog's local pending buffer 
 (and notify the AsyncWriter thread that there are new edits in the local 
 buffer);
  2 All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3 A single AsyncWriter thread is responsible for retrieving all the 
 buffered edits from HLog's local pending buffer and writing them to hdfs 
 (hlog.writer.append), notifying the AsyncFlusher thread that there are new 
 writes to hdfs that need a sync;
  4 A single AsyncFlusher thread is responsible for issuing a sync to hdfs to 
 persist the writes made by AsyncWriter, notifying the AsyncNotifier thread 
 that the sync watermark has increased;
  5 A single AsyncNotifier thread is responsible for notifying all pending 
 put handler threads which are waiting in the HLog.syncer() function;
  6 No LogSyncer thread any more (the always-running AsyncWriter/AsyncFlusher 
 threads do the same job it did).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-06-18 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13686700#comment-13686700
 ] 

Feng Honghua commented on HBASE-8755:
-

Thanks [~yuzhih...@gmail.com] and [~stack] for the detailed review. I've made 
and attached an updated patch based on trunk according to your reviews.

Below are answers to some important questions Ted/stack raised in the reviews 
(I have already answered some of Ted's in the comment above):

[Ted] AsyncNotifier does notification by calling syncedTillHere.notifyAll(). 
Can this part be folded into AsyncFlusher?

  ===> AsyncNotifier competes for syncedTillHere with all the write handler 
threads (which may have finished appendNoSync but are not yet pending in 
syncer()). Performance is better when AsyncSyncer (which just gets notified, 
does the 'sync', and then notifies AsyncNotifier) is separated from 
AsyncNotifier (which gets notified by AsyncSyncer and wakes up all pending 
write handler threads).

[stack] Any idea of its effect on general latencies? Does 
HLogPerformanceEvaluation help evaluate this approach? Did you deploy this 
code to production?

  ===> I didn't run HLogPerformanceEvaluation for the performance comparison; 
instead I used 5 YCSB clients concurrently pressing on a single RS backed by 
a 5-data-node HDFS. Everything is the same between the tests of the old/new 
write thread models except the RS bits. We have been testing it in the test 
cluster for a month, but it is not deployed to production yet. Below is the 
detailed performance comparison for your reference.

  a) 5 YCSB clients, each with 80 concurrent write threads (auto-flush = true)
  b) each YCSB client writes 5,000,000 rows
  c) all 20 regions of the target table are moved to a single RS

Old write thread model:

row size (bytes)   latency (ms)   QPS
-------------------------------------
2000               37.3           10715
1000               32.8           12149
500                30.9           12891
200                26.9           14803
10                 24.5           16288

New write thread model:

row size (bytes)   latency (ms)   QPS
-------------------------------------
2000               17.3           23024
1000               12.6           31523
500                11.7           33893
200                11.4           34876
10                 11.1           35804


[stack] Can I still (if only optionally) sync every write as it comes in? 
(For the paranoid.)

  ===> Not for now; I'll consider how to make it configurable later on.

[stack] Regards the above, the test is no longer valid given the indirection 
around sync/flush?

  ===> Yes, that test is no longer valid under the new write thread model.

[stack] To be clear, when we call doWrite, we just append the log edit to a 
linked list? (We call it a bufferLock but we are just appending to the linked 
list?)

  ===> Yes, in both the old and new write thread models what doWrite does is 
just append the log edit to a linked list, which plays the role of a 'local' 
buffer for log edits that haven't hit hdfs yet.

[stack] How does deferred log flush still work when you remove stuff like 
optionalFlushInterval? You say '...don't pend on HLog.syncer() waiting for 
its txid to be sync-ed', but that is different behavior from what we had here 
previously.

  ===> When I say we 'still support deferred log flush', I mean that with 
'deferred log flush' we can still report write success to the client without 
waiting/pending in syncer(txid); in this sense the AsyncWriter/AsyncSyncer 
threads do what the previous LogSyncer did, from the point of view of the 
write handler threads: clients don't wait for the write to persist before 
getting a success response.
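
For context, this is how a table opts into deferred log flush on the 0.94 
client API (setDeferredLogFlush is a real HTableDescriptor method; the table 
and family names below are illustrative):

{code:java}
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;

public class DeferredFlushTable {
  public static HTableDescriptor build() {
    HTableDescriptor htd = new HTableDescriptor("logs");
    htd.addFamily(new HColumnDescriptor("f"));
    // Handlers return without pending in syncer(); under the new model the
    // AsyncWriter/AsyncSyncer threads still persist edits in the background.
    htd.setDeferredLogFlush(true);
    return htd;
  }
}
{code}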

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, 
 which incurs heavy race conditions on updateLock and flushLock.
 The only optimization, checking whether the current syncTillHere > txid in 
 the expectation that another thread has already helped write/sync this txid 
 to hdfs (so the write/sync can be omitted), actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 70000+).
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput 
 (1K row size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.

[jira] [Updated] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-06-18 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-8755:


Attachment: HBASE-8755-trunk-V0.patch

new write thread model patch based on trunk

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-trunk-V0.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, 
 which incurs heavy race conditions on updateLock and flushLock.
 The only optimization, checking whether the current syncTillHere > txid in 
 the expectation that another thread has already helped write/sync this txid 
 to hdfs (so the write/sync can be omitted), actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 70000+).
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput 
 (1K row size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes for the new write thread model are as below:
  1 All put handler threads append their edits to HLog's local pending buffer 
 (and notify the AsyncWriter thread that there are new edits in the local 
 buffer);
  2 All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3 A single AsyncWriter thread is responsible for retrieving all the 
 buffered edits from HLog's local pending buffer and writing them to hdfs 
 (hlog.writer.append), notifying the AsyncFlusher thread that there are new 
 writes to hdfs that need a sync;
  4 A single AsyncFlusher thread is responsible for issuing a sync to hdfs to 
 persist the writes made by AsyncWriter, notifying the AsyncNotifier thread 
 that the sync watermark has increased;
  5 A single AsyncNotifier thread is responsible for notifying all pending 
 put handler threads which are waiting in the HLog.syncer() function;
  6 No LogSyncer thread any more (the always-running AsyncWriter/AsyncFlusher 
 threads do the same job it did).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-06-18 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-8755:


Attachment: HBASE-8755-0.94-V1.patch

an updated patch based on 0.94, per Ted's and stack's reviews, is attached

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, 
 HBASE-8755-trunk-V0.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, 
 which incurs heavy race conditions on updateLock and flushLock.
 The only optimization, checking whether the current syncTillHere > txid in 
 the expectation that another thread has already helped write/sync this txid 
 to hdfs (so the write/sync can be omitted), actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 70000+).
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput 
 (1K row size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes for the new write thread model are as below:
  1 All put handler threads append their edits to HLog's local pending buffer 
 (and notify the AsyncWriter thread that there are new edits in the local 
 buffer);
  2 All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3 A single AsyncWriter thread is responsible for retrieving all the 
 buffered edits from HLog's local pending buffer and writing them to hdfs 
 (hlog.writer.append), notifying the AsyncFlusher thread that there are new 
 writes to hdfs that need a sync;
  4 A single AsyncFlusher thread is responsible for issuing a sync to hdfs to 
 persist the writes made by AsyncWriter, notifying the AsyncNotifier thread 
 that the sync watermark has increased;
  5 A single AsyncNotifier thread is responsible for notifying all pending 
 put handler threads which are waiting in the HLog.syncer() function;
  6 No LogSyncer thread any more (the always-running AsyncWriter/AsyncFlusher 
 threads do the same job it did).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-06-18 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13686711#comment-13686711
 ] 

Feng Honghua commented on HBASE-8755:
-

HBASE-8755-trunk-V0.patch also includes changes according to the review 
comments from Ted/stack. Thanks again to Ted and stack for the detailed 
review :-)

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, 
 HBASE-8755-trunk-V0.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, 
 which incurs heavy race conditions on updateLock and flushLock.
 The only optimization, checking whether the current syncTillHere > txid in 
 the expectation that another thread has already helped write/sync this txid 
 to hdfs (so the write/sync can be omitted), actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 70000+).
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput 
 (1K row size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes for the new write thread model are as below:
  1 All put handler threads append their edits to HLog's local pending buffer 
 (and notify the AsyncWriter thread that there are new edits in the local 
 buffer);
  2 All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3 A single AsyncWriter thread is responsible for retrieving all the 
 buffered edits from HLog's local pending buffer and writing them to hdfs 
 (hlog.writer.append), notifying the AsyncFlusher thread that there are new 
 writes to hdfs that need a sync;
  4 A single AsyncFlusher thread is responsible for issuing a sync to hdfs to 
 persist the writes made by AsyncWriter, notifying the AsyncNotifier thread 
 that the sync watermark has increased;
  5 A single AsyncNotifier thread is responsible for notifying all pending 
 put handler threads which are waiting in the HLog.syncer() function;
  6 No LogSyncer thread any more (the always-running AsyncWriter/AsyncFlusher 
 threads do the same job it did).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-06-18 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13686818#comment-13686818
 ] 

Feng Honghua commented on HBASE-8755:
-

[~otis] Yes, QPS really means Writes Per Second here. A typo.

btw: My name is Feng Honghua, not Feng Hua :-)

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, 
 HBASE-8755-trunk-V0.patch


 In the current write model, each write handler thread (executing put()) 
 individually goes through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, 
 which incurs heavy race conditions on updateLock and flushLock.
 The only optimization, checking whether the current syncTillHere > txid in 
 the expectation that another thread has already helped write/sync this txid 
 to hdfs (so the write/sync can be omitted), actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 70000+).
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 
 31520 for 1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput 
 (1K row size) even beats that of BigTable (the Percolator paper published in 
 2011 says Bigtable's write throughput then was 31002). I can provide the 
 detailed performance test results if anyone is interested.
 The changes for the new write thread model are as below:
  1 All put handler threads append their edits to HLog's local pending buffer 
 (and notify the AsyncWriter thread that there are new edits in the local 
 buffer);
  2 All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3 A single AsyncWriter thread is responsible for retrieving all the 
 buffered edits from HLog's local pending buffer and writing them to hdfs 
 (hlog.writer.append), notifying the AsyncFlusher thread that there are new 
 writes to hdfs that need a sync;
  4 A single AsyncFlusher thread is responsible for issuing a sync to hdfs to 
 persist the writes made by AsyncWriter, notifying the AsyncNotifier thread 
 that the sync watermark has increased;
  5 A single AsyncNotifier thread is responsible for notifying all pending 
 put handler threads which are waiting in the HLog.syncer() function;
  6 No LogSyncer thread any more (the always-running AsyncWriter/AsyncFlusher 
 threads do the same job it did).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete

2013-06-18 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13686862#comment-13686862
 ] 

Feng Honghua commented on HBASE-8721:
-

Another benefit of the behavior 'a delete can't mask puts that happened after 
it' (in essence, mvcc also participating in delete handling): 'delete latest 
version' (deleteColumn() without a timestamp) can get better performance by 
removing the read operation in the RS that fetches the timestamp of the 
latest version and sets it on the delete.

Below is the updated process for 'delete latest version' (under the 'a delete 
can't mask puts that happened after it' behavior):
  
  1. deleteColumn() (without a timestamp) is issued by the client, with its 
timestamp set to an 'invalid' value (0/-1 is a good candidate) to indicate 
'delete the latest version'. The RS just puts this Delete type kv, as with 
other delete types, without a read operation.

  2. On Get/Scan, timestamp = 0/-1 tells us this delete is meant to delete 
the latest version, so we check each kv it sees. The first kv with mvcc < 
'the mvcc of this delete' is the 'latest' version as of when the delete 
entered the RS. After deleting (masking) this first kv (with its mvcc 
checked), the 'delete latest version' marker also needs to be removed from 
the ScanDeleteTracker (see the sketch after this list).

  That's all.
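
Below is a hedged sketch of the read-side check in step 2. The names (Kv, 
latestDeletes, DELETE_LATEST_TS) are hypothetical, and the real 
ScanDeleteTracker works differently today; this only illustrates the 
mvcc-based masking rule proposed above.

{code:java}
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

public class LatestVersionDeleteSketch {
  static final long DELETE_LATEST_TS = 0L;  // sentinel: "delete the latest version"

  static class Kv {
    final long ts;
    final long mvcc;
    Kv(long ts, long mvcc) { this.ts = ts; this.mvcc = mvcc; }
  }

  // pending "delete latest version" markers seen for the current column
  private final List<Kv> latestDeletes = new LinkedList<Kv>();

  void addDelete(Kv delete) {
    if (delete.ts == DELETE_LATEST_TS) {
      latestDeletes.add(delete);  // other delete types stay tracked by ts as today
    }
  }

  // Cells arrive in descending timestamp order. The first cell whose mvcc is
  // smaller than the marker's mvcc is the latest version that existed when
  // the delete entered the RS: mask it, then retire the marker.
  boolean isMasked(Kv put) {
    for (Iterator<Kv> it = latestDeletes.iterator(); it.hasNext();) {
      Kv delete = it.next();
      if (put.mvcc < delete.mvcc) {
        it.remove();
        return true;
      }
    }
    return false;
  }
}
{code}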

  Then why can't we achieve such a lightweight (read-free) 'delete latest 
version' today? The root cause is the 'a delete can mask puts that happen 
after it' behavior, which doesn't use mvcc in delete handling.

  When issuing 'delete latest version' (deleteColumn() without a timestamp), 
the real semantic is 'delete the latest one of all the currently EXISTING 
versions', where EXISTING means it happened BEFORE the delete entered the RS, 
and BEFORE is a concept of operation ordering (indicated by mvcc) which can't 
be represented by timestamp.

  Then why can't we handle 'delete latest version' without a read, as in the 
process above? Because a newer version with a bigger timestamp can still be 
put (later, by timestamp, than the 'current' latest when the delete entered 
the RS), and under the behavior 'a delete can mask puts that happened after 
it' (whose essence is to decide whether a kv is masked by a delete only by 
comparing their timestamps), a 'delete latest version' marker can't tell 
whether the first version it sees is the latest version as of when it hit 
the RS (in fact it could use mvcc to get this information, but it doesn't).

  Certainly we could use mvcc only for 'delete latest version' to get the 
(remarkable) performance gain of removing the read operation, but that 
sounds inconsistent in that we would handle deletes internally in different 
ways (one using mvcc, the others not).

 Deletes can mask puts that happen after the delete
 --

 Key: HBASE-8721
 URL: https://issues.apache.org/jira/browse/HBASE-8721
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: Feng Honghua
 Attachments: HBASE-8721-0.94-V0.patch


 this fix aims for the bug mentioned in http://hbase.apache.org/book.html 5.8.2.1:
 Deletes mask puts, even puts that happened after the delete was entered. 
 Remember that a delete writes a tombstone, which only disappears after the 
 next major compaction has run. Suppose you do a delete of everything <= T. 
 After this you do a new put with a timestamp <= T. This put, even if it 
 happened after the delete, will be masked by the delete tombstone. Performing 
 the put will not fail, but when you do a get you will notice the put had 
 no effect. It will start working again after the major compaction has run. 
 These issues should not be a problem if you use always-increasing versions 
 for new puts to a row. But they can occur even if you do not care about time: 
 just do delete and put immediately after each other, and there is some chance 
 they happen within the same millisecond.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp

2013-06-18 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-8753:


Attachment: HBASE-8753-trunk-V0.patch

[~zjushch]/[~stack] patch for trunk attached, thanks in advance for the review

 Provide new delete flag which can delete all cells under a column-family 
 which have a same designated timestamp
 ---

 Key: HBASE-8753
 URL: https://issues.apache.org/jira/browse/HBASE-8753
 Project: HBase
  Issue Type: New Feature
  Components: Deletes
Reporter: Feng Honghua
 Attachments: HBASE-8753-0.94-V0.patch, HBASE-8753-trunk-V0.patch


 In one of our production scenarios (Xiaomi message search), multiple cells 
 are put in batch using the same timestamp, with different column names, under 
 a specific column-family. 
 And after some time these cells also need to be deleted in batch, given a 
 specific timestamp. But the column names are parsed tokens which can be 
 arbitrary words, so such a batch delete is impossible without first 
 retrieving all KVs from that CF, collecting the list of columns that have a 
 KV with the given timestamp, and then issuing an individual deleteColumn for 
 each column in that list.
 Though such a batch delete is possible this way, its performance is poor, and 
 customers also find their code quite clumsy: first retrieving and populating 
 the column list, then issuing a deleteColumn for each column in it.
 This feature resolves the problem by introducing a new delete flag: 
 DeleteFamilyVersion. 
   1). When you need to delete all KVs under a column-family with a given 
 timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a 
 DeleteFamilyVersion type KV is put to HBase (like DeleteFamily / DeleteColumn 
 / Delete), without a read operation;
   2). Like other delete types, DeleteFamilyVersion takes effect in 
 get/scan/flush/compact operations: the ScanDeleteTracker now parses out and 
 uses DeleteFamilyVersion to prevent all KVs under the specific CF that have 
 the same timestamp as the DeleteFamilyVersion KV from popping up as part of a 
 get/scan result (and likewise in flush/compact).
 Our customers find this feature efficient, clean and easy to use, since it 
 does its work without knowing the exact list of column names that need to be 
 deleted. 
 This feature has been running smoothly for a couple of months in our 
 production clusters.
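
For reference, client usage with this patch applied would look roughly as
below; deleteFamilyVersion is the API the patch adds, while the table/row/CF
names and the timestamp are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DeleteFamilyVersionExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "msg_index");    // table name made up
            Delete d = new Delete(Bytes.toBytes("user123")); // row key made up
            // One DeleteFamilyVersion KV masks every column under 'tokens' that
            // carries exactly this timestamp; no prior read is needed.
            d.deleteFamilyVersion(Bytes.toBytes("tokens"), 1371456000000L);
            table.delete(d);
            table.close();
        }
    }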

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp

2013-06-18 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13687506#comment-13687506
 ] 

Feng Honghua commented on HBASE-8753:
-

Answers to some questions below:

[chunhui] Could we reuse the tag VERSION_DELETED (rather than introducing a new 
FAMILY_VERSION_DELETED)?

  ==> Per my understanding, the result tag indicates which type of delete 
masked the kv. Introducing a new tag here makes sense since it distinguishes 
itself from VERSION_DELETED, which means 'masked by a DeleteColumn'.

[chunhui] Maybe we need a better name instead of DeleteFamilyVersion

  ==> DeleteFamilyVersion is the best name I can come up with so far. We can 
keep it until you recommend a better one. :-)

[stack] Are there changes to KV missing?

  ==> No. Let me know if you feel something is missing/wrong with KV. Thanks :-)

[~lhofhansl]:
  Do you want to only delete columns of a specific version, or columns older 
than a specific version? 
  ==> The former: this patch only deletes columns of a specific version 
(without providing their column names).

  In your use-case, do you ever want to keep version X but target version X+1 
for delete?
  ==> Yes, this is exactly the effect this patch aims for.

  Will this break backwards-compatibility during rolling restarts? (because of 
the new KV type)
  ==> Yes: old RS bits will ignore DeleteFamilyVersion type KVs written by a 
new client.

 Provide new delete flag which can delete all cells under a column-family 
 which have a same designated timestamp
 ---

 Key: HBASE-8753
 URL: https://issues.apache.org/jira/browse/HBASE-8753
 Project: HBase
  Issue Type: New Feature
  Components: Deletes
Reporter: Feng Honghua
 Attachments: HBASE-8753-0.94-V0.patch, HBASE-8753-trunk-V0.patch


 In one of our production scenarios (Xiaomi message search), multiple cells 
 are put in batch using the same timestamp, with different column names, under 
 a specific column-family. 
 And after some time these cells also need to be deleted in batch, given a 
 specific timestamp. But the column names are parsed tokens which can be 
 arbitrary words, so such a batch delete is impossible without first 
 retrieving all KVs from that CF, collecting the list of columns that have a 
 KV with the given timestamp, and then issuing an individual deleteColumn for 
 each column in that list.
 Though such a batch delete is possible this way, its performance is poor, and 
 customers also find their code quite clumsy: first retrieving and populating 
 the column list, then issuing a deleteColumn for each column in it.
 This feature resolves the problem by introducing a new delete flag: 
 DeleteFamilyVersion. 
   1). When you need to delete all KVs under a column-family with a given 
 timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a 
 DeleteFamilyVersion type KV is put to HBase (like DeleteFamily / DeleteColumn 
 / Delete), without a read operation;
   2). Like other delete types, DeleteFamilyVersion takes effect in 
 get/scan/flush/compact operations: the ScanDeleteTracker now parses out and 
 uses DeleteFamilyVersion to prevent all KVs under the specific CF that have 
 the same timestamp as the DeleteFamilyVersion KV from popping up as part of a 
 get/scan result (and likewise in flush/compact).
 Our customers find this feature efficient, clean and easy to use, since it 
 does its work without knowing the exact list of column names that need to be 
 deleted. 
 This feature has been running smoothly for a couple of months in our 
 production clusters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp

2013-06-17 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13685019#comment-13685019
 ] 

Feng Honghua commented on HBASE-8753:
-

Thanks chunhui for the review, I'll make a patch for trunk soon

For the name, do you have a better alternative?

 Provide new delete flag which can delete all cells under a column-family 
 which have a same designated timestamp
 ---

 Key: HBASE-8753
 URL: https://issues.apache.org/jira/browse/HBASE-8753
 Project: HBase
  Issue Type: New Feature
  Components: Deletes
Reporter: Feng Honghua
 Attachments: HBASE-8753-0.94-V0.patch


 In one of our production scenarios (Xiaomi message search), multiple cells 
 are put in batch using the same timestamp, with different column names, under 
 a specific column-family. 
 And after some time these cells also need to be deleted in batch, given a 
 specific timestamp. But the column names are parsed tokens which can be 
 arbitrary words, so such a batch delete is impossible without first 
 retrieving all KVs from that CF, collecting the list of columns that have a 
 KV with the given timestamp, and then issuing an individual deleteColumn for 
 each column in that list.
 Though such a batch delete is possible this way, its performance is poor, and 
 customers also find their code quite clumsy: first retrieving and populating 
 the column list, then issuing a deleteColumn for each column in it.
 This feature resolves the problem by introducing a new delete flag: 
 DeleteFamilyVersion. 
   1). When you need to delete all KVs under a column-family with a given 
 timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a 
 DeleteFamilyVersion type KV is put to HBase (like DeleteFamily / DeleteColumn 
 / Delete), without a read operation;
   2). Like other delete types, DeleteFamilyVersion takes effect in 
 get/scan/flush/compact operations: the ScanDeleteTracker now parses out and 
 uses DeleteFamilyVersion to prevent all KVs under the specific CF that have 
 the same timestamp as the DeleteFamilyVersion KV from popping up as part of a 
 get/scan result (and likewise in flush/compact).
 Our customers find this feature efficient, clean and easy to use, since it 
 does its work without knowing the exact list of column names that need to be 
 deleted. 
 This feature has been running smoothly for a couple of months in our 
 production clusters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-06-17 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-8755:
---

 Summary: A new write thread model for HLog to improve the overall 
HBase write throughput
 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua


In the current write model, each write handler thread (executing put()) will 
individually go through a full 'append (hlog local buffer) => HLog writer 
append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, 
which incurs heavy lock contention on updateLock and flushLock.

The only existing optimization, checking whether the current syncTillHere > 
txid (in the hope that another thread has already written/synced this txid to 
hdfs so the write/sync can be omitted), actually helps much less than 
expected.

Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
proposed a new write thread model for writing hdfs sequence files, and the 
prototype implementation shows a 4X throughput improvement (from 17000 to 
70000+). 

I applied this new write thread model to HLog, and the performance test in our 
test cluster shows about a 3X throughput improvement (from 12150 to 31520 for 
1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput (1K row-size) 
even beats that of BigTable (the Percolator paper published in 2011 says 
Bigtable's write throughput then was 31002). I can provide the detailed 
performance test results if anyone is interested.

The change for the new write thread model is as below:
 1 All put handler threads append their edits to HLog's local pending buffer 
(and notify the AsyncWriter thread that there are new edits in the buffer);
 2 All put handler threads wait in the HLog.syncer() function for the 
underlying threads to finish the sync that covers their txid;
 3 A single AsyncWriter thread is responsible for retrieving all the buffered 
edits from HLog's local pending buffer and writing them to hdfs 
(hlog.writer.append); (it notifies the AsyncFlusher thread that there are new 
writes to hdfs that need a sync)
 4 A single AsyncFlusher thread is responsible for issuing a sync to hdfs to 
persist the writes made by AsyncWriter; (it notifies the AsyncNotifier thread 
that the sync watermark has increased)
 5 A single AsyncNotifier thread is responsible for notifying all pending put 
handler threads that are waiting in the HLog.syncer() function;
 6 There is no LogSyncer thread any more (the AsyncWriter/AsyncFlusher threads 
already do the job it did).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-06-17 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-8755:


Attachment: HBASE-8755-0.94-V0.patch

the patch HBASE-8755-0.94-V0.patch is based on 
http://svn.apache.org/repos/asf/hbase/branches/0.94

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch


 In the current write model, each write handler thread (executing put()) will 
 individually go through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, 
 which incurs heavy lock contention on updateLock and flushLock.
 The only existing optimization, checking whether the current syncTillHere > 
 txid (in the hope that another thread has already written/synced this txid to 
 hdfs so the write/sync can be omitted), actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 70000+). 
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 31520 
 for 1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput (1K 
 row-size) even beats that of BigTable (the Percolator paper published in 2011 
 says Bigtable's write throughput then was 31002). I can provide the detailed 
 performance test results if anyone is interested.
 The change for the new write thread model is as below:
  1 All put handler threads append their edits to HLog's local pending buffer 
 (and notify the AsyncWriter thread that there are new edits in the buffer);
  2 All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3 A single AsyncWriter thread is responsible for retrieving all the buffered 
 edits from HLog's local pending buffer and writing them to hdfs 
 (hlog.writer.append); (it notifies the AsyncFlusher thread that there are new 
 writes to hdfs that need a sync)
  4 A single AsyncFlusher thread is responsible for issuing a sync to hdfs 
 to persist the writes made by AsyncWriter; (it notifies the AsyncNotifier 
 thread that the sync watermark has increased)
  5 A single AsyncNotifier thread is responsible for notifying all pending 
 put handler threads that are waiting in the HLog.syncer() function;
  6 There is no LogSyncer thread any more (the AsyncWriter/AsyncFlusher 
 threads already do the job it did).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7280) TableNotFoundException thrown in peer cluster will incur endless retry for shipEdits, which in turn block following normal replication

2013-06-17 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13685495#comment-13685495
 ] 

Feng Honghua commented on HBASE-7280:
-

[~jeason] You can refer to HBASE-8751 for per-peer cf/table granularity 
replication

 TableNotFoundException thrown in peer cluster will incur endless retry for 
 shipEdits, which in turn block following normal replication
 --

 Key: HBASE-7280
 URL: https://issues.apache.org/jira/browse/HBASE-7280
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.94.2
Reporter: Feng Honghua
   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 in cluster replication, if the master cluster has 2 tables whose 
 column-families are declared with replication scope = 1, and a peer cluster 
 is added which has only 1 table with the same name as in the master cluster, 
 then in the ReplicationSource (a thread in the master cluster) for this peer, 
 edits (logs) for both tables will be shipped to the peer. The peer will fail 
 to apply the edits due to TableNotFoundException, this exception will be 
 returned to the original shipper (the ReplicationSource in the master 
 cluster), and the shipper will fall into an endless retry of shipping the 
 failed edits without proceeding to read the remaining (newer) log files and 
 ship the following edits (perhaps the normal, expected edits for the 
 registered table). The symptom looks like the TableNotFoundException incurs 
 endless retry and blocks normal table replication.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete

2013-06-17 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13685655#comment-13685655
 ] 

Feng Honghua commented on HBASE-8721:
-

Thanks guys for your feedback: [~apurtell], [~sershe], [~stack], [~lhofhansl]

I summarize the issues/proposals below:

A). We all agree this IS a bug (a repro sketch follows this list):
  1 put a kv (timestamp = T0), and flush;
  2 delete that kv using a DeleteColumn type kv with timestamp T0 (or any 
timestamp >= T0), and flush;
  3 a major compaction occurs [or not];
  4 put that kv again (timestamp = T0);
  5 read that kv;

  a) if a major compaction occurs at step 3, then step 5 will get the put 
written at step 4;
  b) if no major compaction occurs at step 3, then step 5 gets nothing.
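
A repro sketch of scenario A) against a 0.94-style client API (table/CF/
qualifier names and the T0 value are made up; note majorCompact is
asynchronous in practice, so a real repro would wait for it to complete):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DeleteMaskRepro {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTable t = new HTable(conf, "repro");
            byte[] row = Bytes.toBytes("r"), cf = Bytes.toBytes("f"),
                   q = Bytes.toBytes("q"), v = Bytes.toBytes("v");
            long T0 = 1000L;

            Put p = new Put(row);
            p.add(cf, q, T0, v);
            t.put(p);                       // step 1
            admin.flush("repro");

            Delete d = new Delete(row);
            d.deleteColumn(cf, q, T0);      // step 2: delete at T0
            t.delete(d);
            admin.flush("repro");

            admin.majorCompact("repro");    // step 3 (comment out for case b)

            t.put(p);                       // step 4: same kv, same T0
            Result r = t.get(new Get(row)); // step 5: empty in case b)
            System.out.println(r.isEmpty() ? "masked" : "visible");
        }
    }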

B). [~stack] proposes keeping all deleted cells. This can be achieved either 
by turning on KeepDeletedCells for the ColumnFamilies or by degenerating major 
compactions into minor compactions (I guess you mean the former). But both 
options result in a bigger data size than expected.

C). [~lhofhansl] suggests introducing a config for a Table/CF that disallows 
clients from setting timestamps on puts. As a config, it means clients can 
still create tables/CFs that allow explicitly set timestamps, and for those 
tables/CFs the bug of A) still exists.

D). As [~lhofhansl] said, the timestamp is part of the schema; it's visible to 
and can be set by the client, hence the client can exploit it for more general 
usage. By 'general' I mean it isn't limited to 'time' semantics, but serves as 
an ordinary dimension of a cell's coordinate. Such treatment can lead to many 
innovative schema designs that address more complicated real-world problems. 

  Facebook uses msg-id as the timestamp in their message search index CF. When 
the timestamp is used as an ordinary dimension of a cell's coordinate, a cell 
naturally has only one 'version' in the app context, and the CF usually sets 
MaxVersions (in the HBase context) to the maximum to accommodate as many 
different cells as possible. A client who puts the timestamp to such general 
usage takes care of all the subtlety derived from this semantic change.

  Facebook's design details can be found in the book 'HBase: The Definitive 
Guide' - Chapter 9 Advanced Usage - Search Integration (page 374) or in this 
blog: 
http://www.facebook.com/notes/facebook-engineering/inside-facebook-messages-application-server/10150162742108920.

  Disabling client-set timestamps, or limiting the timestamp to 'time' 
semantics only, would prohibit such innovative usage. As said, a good 
language/platform/product encourages and enables innovative extensions and 
usages beyond the original designer's imagination. We do expect HBase to be 
such a platform/product, right?

E). [~apurtell] said: "This section of the book describes expected behavior. 
This is not a bug."

  I disagree. That section's title explicitly says it's 'current limitations' 
and explains in detail why. It is by nature not an acceptable behaviour; it 
runs counter to common sense and intuition. It only seems like 'expected 
behaviour' because it has existed from the very beginning.

F). [~lhofhansl] said: "HBase allows you to set the timestamps to influence 
the logical order in which things (are declared to have) happened. If you do 
not want strange behavior do not date Deletes into the future and Puts into 
the past. Period."

  As in bug A), strange behaviour occurs even when the Delete and the Put 
carry the very same timestamp, one dated into the future and the other into 
the past. (We allow setting the timestamp, and we do set it.) We get strange 
(buggy) behaviour when we put - delete - put - get that very same KV with that 
same timestamp. Isn't that weird?

G). [~lhofhansl] said: "If we did not have that, as-of-time queries would be 
broken and we would break the idempotent nature of operations in HBase."

  For the 'idempotent nature of operations in HBase', my understanding is that 
a series of Puts (or Deletes) of the same cell (exactly the same 
coordinate:value) will eventually produce the same result. But this is 
expected to break when the Puts are interleaved by Deletes (or the Deletes by 
Puts); such a break of idempotence is acceptable in my opinion.
  Even if we don't change the behaviour 'Deletes can mask puts that happen 
after the delete', the scenario in A) still breaks idempotence: we put the 
very same cell multiple times, but the results can turn out different when the 
puts are interleaved by Deletes (combined with the effect of major compaction).

H). Since HBase is modeled after BigTable, it makes sense to align the Delete 
behaviour here with BigTable, right?

I). Lastly, I think we need to keep an open mind on this issue, rather than 
just suggesting workarounds at the cost of HBase's inherent flexibility.

 Deletes can mask puts that happen after the delete
 --

 Key: HBASE-8721
 URL: 

[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

2013-06-17 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13685671#comment-13685671
 ] 

Feng Honghua commented on HBASE-8755:
-

Thanks [~yuzhih...@gmail.com] for the detailed code review; answers to the 
important questions below:

1) Deferred log sync is still supported, in which case the write handler 
thread in Region doesn't pend in HLog.syncer() waiting for its txid to be 
synced. This behaviour stays intact.

2) Good catch: moving "if (txid <= this.pendingTxid) return;" in 
setPendingTxid() into the "synchronized (this.writeLock) {...}" below it is OK. 
   bufferLock is used only to guarantee that access to HLog's local pending 
buffer and unflushedEntries can't be interleaved by multiple threads.

3) "failedTxid is still assigned to this.lastWrittenTxid. Is that safe?"
  Yes: all writes pending in syncer() with txid <= failedTxid will get a 
failure. lastWrittenTxid can safely proceed without incorrect behaviour, and 
it needs to proceed in order to eventually wake up the pending write handler 
threads.

I'll update the patch per your review tomorrow.
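
A sketch of the fix discussed in 2), with the early-return check moved under
writeLock (field names assumed to mirror the review thread, not necessarily
the patch):

    class PendingTxidSketch {
        private final Object writeLock = new Object(); // also guards pendingTxid
        private long pendingTxid = 0;

        /** The early-return check now sits inside writeLock, so the pending
         *  watermark can never move backwards between check and update. */
        void setPendingTxid(long txid) {
            synchronized (writeLock) {
                if (txid <= pendingTxid) return; // a later txid is already queued
                pendingTxid = txid;
                writeLock.notify();              // wake the single AsyncWriter
            }
        }
    }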

 A new write thread model for HLog to improve the overall HBase write 
 throughput
 ---

 Key: HBASE-8755
 URL: https://issues.apache.org/jira/browse/HBASE-8755
 Project: HBase
  Issue Type: Improvement
  Components: wal
Reporter: Feng Honghua
 Attachments: HBASE-8755-0.94-V0.patch


 In the current write model, each write handler thread (executing put()) will 
 individually go through a full 'append (hlog local buffer) => HLog writer 
 append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, 
 which incurs heavy lock contention on updateLock and flushLock.
 The only existing optimization, checking whether the current syncTillHere > 
 txid (in the hope that another thread has already written/synced this txid to 
 hdfs so the write/sync can be omitted), actually helps much less than 
 expected.
 Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
 proposed a new write thread model for writing hdfs sequence files, and the 
 prototype implementation shows a 4X throughput improvement (from 17000 to 
 70000+). 
 I applied this new write thread model to HLog, and the performance test in 
 our test cluster shows about a 3X throughput improvement (from 12150 to 31520 
 for 1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput (1K 
 row-size) even beats that of BigTable (the Percolator paper published in 2011 
 says Bigtable's write throughput then was 31002). I can provide the detailed 
 performance test results if anyone is interested.
 The change for the new write thread model is as below:
  1 All put handler threads append their edits to HLog's local pending buffer 
 (and notify the AsyncWriter thread that there are new edits in the buffer);
  2 All put handler threads wait in the HLog.syncer() function for the 
 underlying threads to finish the sync that covers their txid;
  3 A single AsyncWriter thread is responsible for retrieving all the buffered 
 edits from HLog's local pending buffer and writing them to hdfs 
 (hlog.writer.append); (it notifies the AsyncFlusher thread that there are new 
 writes to hdfs that need a sync)
  4 A single AsyncFlusher thread is responsible for issuing a sync to hdfs 
 to persist the writes made by AsyncWriter; (it notifies the AsyncNotifier 
 thread that the sync watermark has increased)
  5 A single AsyncNotifier thread is responsible for notifying all pending 
 put handler threads that are waiting in the HLog.syncer() function;
  6 There is no LogSyncer thread any more (the AsyncWriter/AsyncFlusher 
 threads already do the job it did).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete

2013-06-17 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13685699#comment-13685699
 ] 

Feng Honghua commented on HBASE-8721:
-

[~apurtell] Sorry if I offended, and thanks again for your patient comment. 
All we need here is to clarify the issue and make HBase better.

"I don't think that is a correct statement."
==> You mean the behaviour of A) is correct and acceptable?

"Can you say a bit more about what exactly your clients are doing with 
timestamps?"
==> We have a similar usage of timestamps as Facebook; I provided the 
link/reference describing how Facebook uses timestamps for their message 
search index. In short, the msg-id is used as the timestamp to provide a 
reverse link from a term/token appearing in a msg back to its original msg. 
When a msg is deleted, all the kvs of its related terms/tokens are deleted as 
well; and when the user restores the deleted msg from the deleted-folder, the 
term/token kvs are inserted again. But with the current delete behaviour, the 
re-inserting of term/token kvs for the restored msg can be inconsistent, the 
same scenario as A).

For [~sershe]'s explanation:

"If you are setting explicit timestamps, you are explicitly telling HBase that 
it should withhold judgement about versions because you know what happens 
logically before and after in your system. If you are using timestamp 
otherwise for some convenience, you are misusing it."

==> We set explicit timestamps and we don't want judgement about versions 
(refer to the description of our scenarios below), but the behaviour 'Deletes 
mask puts that happen after the delete' puts us in a difficult situation.
  Actually, if we set an explicit timestamp, the timestamp can't be the 
'current' time when the put hits the RS, so such a timestamp can seldom carry 
'time' semantics, since it's inaccurate for time ordering. So doesn't "if you 
are using timestamp otherwise for some convenience, you are misusing it" 
almost equal "setting explicit timestamps is misusing it"?

"If this version semantic is removed, timestamp becomes simply a long tucked 
onto a KeyValue and should be removed; after all, we don't have a string or a 
boolean also added to KeyValue so that people could use them for their 
purposes. HBase already has columns and column families to do that. Timestamp 
has very explicit semantics and purpose right now. If you want time-based 
behavior then don't set timestamps and HBase will use time-based behavior."

==> Another 'long' tucked onto a KeyValue is not unnecessary, even though 
HBase already has columns and column-families. In the Facebook message search 
index scenario, using msg-id as the timestamp is an innovative way to build 
the reverse lookup index atomically by leveraging the row transaction. 
Otherwise the reverse lookup index couldn't be built atomically, since the msg 
and the msg-search-index of a given user can span multiple rows.
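
As a loose sketch of that pattern (the row layout, CF name and helper are made
up for illustration; only Put.add(family, qualifier, ts, value) is a real 0.94
client call), indexing one message's tokens with msg-id in the timestamp slot
might look like:

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MsgIndexPut {
        /** Build the index Put for one message: row = user, one column per
         *  token, timestamp = msg-id. All names/layout here are made up. */
        static Put indexPut(String user, long msgId, String[] tokens) {
            byte[] cf = Bytes.toBytes("tokens");
            Put p = new Put(Bytes.toBytes(user));
            for (String token : tokens) {
                // The value links back to the message; the ts slot carries the
                // msg-id, so index cells commit in the same row transaction
                // as other cells of the same user row.
                p.add(cf, Bytes.toBytes(token), msgId, Bytes.toBytes(msgId));
            }
            return p;
        }
    }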

 Deletes can mask puts that happen after the delete
 --

 Key: HBASE-8721
 URL: https://issues.apache.org/jira/browse/HBASE-8721
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: Feng Honghua
 Attachments: HBASE-8721-0.94-V0.patch


 this fix aims for bug mentioned in http://hbase.apache.org/book.html 5.8.2.1:
 Deletes mask puts, even puts that happened after the delete was entered. 
 Remember that a delete writes a tombstone, which only disappears after then 
 next major compaction has run. Suppose you do a delete of everything = T. 
 After this you do a new put with a timestamp = T. This put, even if it 
 happened after the delete, will be masked by the delete tombstone. Performing 
 the put will not fail, but when you do a get you will notice the put did have 
 no effect. It will start working again after the major compaction has run. 
 These issues should not be a problem if you use always-increasing versions 
 for new puts to a row. But they can occur even if you do not care about time: 
 just do delete and put immediately after each other, and there is some chance 
 they happen within the same millisecond.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp

2013-06-17 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13685702#comment-13685702
 ] 

Feng Honghua commented on HBASE-8753:
-

OK, thanks [~yuzhih...@gmail.com] for the code review.

Was deleteFamilyStamp the previous name for this feature? 

==> No, I just couldn't come up with a better name for that log message at the 
time: deleteFamilyVersion carries no 'timestamp' meaning, and 
deleteFamilyVersionStamp is too long, so I dropped 'Version' and kept 'Stamp' 
to indicate 'timestamp'...

 Provide new delete flag which can delete all cells under a column-family 
 which have a same designated timestamp
 ---

 Key: HBASE-8753
 URL: https://issues.apache.org/jira/browse/HBASE-8753
 Project: HBase
  Issue Type: New Feature
  Components: Deletes
Reporter: Feng Honghua
 Attachments: HBASE-8753-0.94-V0.patch


 In one of our production scenarios (Xiaomi message search), multiple cells 
 are put in batch using the same timestamp, with different column names, under 
 a specific column-family. 
 And after some time these cells also need to be deleted in batch, given a 
 specific timestamp. But the column names are parsed tokens which can be 
 arbitrary words, so such a batch delete is impossible without first 
 retrieving all KVs from that CF, collecting the list of columns that have a 
 KV with the given timestamp, and then issuing an individual deleteColumn for 
 each column in that list.
 Though such a batch delete is possible this way, its performance is poor, and 
 customers also find their code quite clumsy: first retrieving and populating 
 the column list, then issuing a deleteColumn for each column in it.
 This feature resolves the problem by introducing a new delete flag: 
 DeleteFamilyVersion. 
   1). When you need to delete all KVs under a column-family with a given 
 timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a 
 DeleteFamilyVersion type KV is put to HBase (like DeleteFamily / DeleteColumn 
 / Delete), without a read operation;
   2). Like other delete types, DeleteFamilyVersion takes effect in 
 get/scan/flush/compact operations: the ScanDeleteTracker now parses out and 
 uses DeleteFamilyVersion to prevent all KVs under the specific CF that have 
 the same timestamp as the DeleteFamilyVersion KV from popping up as part of a 
 get/scan result (and likewise in flush/compact).
 Our customers find this feature efficient, clean and easy to use, since it 
 does its work without knowing the exact list of column names that need to be 
 deleted. 
 This feature has been running smoothly for a couple of months in our 
 production clusters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete

2013-06-17 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13686302#comment-13686302
 ] 

Feng Honghua commented on HBASE-8721:
-

Some more drawbacks of [~lhofhansl]'s proposed config disallowing clients from 
setting timestamps explicitly:

  1). There is no easy way to delete a specific version other than the latest: 
the client needs to read all versions out, get the timestamp of the version he 
wants to delete, and then issue deleteColumn(). The reason is that the client 
doesn't know the exact timestamp of each version at Put time.

  2). Performance is poor for deleting a single version (rather than all 
versions of a cell): every versioned delete needs to read the timestamp before 
deleting; the deleteColumn() without timestamp, which deletes the latest 
version, also needs to read the latest timestamp in the RS, though 
transparently to the client.

  3). If puts don't set timestamps, multiple puts of the same KV (the same 
from the client's perspective) get different timestamps when hitting the RS 
and are actually not the same KV from HBase's perspective; they occupy 
multiple versions which knock out earlier 'real' versions. Repeated puts of a 
'same' KV (without timestamp) from the client can thus result in a different 
'version list' for that cell in HBase. This is not idempotent in the strict 
sense.

  4). Even when explicitly-set timestamps are disallowed, strange behavior can 
still arise from clock skew or from the timestamp's time granularity 
(Puts/Deletes can share the same millisecond timestamp). HBASE-2256 is an 
example of the latter.

 Deletes can mask puts that happen after the delete
 --

 Key: HBASE-8721
 URL: https://issues.apache.org/jira/browse/HBASE-8721
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: Feng Honghua
 Attachments: HBASE-8721-0.94-V0.patch


 this fix aims for the bug mentioned in http://hbase.apache.org/book.html 5.8.2.1:
 Deletes mask puts, even puts that happened after the delete was entered. 
 Remember that a delete writes a tombstone, which only disappears after the 
 next major compaction has run. Suppose you do a delete of everything <= T. 
 After this you do a new put with a timestamp <= T. This put, even if it 
 happened after the delete, will be masked by the delete tombstone. Performing 
 the put will not fail, but when you do a get you will notice the put had 
 no effect. It will start working again after the major compaction has run. 
 These issues should not be a problem if you use always-increasing versions 
 for new puts to a row. But they can occur even if you do not care about time: 
 just do delete and put immediately after each other, and there is some chance 
 they happen within the same millisecond.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8721) Deletes can mask puts that happen after the delete

2013-06-17 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13686327#comment-13686327
 ] 

Feng Honghua commented on HBASE-8721:
-

  On the contrary, I personally think allowing the client to explicitly set 
timestamps gives the client the ability to EXACTLY control the version 
semantics, without the impact of clock skew or HBASE-2256. By explicitly 
setting the timestamp of each KV he puts, the client knows exactly, at any 
point in time, which versions will survive, without worrying about exceptional 
cases such as clock skew or HBASE-2256.

  Below are the 'acceptable' behaviors regarding delete/version from my point 
of view (a compact sketch follows):

  1 A version is determined by 'timestamp' only (same as the current 
semantics); HBase determines which versions survive (in Scan/Compact etc.) 
only by timestamp.
  
  2 A delete can only mask puts that happened before it ('before' here is 
measured by the vector clock, mvcc in HBase, not by timestamp). All puts that 
happened before a delete are candidates to be masked by that delete, but 
whether a candidate put is actually masked further depends on whether the 
put's timestamp is smaller than or equal to the delete's timestamp. 

  So the delete's semantics are: delete an existing exact version 
(deleteColumn) or all existing smaller versions (deleteColumns / deleteFamily).

  These two version/delete semantics have no conflict.
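
Expressed as a predicate over stand-in types (ts the timestamp dimension, mvcc
the operation order), the proposed rule is:

    // Stand-in for a kv/delete: ts = timestamp dimension, mvcc = operation order.
    class Op { long ts; long mvcc; }

    class ProposedMaskingRule {
        /** deleteColumn: mask the one existing version with exactly this ts. */
        static boolean masksExactVersion(Op delete, Op put) {
            return put.mvcc < delete.mvcc && put.ts == delete.ts;
        }

        /** deleteColumns / deleteFamily: mask all existing versions with ts <=
         *  the delete's ts; 'existing' means put before the delete in mvcc
         *  order, so puts that happen after the delete are never masked. */
        static boolean masksOlderVersions(Op delete, Op put) {
            return put.mvcc < delete.mvcc && put.ts <= delete.ts;
        }
    }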

 Deletes can mask puts that happen after the delete
 --

 Key: HBASE-8721
 URL: https://issues.apache.org/jira/browse/HBASE-8721
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: Feng Honghua
 Attachments: HBASE-8721-0.94-V0.patch


 this fix aims for bug mentioned in http://hbase.apache.org/book.html 5.8.2.1:
 Deletes mask puts, even puts that happened after the delete was entered. 
 Remember that a delete writes a tombstone, which only disappears after then 
 next major compaction has run. Suppose you do a delete of everything = T. 
 After this you do a new put with a timestamp = T. This put, even if it 
 happened after the delete, will be masked by the delete tombstone. Performing 
 the put will not fail, but when you do a get you will notice the put did have 
 no effect. It will start working again after the major compaction has run. 
 These issues should not be a problem if you use always-increasing versions 
 for new puts to a row. But they can occur even if you do not care about time: 
 just do delete and put immediately after each other, and there is some chance 
 they happen within the same millisecond.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-8751) Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster

2013-06-16 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-8751:
---

 Summary: Enable peer cluster to choose/change the 
ColumnFamilies/Tables it really want to replicate from a source cluster
 Key: HBASE-8751
 URL: https://issues.apache.org/jira/browse/HBASE-8751
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Reporter: Feng Honghua


Consider these scenarios (all CFs have replication-scope=1):

1) cluster S has 3 tables: table A has cfA,cfB; table B has cfX,cfY; table C 
has cf1,cf2.

2) cluster X wants to replicate table A : cfA, table B : cfX and table C from 
cluster S.

3) cluster Y wants to replicate table B : cfY, table C : cf2 from cluster S.

The current replication implementation can't achieve this since it pushes the 
data of all the replicatable column-families from cluster S to all its peers, 
X/Y in this scenario.

This improvement provides a fine-grained replication scheme which enables a 
peer cluster to choose the column-families/tables it really wants from the 
source cluster:

A). Set the table:cf-list for a peer when adding it via add_peer:
  hbase-shell add_peer '3', "zk:1100:/hbase", "table1; table2:cf1,cf2; 
table3:cf2"

B). View the table:cf-list config for a peer using show_peer_tableCFs:
  hbase-shell show_peer_tableCFs '1'

C). Change/set the table:cf-list for a peer using set_peer_tableCFs:
  hbase-shell set_peer_tableCFs '2', "table1:cfX; table2:cf1; table3:cf1,cf2"

In this scheme, replication-scope=1 only means a column-family CAN be 
replicated to other clusters; the 'table:cf-list' determines WHICH cfs/tables 
will actually be replicated to a specific peer.

For backward compatibility, an empty 'table:cf-list' replicates all 
replicatable cfs/tables. (This means we don't allow a peer which replicates 
nothing from a source cluster; we think that's reasonable: if replicating 
nothing, why bother adding a peer?)

This improvement addresses the exact problem raised by the first FAQ in 
http://hbase.apache.org/replication.html:
  "GLOBAL means replicate? Any provision to replicate only to cluster X and 
not to cluster Y? or is that for later?"
  "Yes, this is for much later."

I also noticed somebody mentioned making replication-scope an integer rather 
than a boolean for such fine-grained replication purposes, but I think 
extending replication-scope can't achieve the same replication granularity 
and flexibility as the per-peer replication configuration above.

This improvement has been running smoothly in our production clusters (Xiaomi) 
for several months.
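
To illustrate the spec format, here is a rough sketch (not the patch's actual
code) of parsing the 'table1; table2:cf1,cf2' string into a per-peer map and
checking whether a given table:cf should ship:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TableCfSpec {
        /** Parse "table1; table2:cf1,cf2; table3:cf2" into table -> CF list;
         *  an empty CF list means every replicatable CF of that table. */
        static Map<String, List<String>> parse(String spec) {
            Map<String, List<String>> tableCFs = new HashMap<String, List<String>>();
            for (String entry : spec.split(";")) {
                String[] parts = entry.trim().split(":");
                List<String> cfs = new ArrayList<String>();
                if (parts.length > 1) {
                    for (String cf : parts[1].split(",")) cfs.add(cf.trim());
                }
                tableCFs.put(parts[0].trim(), cfs);
            }
            return tableCFs;
        }

        /** A CF ships to the peer only if its table is listed and the CF list
         *  is empty (whole table) or contains it; replication-scope=1 must
         *  also hold, as the description above notes. */
        static boolean peerWants(Map<String, List<String>> tableCFs,
                                 String table, String cf) {
            List<String> cfs = tableCFs.get(table);
            return cfs != null && (cfs.isEmpty() || cfs.contains(cf));
        }

        public static void main(String[] args) {
            Map<String, List<String>> m = parse("table1; table2:cf1,cf2; table3:cf2");
            System.out.println(peerWants(m, "table2", "cf1")); // true
            System.out.println(peerWants(m, "table3", "cf1")); // false
        }
    }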

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8751) Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster

2013-06-16 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-8751:


Attachment: HBASE-8751-0.94-V0.patch

this patch is based on code from 
http://svn.apache.org/repos/asf/hbase/branches/0.94

 Enable peer cluster to choose/change the ColumnFamilies/Tables it really want 
 to replicate from a source cluster
 

 Key: HBASE-8751
 URL: https://issues.apache.org/jira/browse/HBASE-8751
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Reporter: Feng Honghua
 Attachments: HBASE-8751-0.94-V0.patch


 Consider these scenarios (all CFs have replication-scope=1):
 1) cluster S has 3 tables: table A has cfA,cfB; table B has cfX,cfY; table C 
 has cf1,cf2.
 2) cluster X wants to replicate table A : cfA, table B : cfX and table C from 
 cluster S.
 3) cluster Y wants to replicate table B : cfY, table C : cf2 from cluster S.
 The current replication implementation can't achieve this since it pushes the 
 data of all the replicatable column-families from cluster S to all its peers, 
 X/Y in this scenario.
 This improvement provides a fine-grained replication scheme which enables a 
 peer cluster to choose the column-families/tables it really wants from the 
 source cluster:
 A). Set the table:cf-list for a peer when adding it via add_peer:
   hbase-shell add_peer '3', "zk:1100:/hbase", "table1; table2:cf1,cf2; 
 table3:cf2"
 B). View the table:cf-list config for a peer using show_peer_tableCFs:
   hbase-shell show_peer_tableCFs '1'
 C). Change/set the table:cf-list for a peer using set_peer_tableCFs:
   hbase-shell set_peer_tableCFs '2', "table1:cfX; table2:cf1; table3:cf1,cf2"
 In this scheme, replication-scope=1 only means a column-family CAN be 
 replicated to other clusters; the 'table:cf-list' determines WHICH cfs/tables 
 will actually be replicated to a specific peer.
 For backward compatibility, an empty 'table:cf-list' replicates all 
 replicatable cfs/tables. (This means we don't allow a peer which replicates 
 nothing from a source cluster; we think that's reasonable: if replicating 
 nothing, why bother adding a peer?)
 This improvement addresses the exact problem raised by the first FAQ in 
 http://hbase.apache.org/replication.html:
   "GLOBAL means replicate? Any provision to replicate only to cluster X and 
 not to cluster Y? or is that for later?"
   "Yes, this is for much later."
 I also noticed somebody mentioned making replication-scope an integer rather 
 than a boolean for such fine-grained replication purposes, but I think 
 extending replication-scope can't achieve the same replication granularity 
 and flexibility as the per-peer replication configuration above.
 This improvement has been running smoothly in our production clusters 
 (Xiaomi) for several months.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp

2013-06-16 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-8753:
---

 Summary: Provide new delete flag which can delete all cells under 
a column-family which have a same designated timestamp
 Key: HBASE-8753
 URL: https://issues.apache.org/jira/browse/HBASE-8753
 Project: HBase
  Issue Type: New Feature
  Components: Deletes
Reporter: Feng Honghua


In one of our production scenarios (Xiaomi message search), multiple cells are 
put in batch using the same timestamp, with different column names, under a 
specific column-family. 

And after some time these cells also need to be deleted in batch, given a 
specific timestamp. But the column names are parsed tokens which can be 
arbitrary words, so such a batch delete is impossible without first retrieving 
all KVs from that CF, collecting the list of columns that have a KV with the 
given timestamp, and then issuing an individual deleteColumn for each column 
in that list.

Though such a batch delete is possible this way, its performance is poor, and 
customers also find their code quite clumsy: first retrieving and populating 
the column list, then issuing a deleteColumn for each column in it.

This feature resolves the problem by introducing a new delete flag: 
DeleteFamilyVersion. 

  1). When you need to delete all KVs under a column-family with a given 
timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a 
DeleteFamilyVersion type KV is put to HBase (like DeleteFamily / DeleteColumn / 
Delete), without a read operation;

  2). Like other delete types, DeleteFamilyVersion takes effect in 
get/scan/flush/compact operations: the ScanDeleteTracker now parses out and 
uses DeleteFamilyVersion to prevent all KVs under the specific CF that have 
the same timestamp as the DeleteFamilyVersion KV from popping up as part of a 
get/scan result (and likewise in flush/compact).

Our customers find this feature efficient, clean and easy to use, since it 
does its work without knowing the exact list of column names that need to be 
deleted. 

This feature has been running smoothly for a couple of months in our 
production clusters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8753) Provide new delete flag which can delete all cells under a column-family which have a same designated timestamp

2013-06-16 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-8753:


Attachment: HBASE-8753-0.94-V0.patch

patch HBASE-8753-0.94-V0.patch is based on 
http://svn.apache.org/repos/asf/hbase/branches/0.94

 Provide new delete flag which can delete all cells under a column-family 
 which have a same designated timestamp
 ---

 Key: HBASE-8753
 URL: https://issues.apache.org/jira/browse/HBASE-8753
 Project: HBase
  Issue Type: New Feature
  Components: Deletes
Reporter: Feng Honghua
 Attachments: HBASE-8753-0.94-V0.patch


 In one of our production scenarios (Xiaomi message search), multiple cells 
 are put in batch using the same timestamp, with different column names, under 
 a specific column-family. 
 And after some time these cells also need to be deleted in batch, given a 
 specific timestamp. But the column names are parsed tokens which can be 
 arbitrary words, so such a batch delete is impossible without first 
 retrieving all KVs from that CF, collecting the list of columns that have a 
 KV with the given timestamp, and then issuing an individual deleteColumn for 
 each column in that list.
 Though such a batch delete is possible this way, its performance is poor, and 
 customers also find their code quite clumsy: first retrieving and populating 
 the column list, then issuing a deleteColumn for each column in it.
 This feature resolves the problem by introducing a new delete flag: 
 DeleteFamilyVersion. 
   1). When you need to delete all KVs under a column-family with a given 
 timestamp, just call Delete.deleteFamilyVersion(cfName, timestamp); only a 
 DeleteFamilyVersion type KV is put to HBase (like DeleteFamily / DeleteColumn 
 / Delete), without a read operation;
   2). Like other delete types, DeleteFamilyVersion takes effect in 
 get/scan/flush/compact operations: the ScanDeleteTracker now parses out and 
 uses DeleteFamilyVersion to prevent all KVs under the specific CF that have 
 the same timestamp as the DeleteFamilyVersion KV from popping up as part of a 
 get/scan result (and likewise in flush/compact).
 Our customers find this feature efficient, clean and easy to use, since it 
 does its work without knowing the exact list of column names that need to be 
 deleted. 
 This feature has been running smoothly for a couple of months in our 
 production clusters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8751) Enable peer cluster to choose/change the ColumnFamilies/Tables it really want to replicate from a source cluster

2013-06-16 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13684933#comment-13684933
 ] 

Feng Honghua commented on HBASE-8751:
-

Showing up in the 'table:cf-list' doesn't guarantee a CF will be replicated: 
only CFs with replication-scope=1 in the 'table:cf-list' will be replicated. 

 Enable peer cluster to choose/change the ColumnFamilies/Tables it really want 
 to replicate from a source cluster
 

 Key: HBASE-8751
 URL: https://issues.apache.org/jira/browse/HBASE-8751
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Reporter: Feng Honghua
 Attachments: HBASE-8751-0.94-V0.patch


 Consider these scenarios (all CFs have replication-scope=1):
 1) cluster S has 3 tables: table A has cfA,cfB; table B has cfX,cfY; table C 
 has cf1,cf2.
 2) cluster X wants to replicate table A : cfA, table B : cfX and table C from 
 cluster S.
 3) cluster Y wants to replicate table B : cfY, table C : cf2 from cluster S.
 The current replication implementation can't achieve this since it pushes the 
 data of all the replicatable column-families from cluster S to all its peers, 
 X/Y in this scenario.
 This improvement provides a fine-grained replication scheme which enables a 
 peer cluster to choose the column-families/tables it really wants from the 
 source cluster:
 A). Set the table:cf-list for a peer when adding it via add_peer:
   hbase-shell add_peer '3', "zk:1100:/hbase", "table1; table2:cf1,cf2; 
 table3:cf2"
 B). View the table:cf-list config for a peer using show_peer_tableCFs:
   hbase-shell show_peer_tableCFs '1'
 C). Change/set the table:cf-list for a peer using set_peer_tableCFs:
   hbase-shell set_peer_tableCFs '2', "table1:cfX; table2:cf1; table3:cf1,cf2"
 In this scheme, replication-scope=1 only means a column-family CAN be 
 replicated to other clusters; the 'table:cf-list' determines WHICH cfs/tables 
 will actually be replicated to a specific peer.
 For backward compatibility, an empty 'table:cf-list' replicates all 
 replicatable cfs/tables. (This means we don't allow a peer which replicates 
 nothing from a source cluster; we think that's reasonable: if replicating 
 nothing, why bother adding a peer?)
 This improvement addresses the exact problem raised by the first FAQ in 
 http://hbase.apache.org/replication.html:
   "GLOBAL means replicate? Any provision to replicate only to cluster X and 
 not to cluster Y? or is that for later?"
   "Yes, this is for much later."
 I also noticed somebody mentioned making replication-scope an integer rather 
 than a boolean for such fine-grained replication purposes, but I think 
 extending replication-scope can't achieve the same replication granularity 
 and flexibility as the per-peer replication configuration above.
 This improvement has been running smoothly in our production clusters 
 (Xiaomi) for several months.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered

2013-06-12 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13681200#comment-13681200
 ] 

Feng Honghua commented on HBASE-8721:
-

[~stack]
1. I also agree that keeping deleted cells forever (even through major 
compactions) can fix the inconsistency in the scenario I mentioned before. But 
that treatment makes HFiles grow continuously without ever shrinking, since 
deleted cells and delete markers are never collected. That is an obvious 
drawback of keeping deleted cells (together with delete markers), and one many 
users won't want.
2. The behaviour 'a delete can mask puts that happened after the delete' is 
unacceptable to many users. When a user puts a kv into HBase, his intention is 
to ADD that kv, and he definitely wants to be able to retrieve it back with a 
Get/Scan, regardless of whether a delete ever occurred. The current behaviour 
is unacceptable for two reasons: (a) when a user puts a kv, receives a success 
response, and then fails to read it back, he is confused, and it is hard for 
him to realize the cause is a delete that someone (maybe he himself) wrote 
earlier; (b) if a delete can mask puts that happen after it, then once the 
delete is written to HBase (until it is collected by a major compaction) it 
semantically blocks that kv from being added back, even though the 'put' 
operation itself succeeds syntactically.
3. Yes, my fix adjusts the behaviour from 'a delete can mask puts that 
happened after the delete' to 'a delete can only mask puts that happened 
before (or at the same point as) the delete'. With this adjustment the 
inconsistency caused by major compaction no longer appears.
4. My fix uses mvcc together with timestamp to determine whether a delete can 
mask a put. This doesn't break the original delete semantics defined by 
timestamp alone; it reinforces them with mvcc, which defines the real ordering 
of the points in time at which operations occur (timestamp can't). Why I don't 
use sequence-id: (a) once kvs are flushed/compacted to HFiles, the sequence-id 
no longer accompanies each kv, but mvcc does, so with seq-id we couldn't 
handle kvs already in HFiles; (b) yes, seq-id defines the persistence ordering 
of kvs and mvcc the visibility ordering, and the two can interleave for a pair 
of kvs, but that doesn't hurt the correctness of the adjusted behaviour: when 
the seq-id advances, users can't see (read) the kv until its mvcc advances, so 
visibility follows persistence (mvcc is assigned after seq-id). seq-id is a 
background implementation detail users aren't aware of, while mvcc impacts 
data visibility and users are aware of it.
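
To make point 4 concrete, a minimal sketch of the masking rule using an 
illustrative cell record; this shows the comparison the patch describes, not 
the patch itself:

{code:java}
// Illustrative cell carrying the two orderings discussed above.
final class IllustrativeCell {
  final long timestamp; // user-visible time dimension
  final long mvcc;      // real write ordering, preserved through flush/compact

  IllustrativeCell(long timestamp, long mvcc) {
    this.timestamp = timestamp;
    this.mvcc = mvcc;
  }

  // Old rule: timestamp alone -- a delete masks any put with ts <= its own,
  // even a put written after the delete.
  static boolean masksByTimestampOnly(IllustrativeCell delete, IllustrativeCell put) {
    return put.timestamp <= delete.timestamp;
  }

  // Adjusted rule: additionally require that the put was actually written no
  // later than the delete, i.e. its mvcc is not newer.
  static boolean masksWithMvcc(IllustrativeCell delete, IllustrativeCell put) {
    return put.timestamp <= delete.timestamp && put.mvcc <= delete.mvcc;
  }
}
{code}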




[jira] [Commented] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered

2013-06-12 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13681204#comment-13681204
 ] 

Feng Honghua commented on HBASE-8721:
-

[~sershe]

This fix is not about timestamp's role in controlling versions, but about 
whether the behaviour 'a delete can mask puts that happened after the delete' 
is reasonable/acceptable. I totally agree with you that timestamp is the only 
criterion to define/control version semantics, and my fix doesn't break that. 
You can see what I mean from my comment to Stack above, or by reviewing the 
patch. Thanks a lot




[jira] [Commented] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered

2013-06-11 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13680323#comment-13680323
 ] 

Feng Honghua commented on HBASE-8721:
-

[~sershe]

If we want to keep the behaviour that a delete can mask puts that happened 
after the delete, then the only way to fix the inconsistency caused by major 
compaction is to keep the delete markers forever, as you said.

But I think the inconsistency's root cause is that arguable behaviour itself. 
A more intuitive and reasonable behaviour is that a delete masks only puts 
that happened before it, and has no impact on puts that happen after it. (This 
has nothing to do with the separate behaviour that timestamp determines which 
kv survives under version semantics.) If we adopt the adjusted behaviour, we 
can fix the inconsistency with the help of mvcc alone and keep collecting the 
delete markers during major compaction as before; there is no need to keep 
them forever.

An obvious, even ridiculous, drawback of the current behaviour is that an end 
user puts a kv, gets a success response, and then can't read that kv back, 
just because someone (maybe the user himself, without realizing it) earlier 
issued a delete that masks it. This really sounds uncanny and weird.

Back to the scenarios where timestamp is used as another ordinary dimension 
without time semantics: in those cases we declare max versions = 
Integer.MAX_VALUE, so timestamp isn't used to control the version count but 
serves as an ordinary dimension to locate a cell, and each cell has a single 
version. So no problem there.

I agree we can introduce a config knob to enable the new behaviour.
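
If such a knob were added, it could be read like any other boolean config; the 
property name below is invented for illustration and does not exist in HBase:

{code:java}
import org.apache.hadoop.conf.Configuration;

final class DeleteSemanticsKnob {
  // Hypothetical key; default true keeps the long-standing behaviour
  // unless the user explicitly opts in to the adjusted semantics.
  static final String KEY = "hbase.hregion.delete.masks.later.puts";

  static boolean deleteMasksLaterPuts(Configuration conf) {
    return conf.getBoolean(KEY, true);
  }
}
{code}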




[jira] [Commented] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered

2013-06-10 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13679339#comment-13679339
 ] 

Feng Honghua commented on HBASE-8721:
-

[~andrew.purt...@gmail.com] Consider this scenario: first put a KV with 
timestamp T0, then delete it with timestamp T1 (T1 > T0), then put that KV 
again with timestamp T0. Whether a Get/Scan can now read the KV depends on 
whether a major compaction occurred between the delete and the second put: if 
one did, the delete marker has been collected, so the second put survives and 
can be read out; if none did, both the first and second puts are masked by the 
delete and can't be read out. In other words, under the current delete-handling 
mechanism, data visibility sometimes depends on whether a major compaction 
happens to run during a tricky time window. This behaviour is weird and 
unacceptable: major compaction should be transparent to end users and should 
by no means affect data visibility.
The behaviour that a later put can be masked by an earlier delete is itself 
somewhat weird and confuses end users in many scenarios. Some HBase users in 
our company complain about it and claim it is unacceptable from their end 
users' viewpoint. I also noticed the same discussion in HBASE-2256




[jira] [Commented] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered

2013-06-10 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13680126#comment-13680126
 ] 

Feng Honghua commented on HBASE-8721:
-

[~sershe]/[~apurtell] The scenario, refined:
1. put a kv with timestamp T0, and flush;
2. delete that kv with timestamp T1, and flush;
3. a major compaction occurs [or not];
4. put that kv again with timestamp T0;
5. read that kv;

a). if a major compaction occurs at step 3, then step 5 returns the put 
written at step 4
b). if no major compaction occurs at step 3, then step 5 returns nothing

I think this is a BUG.

And I also DON'T think the behaviour where a delete masks later puts with the 
same ts is expected. In some real-world scenarios timestamp is used NOT with 
time semantics but as another ordinary dimension of the kv's coordinates: a 
user puts a kv, deletes it, some time later puts that kv again, and finds that 
the write succeeds but the kv can't be read back. The current delete behaviour 
limits such extended usage of the timestamp dimension.

Do we accept this incorrect/buggy behaviour as is, just because it has existed 
unfixed for so long?




[jira] [Created] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered

2013-06-09 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-8721:
---

 Summary: fix for bug that delete can mask puts that happened after 
the delete was entered
 Key: HBASE-8721
 URL: https://issues.apache.org/jira/browse/HBASE-8721
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Feng Honghua


this fix aims for the bug mentioned in http://hbase.apache.org/book.html 5.8.2.1:

Deletes mask puts, even puts that happened after the delete was entered. 
Remember that a delete writes a tombstone, which only disappears after the 
next major compaction has run. Suppose you do a delete of everything <= T. 
After this you do a new put with a timestamp <= T. This put, even if it 
happened after the delete, will be masked by the delete tombstone. Performing 
the put will not fail, but when you do a get you will notice the put had no 
effect. It will start working again after the major compaction has run. 
These issues should not be a problem if you use always-increasing versions for 
new puts to a row. But they can occur even if you do not care about time: just 
do delete and put immediately after each other, and there is some chance they 
happen within the same millisecond.



[jira] [Updated] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered

2013-06-09 Thread Feng Honghua (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-8721:


Attachment: HBASE-8721-0.94-V0.patch




[jira] [Commented] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered

2013-06-09 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13678994#comment-13678994
 ] 

Feng Honghua commented on HBASE-8721:
-

This fix uses mvcc together with timestamp to determine whether a put is 
masked by a delete, so a put that happened after a delete can't be masked by 
it, even though by timestamp alone it would be (as before this fix). In this 
scheme mvcc is kept intact unconditionally (not reset to 0) during 
flush/compact; furthermore, mvcc also needs to be serialized (and later 
deserialized) to the HLog for possible replay, and the max mvcc found in 
recovered split HLogs needs to be reflected in the recovered HRegion during 
its initialization phase.
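
A conceptual sketch of the two recovery-related points (mvcc travels with each 
serialized entry, and the region's mvcc counter resumes past the replayed 
max); the types are illustrative, not HBase's actual HLog classes:

{code:java}
// Illustrative WAL record: the point is that mvcc now survives serialization.
final class IllustrativeWalEntry {
  final byte[] row;
  final long timestamp;
  final long mvcc; // previously reset to 0 on flush/compact; now preserved

  IllustrativeWalEntry(byte[] row, long timestamp, long mvcc) {
    this.row = row;
    this.timestamp = timestamp;
    this.mvcc = mvcc;
  }
}

final class IllustrativeRegionRecovery {
  // After replaying recovered split HLogs, the region must resume its mvcc
  // counter past the max replayed value, or fresh writes could be ordered
  // "before" already-recovered puts/deletes and break the masking rule.
  static long nextMvccAfterReplay(Iterable<IllustrativeWalEntry> replayed) {
    long max = 0;
    for (IllustrativeWalEntry e : replayed) {
      max = Math.max(max, e.mvcc);
    }
    return max + 1;
  }
}
{code}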




[jira] [Commented] (HBASE-8721) fix for bug that delete can mask puts that happened after the delete was entered

2013-06-09 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13678995#comment-13678995
 ] 

Feng Honghua commented on HBASE-8721:
-

this patch is based on code downloaded from 
http://svn.apache.org/repos/asf/hbase/branches/0.94, and all the unit tests 
pass in my local environment




[jira] [Commented] (HBASE-8357) current region server failover mechanism for replication can lead to stale region server whose left hlogs can't be replicated by other region server

2013-04-17 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634718#comment-13634718
 ] 

Feng Honghua commented on HBASE-8357:
-

OK, I'm using HBase 0.94.3 and came up with this issue while reading the 
replication region-server-failover code. Very glad it's already fixed. Lars, 
could you please close it as duplicate/fixed? Thanks




[jira] [Created] (HBASE-8357) current region server failover mechanism for replication can lead to stale region server whose left hlogs can't be replicated by other region server

2013-04-16 Thread Feng Honghua (JIRA)
Feng Honghua created HBASE-8357:
---

 Summary: current region server failover mechanism for replication 
can lead to stale region server whose left hlogs can't be replicated by other 
region server
 Key: HBASE-8357
 URL: https://issues.apache.org/jira/browse/HBASE-8357
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.94.3
Reporter: Feng Honghua


consider this scenario: region servers A/B/C; A dies; B and C race to lock A's 
znode so one of them can help replicate A's remaining unreplicated hlogs; B 
wins and successfully creates the lock under A's znode, but before B copies 
A's hlog queues to its own znode, B also dies; C then successfully creates the 
lock under B's znode and helps replicate B's own remaining hlogs. But A's 
remaining hlogs can't be replicated by any other RS, since B left a lock 
behind under A's znode and never transferred A's hlog queues to its own znode 
before dying.
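
For illustration, one way to close that window is to transfer the dead RS's 
queues and release them atomically, e.g. with ZooKeeper's multi(); this sketch 
flattens the real queue layout to a single level and is not the actual HBase 
fix:

{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

final class IllustrativeQueueClaim {
  // Copy every queue znode of the dead RS under our own subtree and delete the
  // originals in ONE multi(): the transfer is all-or-nothing, so a claimer
  // that dies mid-way can no longer strand the dead RS's hlogs behind a stale
  // lock.
  static boolean claimQueues(ZooKeeper zk, String deadRsPath, String myRsPath)
      throws KeeperException, InterruptedException {
    List<Op> ops = new ArrayList<>();
    for (String queue : zk.getChildren(deadRsPath, false)) {
      byte[] position = zk.getData(deadRsPath + "/" + queue, false, null);
      ops.add(Op.create(myRsPath + "/claimed-" + queue, position,
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
      ops.add(Op.delete(deadRsPath + "/" + queue, -1));
    }
    try {
      zk.multi(ops); // atomic: either we own all of A's queues or none
      return true;
    } catch (KeeperException e) {
      return false; // lost the race; the queues stay intact for another RS
    }
  }
}
{code}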



[jira] [Commented] (HBASE-7632) fail to create ReplicationSource if ReplicationPeer.startStateTracker checkExists(peerStateNode) and find not exist but fails in createAndWatch due to client/shell is d

2013-01-20 Thread Feng Honghua (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13558511#comment-13558511
 ] 

Feng Honghua commented on HBASE-7632:
-

attach the call stack:

2013-01-10 09:22:03,084 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/sdtst-miliao/replication/peers/34/peer-state already exists and this is not a retry
2013-01-10 09:22:03,084 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Error while adding a new peer
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /hbase/sdtst-miliao/replication/peers/34/peer-state
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:778)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:420)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:402)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndWatch(ZKUtil.java:852)
        at org.apache.hadoop.hbase.replication.ReplicationPeer.startStateTracker(ReplicationPeer.java:82)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:344)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:307)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$PeersWatcher.nodeChildrenChanged(ReplicationSourceManager.java:511)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:316)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
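
The trace shows the race: exists() reported the peer-state node absent, but 
the shell finished creating it before createAndWatch() ran. A minimal sketch 
of an idempotent alternative using the raw ZooKeeper API (illustrative, not 
the committed fix):

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

final class IllustrativePeerStateTracker {
  // Ensure the peer-state znode exists and leave a watch on it, tolerating a
  // concurrent create instead of failing the whole addPeer.
  static byte[] ensureAndWatch(ZooKeeper zk, String path, byte[] initialState)
      throws KeeperException, InterruptedException {
    try {
      zk.create(path, initialState, ZooDefs.Ids.OPEN_ACL_UNSAFE,
          CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      // Benign: someone created the node between our exists() check and this
      // create(). Fall through and just read it with a watch.
    }
    return zk.getData(path, true, null); // sets the watch either way
  }
}
{code}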

 fail to create ReplicationSource if ReplicationPeer.startStateTracker 
 checkExists(peerStateNode) and find not exist but fails in createAndWatch due 
 to client/shell is done creating it then, now throws exception and results in 
 addPeer fail
 --

 Key: HBASE-7632
 URL: https://issues.apache.org/jira/browse/HBASE-7632
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.94.2, 0.94.3, 0.94.4
Reporter: Feng Honghua
   Original Estimate: 48h
  Remaining Estimate: 48h

 ReplicationSource creation fails when ReplicationPeer.startStateTracker calls 
 checkExists(peerStateNode), finds the node absent, but then fails in 
 createAndWatch because the client/shell finished creating the node in the 
 meantime; the exception is thrown upward and the whole addPeer fails.


