[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.

2019-01-13 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741712#comment-16741712
 ] 

Zheng Hu edited comment on HBASE-21657 at 1/14/19 3:51 AM:
---

Thanks for your reply, [~stack].
bq. Do you need the change in NoTagsByteBufferKeyValue. It inherits from 
ByteBufferKeyValue which has your change.
OK, it seems there's no need. I will remove getSerializedSize() from 
NoTagsByteBufferKeyValue. 
bq. Is it safe doing a return of this.length in SizeCachedKeyValue ? It caches 
rowLen and keyLen... Does it cache this.length? Maybe its caching what is 
passed in on construction?
Yeah, this.length is cached when the KeyValue is constructed. getSerializedSize() 
means: if the cell has tags, return the length with tags; if there are no tags, 
return the length without tags. So I think it's fine to just return the cached 
length (this.length) in SizeCachedKeyValue. 
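A minimal sketch of that idea (assuming the cached-length field described above; 
illustrative only, not the committed patch):
{code:java}
// SizeCachedKeyValue inherits KeyValue's cached (bytes, offset, length) triple,
// so the no-arg getSerializedSize() can simply return the cached total length:
// with tags if the cell carries them, without tags otherwise.
@Override
public int getSerializedSize() {
  return this.length; // cached at construction time, nothing to re-parse
}
{code}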
bq. That addition to addSize in RSRpcServices is cryptic. We need that sir? Say 
more why you are doing the accounting on the outside?
Well, I was thinking about it the wrong way. There's no need for this outside 
accounting any more, because getSerializedSize() is now very lightweight.
bq. It should go back to branch-2.0?
Theoretically, I think it should. But it seems there may be compatibility 
issues: if users have implemented their own Cell, their code would no longer 
compile. On the other hand, if we don't include it in branch-2.0, the 
performance degradation seems unacceptable. 



was (Author: openinx):
Thanks for your reply, [~stack].
bq. Do you need the change in NoTagsByteBufferKeyValue. It inherits from 
ByteBufferKeyValue which has your change.
OK, it seems there's no need. I will remove getSerializedSize() from 
NoTagsByteBufferKeyValue. 
bq. Is it safe doing a return of this.length in SizeCachedKeyValue ? It caches 
rowLen and keyLen... Does it cache this.length? Maybe its caching what is 
passed in on construction?
Yeah, this.length is cached when the KeyValue is constructed. getSerializedSize() 
means: if the cell has tags, return the length with tags; if there are no tags, 
return the length without tags. So I think it's fine to just return the cached 
length (this.length) in SizeCachedKeyValue. 
bq. That addition to addSize in RSRpcServices is cryptic. We need that sir? Say 
more why you are doing the accounting on the outside?
If we read cells with tags, we will spend some CPU getting the serializedSize 
(parsing the offsets & lengths, especially for ByteBufferKeyValue). Actually we 
can save that by just taking the delta of the accumulated size before and after 
scanner.nextRaw. 
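A hedged sketch of that delta idea, with hypothetical names (this is not what 
RSRpcServices actually does):
{code:java}
// Record an accumulated-size counter before and after nextRaw() and charge
// only the difference, instead of asking every returned cell for its size.
long sizeBefore = scannerContext.getDataSizeProgress();        // hypothetical accessor
boolean moreRows = scanner.nextRaw(results, scannerContext);
long delta = scannerContext.getDataSizeProgress() - sizeBefore;
addSize(rpcCall, delta);                                        // hypothetical overload taking a raw delta
{code}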
bq. It should go back to branch-2.0?
Theoretically, I think it should. But it seems there may be compatibility 
issues: if users have implemented their own Cell, their code would no longer 
compile. On the other hand, if we don't include it in branch-2.0, the 
performance degradation seems unacceptable. 


> PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% 
> scan case.
> 
>
> Key: HBASE-21657
> URL: https://issues.apache.org/jira/browse/HBASE-21657
> Project: HBase
>  Issue Type: Bug
>  Components: Performance
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21657.v1.patch, HBASE-21657.v2.patch, 
> HBASE-21657.v3.patch, HBASE-21657.v3.patch, HBASE-21657.v4.patch, 
> HBASE-21657.v5.patch, HBASE-21657.v5.patch, HBASE-21657.v5.patch, 
> HBASE-21657.v6.patch, HBase1.4.9-ssd-1000-rows-flamegraph.svg, 
> HBase1.4.9-ssd-1000-rows-qps-latency.png, 
> HBase2.0.4-patch-v2-ssd-1000-rows-qps-and-latency.png, 
> HBase2.0.4-patch-v2-ssd-1000-rows.svg, 
> HBase2.0.4-patch-v3-ssd-1000-rows-flamegraph.svg, 
> HBase2.0.4-patch-v3-ssd-1000-rows-qps-and-latency.png, 
> HBase2.0.4-patch-v4-ssd-1000-rows-flamegraph.svg, 
> HBase2.0.4-ssd-1000-rows-flamegraph.svg, 
> HBase2.0.4-ssd-1000-rows-qps-latency.png, HBase2.0.4-with-patch.v2.png, 
> HBase2.0.4-without-patch-v2.png, debug-the-ByteBufferKeyValue.diff, 
> hbase2.0.4-ssd-scan-traces.2.svg, hbase2.0.4-ssd-scan-traces.svg, 
> hbase20-ssd-100-scan-traces.svg, image-2019-01-07-19-03-37-930.png, 
> image-2019-01-07-19-03-55-577.png, overview-statstics-1.png, run.log
>
>
> We are evaluating the performance of branch-2, and found that the throughput 
> of scan on an SSD cluster is almost the same as on an HDD cluster. So I made 
> a FlameGraph on the RS and found that 
> PrivateCellUtil#estimatedSerializedSizeOf costs about 29% of the CPU. 
> Obviously, it has been the bottleneck in the 100% scan case.
> See the [^hbase20-ssd-100-scan-traces.svg]
> BTW, in our XiaoMi branch, we introduced 
> HRegion#updateReadRequestsByCapacityUnitPerSecond to sum up the size of cells 
> (for metric monitoring), so the performance loss was amplified there.



--

[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.

2019-01-08 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736879#comment-16736879
 ] 

Zheng Hu edited comment on HBASE-21657 at 1/8/19 8:39 AM:
--

I used patch [1] and patch.v3 in our cluster to verify what's wrong with the 
stack trace [2]. I did not see any such stack traces in our cluster, so I guess 
the flame graph may have messed up the stack trace. BTW, I found that the 
KeyValueEncoder always calls getSerializedSize without tags, which means it 
costs a lot of CPU to calculate the cell size (though the flame graph did not 
show this), while tags are off 99% of the time (as [~stack] said on RB), so 
maybe we can also optimize the encoder (see the sketch after the stack trace 
below). 

{code}
org.apache.hadoop.hbase.ByteBufferKeyValue.getSerializedSize(ByteBufferKeyValue.java:294)
org.apache.hadoop.hbase.KeyValueUtil.getSerializedSize(KeyValueUtil.java:753)
org.apache.hadoop.hbase.codec.KeyValueCodec$KeyValueEncoder.write(KeyValueCodec.java:62)
org.apache.hadoop.hbase.ipc.CellBlockBuilder.encodeCellsTo(CellBlockBuilder.java:192)
org.apache.hadoop.hbase.ipc.CellBlockBuilder.buildCellBlockStream(CellBlockBuilder.java:229)
org.apache.hadoop.hbase.ipc.ServerCall.setResponse(ServerCall.java:203)
org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:161)
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
{code}
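
A hedged sketch of the kind of encoder-side optimization meant above 
(illustrative only, with a hypothetical helper name; not a committed change):
{code:java}
// When a cell carries no tags, its serialized size without tags equals its
// cached total size, so the expensive offset/length re-parsing can be skipped
// for the common no-tag case.
static int serializedSizeWithoutTags(ExtendedCell cell) {
  if (cell.getTagsLength() == 0) {
    return cell.getSerializedSize(true);  // cached full length == no-tags length
  }
  return cell.getSerializedSize(false);   // tags present: strip them the slow way
}
{code}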

1. 
https://issues.apache.org/jira/secure/attachment/12954128/debug-the-ByteBufferKeyValue.diff
2. 
https://issues.apache.org/jira/browse/HBASE-21657?focusedCommentId=16735710&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16735710
 


was (Author: openinx):
I used patch [1] and patch.v3 in our cluster to verify what's wrong with the 
stack trace [2]. I did not see any such stack traces in our cluster, so I guess 
the flame graph may have messed up the stack trace. BTW, I found that the 
KeyValueEncoder always calls getSerializedSize without tags, which means it 
costs a lot of CPU to calculate the cell size (though the flame graph did not 
show this), while tags are off 99% of the time (as [~stack] said on RB), so 
maybe we can also optimize the encoder. 

1. 
https://issues.apache.org/jira/secure/attachment/12954128/debug-the-ByteBufferKeyValue.diff
2. 
https://issues.apache.org/jira/browse/HBASE-21657?focusedCommentId=16735710&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16735710
 

> PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% 
> scan case.
> 
>
> Key: HBASE-21657
> URL: https://issues.apache.org/jira/browse/HBASE-21657
> Project: HBase
>  Issue Type: Bug
>  Components: Performance
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21657.v1.patch, HBASE-21657.v2.patch, 
> HBASE-21657.v3.patch, HBASE-21657.v3.patch, 
> HBase1.4.9-ssd-1000-rows-flamegraph.svg, 
> HBase1.4.9-ssd-1000-rows-qps-latency.png, 
> HBase2.0.4-patch-v2-ssd-1000-rows-qps-and-latency.png, 
> HBase2.0.4-patch-v2-ssd-1000-rows.svg, 
> HBase2.0.4-patch-v3-ssd-1000-rows-flamegraph.svg, 
> HBase2.0.4-patch-v3-ssd-1000-rows-qps-and-latency.png, 
> HBase2.0.4-ssd-1000-rows-flamegraph.svg, 
> HBase2.0.4-ssd-1000-rows-qps-latency.png, HBase2.0.4-with-patch.v2.png, 
> HBase2.0.4-without-patch-v2.png, debug-the-ByteBufferKeyValue.diff, 
> hbase2.0.4-ssd-scan-traces.2.svg, hbase2.0.4-ssd-scan-traces.svg, 
> hbase20-ssd-100-scan-traces.svg, image-2019-01-07-19-03-37-930.png, 
> image-2019-01-07-19-03-55-577.png, overview-statstics-1.png, run.log
>
>
> We are evaluating the performance of branch-2, and found that the throughput 
> of scan on an SSD cluster is almost the same as on an HDD cluster. So I made 
> a FlameGraph on the RS and found that 
> PrivateCellUtil#estimatedSerializedSizeOf costs about 29% of the CPU. 
> Obviously, it has been the bottleneck in the 100% scan case.
> See the [^hbase20-ssd-100-scan-traces.svg]
> BTW, in our XiaoMi branch, we introduced 
> HRegion#updateReadRequestsByCapacityUnitPerSecond to sum up the size of cells 
> (for metric monitoring), so the performance loss was amplified there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.

2019-01-07 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735455#comment-16735455
 ] 

Zheng Hu edited comment on HBASE-21657 at 1/7/19 11:04 AM:
---

Over the past few days, I ran some tests for the cases above:
||HBaseVersion||Storage||QPS&Latency||FlameGraph||Comment||
|HBase2.0.4|SSD|[^HBase2.0.4-ssd-1000-rows-qps-latency.png]|[^HBase2.0.4-ssd-1000-rows-flamegraph.svg]|regionCount=100, rows=10^7, dataSizeOfTable=1.5GB, cacheHitRatio=100%|
|HBase2.0.4 + patch.v2|SSD|[^HBase2.0.4-patch-v2-ssd-1000-rows-qps-and-latency.png]|[^HBase2.0.4-patch-v2-ssd-1000-rows.svg]|regionCount=100, rows=10^7, dataSizeOfTable=1.5GB, cacheHitRatio=100%|
|HBase2.0.4 + patch.v3|SSD|[^HBase2.0.4-patch-v3-ssd-1000-rows-qps-and-latency.png]|[^HBase2.0.4-patch-v3-ssd-1000-rows-flamegraph.svg]|regionCount=100, rows=10^7, dataSizeOfTable=1.5GB, cacheHitRatio=100%|
|HBase1.4.9|SSD|[^HBase1.4.9-ssd-1000-rows-qps-latency.png]|[^HBase1.4.9-ssd-1000-rows-flamegraph.svg]|regionCount=100, rows=10^7, dataSizeOfTable=1.5GB, cacheHitRatio=100%|

Besides, I put together an overview of the statistics:

!image-2019-01-07-19-03-55-577.png!

We can see that *the performance of HBase 1.4.9 is almost the same as HBase 
2.0.4 with patch.v2*, so I think we are heading in the right direction for 
optimizing HBase 2.0 performance.

Now the problem is: how do we write the patch? IMO, we can move 
getSerializedSize() (the variant without the tag param) and heapSize() to the 
Cell interface to eliminate the instanceof checks and class casting. Pre-sizing 
the results ArrayList will also help a lot; it does not need to be 1000, we can 
choose Min(rows, 512) to avoid costing too much memory for a scan with a huge 
row limit. A sketch of both ideas follows below.
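
A rough sketch of the two ideas (hypothetical shapes, not the final patch):
{code:java}
// 1) Put the size methods on the cell itself so callers need no instanceof/cast:
//      int getSerializedSize();   // with tags if present, without otherwise
//      long heapSize();
// 2) Pre-size the per-call results list from the scan's row limit instead of
//    letting it grow, capping it so a huge limit does not over-allocate:
int rows = scan.getLimit() > 0 ? scan.getLimit() : 1000; // assumed default batch
List<Cell> results = new ArrayList<>(Math.min(rows, 512));
{code}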

I haven't looked at the method inlining in detail yet; I will try to do that.

[~stack] FYI


was (Author: openinx):
Over the past few days, I ran some tests for the cases above:
||HBaseVersion||Storage||QPS&Latency||FlameGraph||Comment||
|HBase2.0.4|SSD|[^HBase2.0.4-ssd-1000-rows-qps-latency.png]|[^HBase2.0.4-ssd-1000-rows-flamegraph.svg]|regionCount=100, rows=10^7, dataSizeOfTable=1.5GB, cacheHitRatio=100%|
|HBase2.0.4 + patch.v2|SSD|[^HBase2.0.4-patch-v2-ssd-1000-rows-qps-and-latency.png]|[^HBase2.0.4-patch-v2-ssd-1000-rows.svg]|regionCount=100, rows=10^7, dataSizeOfTable=1.5GB, cacheHitRatio=100%|
|HBase1.4.9|SSD|[^HBase1.4.9-ssd-1000-rows-qps-latency.png]|[^HBase1.4.9-ssd-1000-rows-flamegraph.svg]|regionCount=100, rows=10^7, dataSizeOfTable=1.5GB, cacheHitRatio=100%|

Besides, I put together an overview of the statistics:

!overview-statstics-1.png!

We can see that *the performance of HBase 1.4.9 is almost the same as HBase 
2.0.4 with patch.v2*, so I think we are heading in the right direction for 
optimizing HBase 2.0 performance.

Now the problem is: how do we write the patch? IMO, we can move 
getSerializedSize() (the variant without the tag param) and heapSize() to the 
Cell interface to eliminate the instanceof checks and class casting. Pre-sizing 
the results ArrayList will also help a lot; it does not need to be 1000, we can 
choose Min(rows, 512) to avoid costing too much memory for a scan with a huge 
row limit.

I haven't looked at the method inlining in detail yet; I will try to do that.

[~stack] FYI

> PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% 
> scan case.
> 
>
> Key: HBASE-21657
> URL: https://issues.apache.org/jira/browse/HBASE-21657
> Project: HBase
>  Issue Type: Bug
>  Components: Performance
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21657.v1.patch, HBASE-21657.v2.patch, 
> HBASE-21657.v3.patch, HBASE-21657.v3.patch, 
> HBase1.4.9-ssd-1000-rows-flamegraph.svg, 
> HBase1.4.9-ssd-1000-rows-qps-latency.png, 
> HBase2.0.4-patch-v2-ssd-1000-rows-qps-and-latency.png, 
> HBase2.0.4-patch-v2-ssd-1000-rows.svg, 
> HBase2.0.4-patch-v3-ssd-1000-rows-flamegraph.svg, 
> HBase2.0.4-patch-v3-ssd-1000-rows-qps-and-latency.png, 
> HBase2.0.4-ssd-1000-rows-flamegraph.svg, 
> HBase2.0.4-ssd-1000-rows-qps-latency.png, HBase2.0.4-with-patch.v2.png, 
> HBase2.0.4-without-patch-v2.png, hbase2.0.4-ssd-scan-traces.2.svg, 
> hbase2.0.4-ssd-scan-traces.svg, hbase20-ssd-100-scan-traces.svg, 
> image-2019-01-07-19-03-37-930.png, image-2019-01-07-19-03-55-577.png, 
> overview-statstics-1.png, run.log
>
>
> We are evaluating the performance of branch-2, and found that the throughput 
> of scan on an SSD cluster is almost the same as on an HDD cluster. So I made 
> a FlameGraph on the RS and found that 
> PrivateCellUtil#estimatedSerializedSizeOf costs about 29% of the CPU. 
> Obviously, it has been the bottleneck in the 100% scan case.
> See the [^hbase20-ssd-100-scan-traces.svg]
> BTW, in our XiaoMi branch, we introduced 
> HRegion#updateReadRequestsByCapacityUnitPerSecond to sum up the size of cells 
> (for metric monitoring), so the performance loss was amplified there.

[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.

2019-01-03 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733059#comment-16733059
 ] 

Zheng Hu edited comment on HBASE-21657 at 1/4/19 5:36 AM:
--

I made a performance comparison between HBase 2.0.4 without patch.v2 and 
HBase 2.0.4 with patch.v2:
||Comparison||QPS||FlameGraph||L2 cacheHitRatio||Latency||
|HBase2.0.4 without patch.v2|9979.8 ops/sec|[^hbase2.0.4-ssd-scan-traces.svg]|~95%|[^HBase2.0.4-without-patch-v2.png]|
|HBase2.0.4 with patch.v2|14392.7 ops/sec|[^hbase2.0.4-ssd-scan-traces.2.svg]|~95%|[^HBase2.0.4-with-patch.v2.png]|

So we can see there is a big difference between the two cases: after applying 
patch.v2, we got roughly a 40% throughput improvement. 

BTW, my testing environment was:
 5 nodes: 12*800G SSD / 24 cores / 128GB memory (50G on-heap + 50G off-heap for 
each RS, with 36G allocated to BucketCache).
 I used the YCSB 100% scan workload (by default, YCSB generates a scan with a 
limit in [1...1000]):
{code:java}
table=ycsb-test
columnfamily=C
recordcount=1
operationcount=1
workload=com.yahoo.ycsb.workloads.CoreWorkload
fieldlength=100
fieldcount=1

clientbuffering=true
  
readallfields=true
writeallfields=true
  
readproportion=0
updateproportion=0
scanproportion=1.0
insertproportion=0
  
requestdistribution=zipfian
{code}


was (Author: openinx):
I made a performance comparison between HBase 2.0.4 without patch.v2 and 
HBase 2.0.4 with patch.v2:
||Comparison||QPS||FlameGraph||L2 cacheHitRatio||Latency||
|HBase2.0.4 without patch.v2|9979.8 ops/sec|[^hbase2.0.4-ssd-scan-traces.svg]|~95%|[Graph.1|https://issues.apache.org/jira/secure/attachment/12953712/HBase2.0.4-without-patch-v2.png]|
|HBase2.0.4 with patch.v2|14392.7 ops/sec|[^hbase2.0.4-ssd-scan-traces.2.svg]|~95%|[Graph.2|https://issues.apache.org/jira/secure/attachment/12953711/HBase2.0.4-with-patch.v2.png]|

Later, I'll provide more details about the QPS & latency.

BTW, my testing environment was:
 5 nodes: 12*800G SSD / 24 cores / 128GB memory (50G on-heap + 50G off-heap for 
each RS, with 36G allocated to BucketCache).
 I used the YCSB 100% scan workload (by default, YCSB generates a scan with a 
limit in [1...1000]):
{code:java}
table=ycsb-test
columnfamily=C
recordcount=1
operationcount=1
workload=com.yahoo.ycsb.workloads.CoreWorkload
fieldlength=100
fieldcount=1

clientbuffering=true
  
readallfields=true
writeallfields=true
  
readproportion=0
updateproportion=0
scanproportion=1.0
insertproportion=0
  
requestdistribution=zipfian
{code}

> PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% 
> scan case.
> 
>
> Key: HBASE-21657
> URL: https://issues.apache.org/jira/browse/HBASE-21657
> Project: HBase
>  Issue Type: Bug
>  Components: Performance
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21657.v1.patch, HBASE-21657.v2.patch, 
> HBase2.0.4-with-patch.v2.png, HBase2.0.4-without-patch-v2.png, 
> hbase2.0.4-ssd-scan-traces.2.svg, hbase2.0.4-ssd-scan-traces.svg, 
> hbase20-ssd-100-scan-traces.svg
>
>
> We are evaluating the performance of branch-2, and found that the throughput 
> of scan on an SSD cluster is almost the same as on an HDD cluster. So I made 
> a FlameGraph on the RS and found that 
> PrivateCellUtil#estimatedSerializedSizeOf costs about 29% of the CPU. 
> Obviously, it has been the bottleneck in the 100% scan case.
> See the [^hbase20-ssd-100-scan-traces.svg]
> BTW, in our XiaoMi branch, we introduced 
> HRegion#updateReadRequestsByCapacityUnitPerSecond to sum up the size of cells 
> (for metric monitoring), so the performance loss was amplified there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.

2019-01-03 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733059#comment-16733059
 ] 

Zheng Hu edited comment on HBASE-21657 at 1/4/19 5:33 AM:
--

I made a performance comparison between HBase 2.0.4 without patch.v2 and 
HBase 2.0.4 with patch.v2:
||Comparison||QPS||FlameGraph||L2 cacheHitRatio||Latency||
|HBase2.0.4 without patch.v2|9979.8 ops/sec|[^hbase2.0.4-ssd-scan-traces.svg]|~95%|[Graph.1|https://issues.apache.org/jira/secure/attachment/12953712/HBase2.0.4-without-patch-v2.png]|
|HBase2.0.4 with patch.v2|14392.7 ops/sec|[^hbase2.0.4-ssd-scan-traces.2.svg]|~95%|[Graph.2|https://issues.apache.org/jira/secure/attachment/12953711/HBase2.0.4-with-patch.v2.png]|

Later, I'll provide more details about the QPS & latency.

BTW, my testing environment was:
 5 nodes: 12*800G SSD / 24 cores / 128GB memory (50G on-heap + 50G off-heap for 
each RS, with 36G allocated to BucketCache).
 I used the YCSB 100% scan workload (by default, YCSB generates a scan with a 
limit in [1...1000]):
{code:java}
table=ycsb-test
columnfamily=C
recordcount=1
operationcount=1
workload=com.yahoo.ycsb.workloads.CoreWorkload
fieldlength=100
fieldcount=1

clientbuffering=true
  
readallfields=true
writeallfields=true
  
readproportion=0
updateproportion=0
scanproportion=1.0
insertproportion=0
  
requestdistribution=zipfian
{code}


was (Author: openinx):
I made a performance comparison between HBase 2.0.4 without patch.v2 and 
HBase 2.0.4 with patch.v2:
||Comparison||QPS||FlameGraph||L2 cacheHitRatio||
|HBase2.0.4 without patch.v2|9979.8 ops/sec|[^hbase2.0.4-ssd-scan-traces.svg]|~95%|
|HBase2.0.4 with patch.v2|14392.7 ops/sec|[^hbase2.0.4-ssd-scan-traces.2.svg]|~95%|

Later, I'll provide more details about the QPS & latency.

BTW, my testing environment was:
 5 nodes: 12*800G SSD / 24 cores / 128GB memory (50G on-heap + 50G off-heap for 
each RS, with 36G allocated to BucketCache).
 I used the YCSB 100% scan workload (by default, YCSB generates a scan with a 
limit in [1...1000]):
{code:java}
table=ycsb-test
columnfamily=C
recordcount=1
operationcount=1
workload=com.yahoo.ycsb.workloads.CoreWorkload
fieldlength=100
fieldcount=1

clientbuffering=true
  
readallfields=true
writeallfields=true
  
readproportion=0
updateproportion=0
scanproportion=1.0
insertproportion=0
  
requestdistribution=zipfian
{code}

> PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% 
> scan case.
> 
>
> Key: HBASE-21657
> URL: https://issues.apache.org/jira/browse/HBASE-21657
> Project: HBase
>  Issue Type: Bug
>  Components: Performance
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21657.v1.patch, HBASE-21657.v2.patch, 
> HBase2.0.4-with-patch.v2.png, HBase2.0.4-without-patch-v2.png, 
> hbase2.0.4-ssd-scan-traces.2.svg, hbase2.0.4-ssd-scan-traces.svg, 
> hbase20-ssd-100-scan-traces.svg
>
>
> We are evaluating the performance of branch-2, and found that the throughput 
> of scan on an SSD cluster is almost the same as on an HDD cluster. So I made 
> a FlameGraph on the RS and found that 
> PrivateCellUtil#estimatedSerializedSizeOf costs about 29% of the CPU. 
> Obviously, it has been the bottleneck in the 100% scan case.
> See the [^hbase20-ssd-100-scan-traces.svg]
> BTW, in our XiaoMi branch, we introduced 
> HRegion#updateReadRequestsByCapacityUnitPerSecond to sum up the size of cells 
> (for metric monitoring), so the performance loss was amplified there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.

2019-01-03 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733059#comment-16733059
 ] 

Zheng Hu edited comment on HBASE-21657 at 1/4/19 3:31 AM:
--

I made a performance comparison between HBase 2.0.4 without patch.v2 and 
HBase 2.0.4 with patch.v2:
||Comparison||QPS||FlameGraph||L2 cacheHitRatio||
|HBase2.0.4 without patch.v2|9979.8 ops/sec|[^hbase2.0.4-ssd-scan-traces.svg]|~95%|
|HBase2.0.4 with patch.v2|14392.7 ops/sec|[^hbase2.0.4-ssd-scan-traces.2.svg]|~95%|

Later, I'll provide more details about the QPS & latency.

BTW, my testing environment was:
 5 nodes: 12*800G SSD / 24 cores / 128GB memory (50G on-heap + 50G off-heap for 
each RS, with 36G allocated to BucketCache).
 I used the YCSB 100% scan workload (by default, YCSB generates a scan with a 
limit in [1...1000]):
{code:java}
table=ycsb-test
columnfamily=C
recordcount=1
operationcount=1
workload=com.yahoo.ycsb.workloads.CoreWorkload
fieldlength=100
fieldcount=1

clientbuffering=true
  
readallfields=true
writeallfields=true
  
readproportion=0
updateproportion=0
scanproportion=1.0
insertproportion=0
  
requestdistribution=zipfian
{code}


was (Author: openinx):
I made a performance comparison between HBase 2.0.4 without patch.v2 and 
HBase 2.0.4 with patch.v2:

||Comparison||QPS||FlameGraph||L2 cacheHitRatio||
|HBase2.0.4 without patch.v2|14392.7 ops/sec|[^hbase2.0.4-ssd-scan-traces.svg]|~95%|
|HBase2.0.4 with patch.v2|9979.8 ops/sec|[^hbase2.0.4-ssd-scan-traces.2.svg]|~95%|

Later, I'll provide more details about the QPS & latency.

BTW, my testing environment was:
 5 nodes: 12*800G SSD / 24 cores / 128GB memory (50G on-heap + 50G off-heap for 
each RS, with 36G allocated to BucketCache).
 I used the YCSB 100% scan workload (by default, YCSB generates a scan with a 
limit in [1...1000]):
{code}
table=ycsb-test
columnfamily=C
recordcount=1
operationcount=1
workload=com.yahoo.ycsb.workloads.CoreWorkload
fieldlength=100
fieldcount=1

clientbuffering=true
  
readallfields=true
writeallfields=true
  
readproportion=0
updateproportion=0
scanproportion=1.0
insertproportion=0
  
requestdistribution=zipfian
{code}

 


> PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% 
> scan case.
> 
>
> Key: HBASE-21657
> URL: https://issues.apache.org/jira/browse/HBASE-21657
> Project: HBase
>  Issue Type: Bug
>  Components: Performance
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21657.v1.patch, HBASE-21657.v2.patch, 
> hbase2.0.4-ssd-scan-traces.2.svg, hbase2.0.4-ssd-scan-traces.svg, 
> hbase20-ssd-100-scan-traces.svg
>
>
> We are evaluating the performance of branch-2, and found that the throughput 
> of scan on an SSD cluster is almost the same as on an HDD cluster. So I made 
> a FlameGraph on the RS and found that 
> PrivateCellUtil#estimatedSerializedSizeOf costs about 29% of the CPU. 
> Obviously, it has been the bottleneck in the 100% scan case.
> See the [^hbase20-ssd-100-scan-traces.svg]
> BTW, in our XiaoMi branch, we introduced 
> HRegion#updateReadRequestsByCapacityUnitPerSecond to sum up the size of cells 
> (for metric monitoring), so the performance loss was amplified there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.

2019-01-03 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733059#comment-16733059
 ] 

Zheng Hu edited comment on HBASE-21657 at 1/3/19 2:06 PM:
--

I made a performance comparison between HBase 2.0.4 without patch.v2 and 
HBase 2.0.4 with patch.v2:

||Comparison||QPS||FlameGraph||L2 cacheHitRatio||
|HBase2.0.4 without patch.v2|14392.7 ops/sec|[^hbase2.0.4-ssd-scan-traces.svg]|~95%|
|HBase2.0.4 with patch.v2|9979.8 ops/sec|[^hbase2.0.4-ssd-scan-traces.2.svg]|~95%|

Later, I'll provide more details about the QPS & latency.

BTW, my testing environment was:
 5 nodes: 12*800G SSD / 24 cores / 128GB memory (50G on-heap + 50G off-heap for 
each RS, with 36G allocated to BucketCache).
 I used the YCSB 100% scan workload (by default, YCSB generates a scan with a 
limit in [1...1000]):
{code}
table=ycsb-test
columnfamily=C
recordcount=1
operationcount=1
workload=com.yahoo.ycsb.workloads.CoreWorkload
fieldlength=100
fieldcount=1

clientbuffering=true
  
readallfields=true
writeallfields=true
  
readproportion=0
updateproportion=0
scanproportion=1.0
insertproportion=0
  
requestdistribution=zipfian
{code}

 



was (Author: openinx):
I made a performance comparison between HBase 2.0.4 without patch.v2 and 
HBase 2.0.4 with patch.v2:

||Comparison||QPS||FlameGraph||
|HBase2.0.4 without patch.v2|14392.7 ops/sec|[^hbase2.0.4-ssd-scan-traces.svg]|
|HBase2.0.4 with patch.v2|9979.8 ops/sec|[^hbase2.0.4-ssd-scan-traces.2.svg]|

Later, I'll provide more details about the QPS & latency.

BTW, my testing environment was:
 5 nodes: 12*800G SSD / 24 cores / 128GB memory (50G on-heap + 50G off-heap for 
each RS, with 36G allocated to BucketCache).
 I used the YCSB 100% scan workload (by default, YCSB generates a scan with a 
limit in [1...1000]):
{code}
table=ycsb-test
columnfamily=C
recordcount=1
operationcount=1
workload=com.yahoo.ycsb.workloads.CoreWorkload
fieldlength=100
fieldcount=1

clientbuffering=true
  
readallfields=true
writeallfields=true
  
readproportion=0
updateproportion=0
scanproportion=1.0
insertproportion=0
  
requestdistribution=zipfian
{code}

 


> PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% 
> scan case.
> 
>
> Key: HBASE-21657
> URL: https://issues.apache.org/jira/browse/HBASE-21657
> Project: HBase
>  Issue Type: Bug
>  Components: Performance
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21657.v1.patch, HBASE-21657.v2.patch, 
> hbase2.0.4-ssd-scan-traces.2.svg, hbase2.0.4-ssd-scan-traces.svg, 
> hbase20-ssd-100-scan-traces.svg
>
>
> We are evaluating the performance of branch-2, and found that the throughput 
> of scan on an SSD cluster is almost the same as on an HDD cluster. So I made 
> a FlameGraph on the RS and found that 
> PrivateCellUtil#estimatedSerializedSizeOf costs about 29% of the CPU. 
> Obviously, it has been the bottleneck in the 100% scan case.
> See the [^hbase20-ssd-100-scan-traces.svg]
> BTW, in our XiaoMi branch, we introduced 
> HRegion#updateReadRequestsByCapacityUnitPerSecond to sum up the size of cells 
> (for metric monitoring), so the performance loss was amplified there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.

2019-01-03 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733059#comment-16733059
 ] 

Zheng Hu edited comment on HBASE-21657 at 1/3/19 2:04 PM:
--

I made a performance comparison between HBase 2.0.4 without patch.v2 and 
HBase 2.0.4 with patch.v2:

||Comparison||QPS||FlameGraph||
|HBase2.0.4 without patch.v2|14392.7 ops/sec|[^hbase2.0.4-ssd-scan-traces.svg]|
|HBase2.0.4 with patch.v2|9979.8 ops/sec|[^hbase2.0.4-ssd-scan-traces.2.svg]|

Later, I'll provide more details about the QPS & latency.

BTW, my testing environment was:
 5 nodes: 12*800G SSD / 24 cores / 128GB memory (50G on-heap + 50G off-heap for 
each RS, with 36G allocated to BucketCache).
 I used the YCSB 100% scan workload (by default, YCSB generates a scan with a 
limit in [1...1000]):
{code}
table=ycsb-test
columnfamily=C
recordcount=1
operationcount=1
workload=com.yahoo.ycsb.workloads.CoreWorkload
fieldlength=100
fieldcount=1

clientbuffering=true
  
readallfields=true
writeallfields=true
  
readproportion=0
updateproportion=0
scanproportion=1.0
insertproportion=0
  
requestdistribution=zipfian
{code}

 



was (Author: openinx):
I made a performance comparison between HBase 2.0.4 without patch.v2 and 
HBase 2.0.4 with patch.v2:

|| - ||QPS||FlameGraph||
|HBase2.0.4 without patch.v2|14392.7 ops/sec|[^hbase2.0.4-ssd-scan-traces.svg]|
|HBase2.0.4 with patch.v2|9979.8 ops/sec|[^hbase2.0.4-ssd-scan-traces.2.svg]|

Later, I'll provide more details about the QPS & latency.


> PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% 
> scan case.
> 
>
> Key: HBASE-21657
> URL: https://issues.apache.org/jira/browse/HBASE-21657
> Project: HBase
>  Issue Type: Bug
>  Components: Performance
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21657.v1.patch, HBASE-21657.v2.patch, 
> hbase2.0.4-ssd-scan-traces.2.svg, hbase2.0.4-ssd-scan-traces.svg, 
> hbase20-ssd-100-scan-traces.svg
>
>
> We are evaluating the performance of branch-2, and found that the throughput 
> of scan on an SSD cluster is almost the same as on an HDD cluster. So I made 
> a FlameGraph on the RS and found that 
> PrivateCellUtil#estimatedSerializedSizeOf costs about 29% of the CPU. 
> Obviously, it has been the bottleneck in the 100% scan case.
> See the [^hbase20-ssd-100-scan-traces.svg]
> BTW, in our XiaoMi branch, we introduced 
> HRegion#updateReadRequestsByCapacityUnitPerSecond to sum up the size of cells 
> (for metric monitoring), so the performance loss was amplified there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.

2019-01-01 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731757#comment-16731757
 ] 

Zheng Hu edited comment on HBASE-21657 at 1/2/19 3:59 AM:
--

bq. I think this method is only called if we actually return some Cells to the 
client
That's right. 

bq. So I guess the assumption was that when the Cell need to ship over the 
network to the client anyway, that some CPU won't hurt. No longer true, I guess.
I don't think so, because if the bottleneck were the network or RPC, 
estimatedSerializedSizeOf shouldn't cost so much in the flame graph; the 
RPC-related methods should show a much higher ratio. 

bq. The cells being scanned not of type ExtendedCell?
I've checked the code path and added some logging. All the cells passed to 
PrivateCellUtil#estimatedSerializedSizeOf were SizeCachedKeyValue* or 
ByteBufferKeyValue (see HFileReaderImpl#getCell)... so all of them should be 
instanceof ExtendedCell. The complicated conditional statements may mean the 
JVM inlining did not kick in. Anyway, I'll provide a new performance report 
after applying patch.v1, which moves getSerializedSize from ExtendedCell to 
Cell to avoid the frequent instanceof checks; it's not a production patch, 
just for verification.
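
A rough, paraphrased sketch of the hot path in question (not a verbatim copy of 
PrivateCellUtil; the fallback helper is hypothetical):
{code:java}
// The per-cell instanceof test and cast is exactly what patch.v1 tries to
// avoid by moving getSerializedSize onto Cell itself.
static int estimatedSerializedSizeOf(final Cell cell) {
  if (cell instanceof ExtendedCell) {
    return ((ExtendedCell) cell).getSerializedSize(true) + Bytes.SIZEOF_INT;
  }
  // Slow fallback for other Cell implementations: sum the component lengths.
  return estimatedSerializedSizeOfSlowPath(cell); // hypothetical helper
}
{code}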



was (Author: openinx):
bq. I think this method is only called if we actually return some Cells to the 
client
That's right. 

bq. So I guess the assumption was that when the Cell need to ship over the 
network to the client anyway, that some CPU won't hurt. No longer true, I guess.
I don't think so, because if the bottleneck were the network or RPC, 
estimatedSerializedSizeOf shouldn't cost so much in the flame graph; the 
RPC-related methods should show a much higher ratio. 

bq. The cells being scanned not of type ExtendedCell?
I've checked the code path and added some logging. All the cells passed to 
PrivateCellUtil#estimatedSerializedSizeOf were SizeCachedKeyValue* or 
ByteBufferKeyValue (see HFileReaderImpl#getCell)... so all of them should be 
instanceof ExtendedCell. The complicated conditional statements that should 
lead to JVM inlining did not work. Anyway, I'll provide a new performance 
report after applying patch.v1, which moves getSerializedSize from ExtendedCell 
to Cell to avoid the frequent instanceof checks; it's not a production patch, 
just for verification.


> PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% 
> scan case.
> 
>
> Key: HBASE-21657
> URL: https://issues.apache.org/jira/browse/HBASE-21657
> Project: HBase
>  Issue Type: Bug
>  Components: Performance
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21657.v1.patch, hbase20-ssd-100-scan-traces.svg
>
>
> We are evaluating the performance of branch-2, and found that the throughput 
> of scan on an SSD cluster is almost the same as on an HDD cluster. So I made 
> a FlameGraph on the RS and found that 
> PrivateCellUtil#estimatedSerializedSizeOf costs about 29% of the CPU. 
> Obviously, it has been the bottleneck in the 100% scan case.
> See the [^hbase20-ssd-100-scan-traces.svg]
> BTW, in our XiaoMi branch, we introduced 
> HRegion#updateReadRequestsByCapacityUnitPerSecond to sum up the size of cells 
> (for metric monitoring), so the performance loss was amplified there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.

2018-12-29 Thread Lars Hofhansl (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16730673#comment-16730673
 ] 

Lars Hofhansl edited comment on HBASE-21657 at 12/29/18 12:40 PM:
--

HBASE-20459 ... different issue



was (Author: lhofhansl):
HBASE-20459


> PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% 
> scan case.
> 
>
> Key: HBASE-21657
> URL: https://issues.apache.org/jira/browse/HBASE-21657
> Project: HBase
>  Issue Type: Bug
>  Components: Performance
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21657.v1.patch, hbase20-ssd-100-scan-traces.svg
>
>
> We are evaluating the performance of branch-2, and found that the throughput 
> of scan on an SSD cluster is almost the same as on an HDD cluster. So I made 
> a FlameGraph on the RS and found that 
> PrivateCellUtil#estimatedSerializedSizeOf costs about 29% of the CPU. 
> Obviously, it has been the bottleneck in the 100% scan case.
> See the [^hbase20-ssd-100-scan-traces.svg]
> BTW, in our XiaoMi branch, we introduced 
> HRegion#updateReadRequestsByCapacityUnitPerSecond to sum up the size of cells 
> (for metric monitoring), so the performance loss was amplified there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)