[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.
[ https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741712#comment-16741712 ]

Zheng Hu edited comment on HBASE-21657 at 1/14/19 3:51 AM:
---

Thanks for your reply, [~stack].

bq. Do you need the change in NoTagsByteBufferKeyValue. It inherits from ByteBufferKeyValue which has your change.

OK, it seems there is no need for it; I will remove the getSerializedSize() override in NoTagsByteBufferKeyValue.

bq. Is it safe doing a return of this.length in SizeCachedKeyValue? It caches rowLen and keyLen... Does it cache this.length? Maybe its caching what is passed in on construction?

Yeah, this.length is cached when constructing a KeyValue. getSerializedSize() means: if the cell has tags, return the length including tags; if it has no tags, return the length without tags. So I think it's OK to just return this.length in SizeCachedKeyValue.

bq. That addition to addSize in RSRpcServices is cryptic. We need that sir? Say more why you are doing the accounting on the outside?

Well, I was thinking about it the wrong way. There is no need for that outside accounting any more, because getSerializedSize() is now very lightweight.

bq. It should go back to branch-2.0?

Theoretically, I think it should, but there may be compatibility issues: if users have implemented their own Cell, their code would no longer compile. On the other hand, if it is not included in branch-2.0, the performance degradation does not seem acceptable.

was (Author: openinx):
Thanks for your reply, [~stack].

bq. Do you need the change in NoTagsByteBufferKeyValue. It inherits from ByteBufferKeyValue which has your change.

OK, it seems there is no need for it; I will remove the getSerializedSize() override in NoTagsByteBufferKeyValue.

bq. Is it safe doing a return of this.length in SizeCachedKeyValue? It caches rowLen and keyLen... Does it cache this.length? Maybe its caching what is passed in on construction?

Yeah, this.length is cached when constructing a KeyValue. getSerializedSize() means: if the cell has tags, return the length including tags; if it has no tags, return the length without tags. So I think it's OK to just return this.length in SizeCachedKeyValue.

bq. That addition to addSize in RSRpcServices is cryptic. We need that sir? Say more why you are doing the accounting on the outside?

If we read cells with tags, we will spend some CPU getting the serializedSize (parsing the offsets & lengths, especially for ByteBufferKeyValue). Actually we can save that cost by just taking the delta of the accumulated size before and after the scanner.nextRaw call.

bq. It should go back to branch-2.0?

Theoretically, I think it should, but there may be compatibility issues: if users have implemented their own Cell, their code would no longer compile. On the other hand, if it is not included in branch-2.0, the performance degradation does not seem acceptable.

> PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.
>
>                 Key: HBASE-21657
>                 URL: https://issues.apache.org/jira/browse/HBASE-21657
>             Project: HBase
>          Issue Type: Bug
>          Components: Performance
>            Reporter: Zheng Hu
>            Assignee: Zheng Hu
>            Priority: Major
>             Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5
>
>         Attachments: HBASE-21657.v1.patch, HBASE-21657.v2.patch, HBASE-21657.v3.patch,
>                      HBASE-21657.v3.patch, HBASE-21657.v4.patch, HBASE-21657.v5.patch,
>                      HBASE-21657.v5.patch, HBASE-21657.v5.patch, HBASE-21657.v6.patch,
>                      HBase1.4.9-ssd-1000-rows-flamegraph.svg, HBase1.4.9-ssd-1000-rows-qps-latency.png,
>                      HBase2.0.4-patch-v2-ssd-1000-rows-qps-and-latency.png, HBase2.0.4-patch-v2-ssd-1000-rows.svg,
>                      HBase2.0.4-patch-v3-ssd-1000-rows-flamegraph.svg, HBase2.0.4-patch-v3-ssd-1000-rows-qps-and-latency.png,
>                      HBase2.0.4-patch-v4-ssd-1000-rows-flamegraph.svg, HBase2.0.4-ssd-1000-rows-flamegraph.svg,
>                      HBase2.0.4-ssd-1000-rows-qps-latency.png, HBase2.0.4-with-patch.v2.png,
>                      HBase2.0.4-without-patch-v2.png, debug-the-ByteBufferKeyValue.diff,
>                      hbase2.0.4-ssd-scan-traces.2.svg, hbase2.0.4-ssd-scan-traces.svg,
>                      hbase20-ssd-100-scan-traces.svg, image-2019-01-07-19-03-37-930.png,
>                      image-2019-01-07-19-03-55-577.png, overview-statstics-1.png, run.log
>
> We are evaluating the performance of branch-2 and found that the scan throughput in an SSD cluster is almost the same as in an HDD cluster. So I made a FlameGraph on the RS and found that PrivateCellUtil#estimatedSerializedSizeOf costs about 29% of the CPU; obviously, it has become the bottleneck in the 100% scan case. See [^hbase20-ssd-100-scan-traces.svg].
> BTW, in our XiaoMi branch we introduced HRegion#updateReadRequestsByCapacityUnitPerSecond to sum up the size of cells (for metrics monitoring), so the performance loss was amplified.
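A minimal sketch of the cached-length idea discussed above (illustrative only, with assumed class and field names; this is not the actual SizeCachedKeyValue patch):

{code:java}
// Illustrative sketch: the size accessor just returns the length cached at
// construction time instead of re-deriving it from row/key/value lengths.
public class CachedLengthCellSketch {
  private final byte[] bytes;  // serialized KeyValue backing bytes
  private final int offset;
  private final int length;    // cached on construction; includes tags if any

  public CachedLengthCellSketch(byte[] bytes, int offset, int length) {
    this.bytes = bytes;
    this.offset = offset;
    this.length = length;
  }

  /** Cheap size accessor: no per-call parsing of the backing bytes. */
  public int getSerializedSize() {
    return this.length;
  }
}
{code}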
[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.
[ https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736879#comment-16736879 ]

Zheng Hu edited comment on HBASE-21657 at 1/8/19 8:39 AM:
--

I used the patch [1] and patch.v3 in our cluster to verify what was wrong with the stacktrace [2]. I did not see any of those stacktraces in our cluster, so I guess the flamegraph may have messed up the stacktrace. BTW, I found that the KeyValueEncoder always calls getSerializedSize without tags, which means it costs quite a bit of CPU calculating the cell size (though the flamegraph did not show this), while tags are off 99% of the time (as [~stack] said in RB), so maybe we can also optimize the encoder.

{code}
org.apache.hadoop.hbase.ByteBufferKeyValue.getSerializedSize(ByteBufferKeyValue.java:294)
org.apache.hadoop.hbase.KeyValueUtil.getSerializedSize(KeyValueUtil.java:753)
org.apache.hadoop.hbase.codec.KeyValueCodec$KeyValueEncoder.write(KeyValueCodec.java:62)
org.apache.hadoop.hbase.ipc.CellBlockBuilder.encodeCellsTo(CellBlockBuilder.java:192)
org.apache.hadoop.hbase.ipc.CellBlockBuilder.buildCellBlockStream(CellBlockBuilder.java:229)
org.apache.hadoop.hbase.ipc.ServerCall.setResponse(ServerCall.java:203)
org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:161)
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
{code}

1. https://issues.apache.org/jira/secure/attachment/12954128/debug-the-ByteBufferKeyValue.diff
2. https://issues.apache.org/jira/browse/HBASE-21657?focusedCommentId=16735710&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16735710

was (Author: openinx):
I used the patch [1] and patch.v3 in our cluster to verify what was wrong with the stacktrace [2]. I did not see any of those stacktraces in our cluster, so I guess the flamegraph may have messed up the stacktrace. BTW, I found that the KeyValueEncoder always calls getSerializedSize without tags, which means it costs quite a bit of CPU calculating the cell size (though the flamegraph did not show this), while tags are off 99% of the time (as [~stack] said in RB), so maybe we can also optimize the encoder.

1. https://issues.apache.org/jira/secure/attachment/12954128/debug-the-ByteBufferKeyValue.diff
2. https://issues.apache.org/jira/browse/HBASE-21657?focusedCommentId=16735710&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16735710
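A rough sketch of the with/without-tags size distinction discussed above (illustrative only; the constants mirror the KeyValue layout of two 4-byte length prefixes plus an optional 2-byte tags-length prefix, but the class and field names are assumptions, not the real codec code):

{code:java}
// Illustrative sketch of a withTags-aware size calculation. In the real code
// the lengths come from the cell's backing array or ByteBuffer; here they are
// plain fields so the arithmetic is easy to see.
final class CellSizeSketch {
  private static final int KEY_VALUE_LENGTH_PREFIXES = 2 * Integer.BYTES; // keyLen + valueLen
  private static final int TAGS_LENGTH_PREFIX = Short.BYTES;

  private final int keyLength;
  private final int valueLength;
  private final int tagsLength; // 0 when the cell carries no tags

  CellSizeSketch(int keyLength, int valueLength, int tagsLength) {
    this.keyLength = keyLength;
    this.valueLength = valueLength;
    this.tagsLength = tagsLength;
  }

  /** Serialized size, optionally excluding the tags block the encoder skips. */
  int getSerializedSize(boolean withTags) {
    int size = KEY_VALUE_LENGTH_PREFIXES + keyLength + valueLength;
    if (withTags && tagsLength > 0) {
      size += TAGS_LENGTH_PREFIX + tagsLength;
    }
    return size;
  }
}
{code}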
[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.
[ https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735455#comment-16735455 ]

Zheng Hu edited comment on HBASE-21657 at 1/7/19 11:04 AM:
---

Over the last few days, I ran some tests for the cases above:

||HBaseVersion||Storage||QPS&Latency||FlameGraph||Comment||
|HBase2.0.4|SSD|[^HBase2.0.4-ssd-1000-rows-qps-latency.png]|[^HBase2.0.4-ssd-1000-rows-flamegraph.svg]|regionCount=100, rows=10^7, dataSizeOfTable=1.5GB, cacheHitRatio=100%|
|HBase2.0.4 + patch.v2|SSD|[^HBase2.0.4-patch-v2-ssd-1000-rows-qps-and-latency.png]|[^HBase2.0.4-patch-v2-ssd-1000-rows.svg]|regionCount=100, rows=10^7, dataSizeOfTable=1.5GB, cacheHitRatio=100%|
|HBase2.0.4 + patch.v3|SSD|[^HBase2.0.4-patch-v3-ssd-1000-rows-qps-and-latency.png]|[^HBase2.0.4-patch-v3-ssd-1000-rows-flamegraph.svg]|regionCount=100, rows=10^7, dataSizeOfTable=1.5GB, cacheHitRatio=100%|
|HBase1.4.9|SSD|[^HBase1.4.9-ssd-1000-rows-qps-latency.png]|[^HBase1.4.9-ssd-1000-rows-flamegraph.svg]|regionCount=100, rows=10^7, dataSizeOfTable=1.5GB, cacheHitRatio=100%|

Besides, I put together an overview of the statistics: !image-2019-01-07-19-03-55-577.png!

We can see that *the performance of HBase 1.4.9 is almost the same as HBase 2.0.4 with patch.v2*, so I think we are heading in the right direction for optimizing HBase 2.0 performance.

Now the question is how to write the patch. IMO, we can move getSerializedSize() (the variant without the tags parameter) and heapSize() up into the Cell interface to eliminate the instanceof checks and class casting. Also, pre-sizing the results ArrayList will help a lot; it does not need to be 1000, we can use Math.min(rows, 512) to avoid costing too much memory for a scan with a huge row limit. I haven't looked at the method inlining in detail yet; will try to do that. [~stack] FYI

was (Author: openinx):
Over the last few days, I ran some tests for the cases above:

||HBaseVersion||Storage||QPS&Latency||FlameGraph||Comment||
|HBase2.0.4|SSD|[^HBase2.0.4-ssd-1000-rows-qps-latency.png]|[^HBase2.0.4-ssd-1000-rows-flamegraph.svg]|regionCount=100, rows=10^7, dataSizeOfTable=1.5GB, cacheHitRatio=100%|
|HBase2.0.4 + patch.v2|SSD|[^HBase2.0.4-patch-v2-ssd-1000-rows-qps-and-latency.png]|[^HBase2.0.4-patch-v2-ssd-1000-rows.svg]|regionCount=100, rows=10^7, dataSizeOfTable=1.5GB, cacheHitRatio=100%|
|HBase1.4.9|SSD|[^HBase1.4.9-ssd-1000-rows-qps-latency.png]|[^HBase1.4.9-ssd-1000-rows-flamegraph.svg]|regionCount=100, rows=10^7, dataSizeOfTable=1.5GB, cacheHitRatio=100%|

Besides, I put together an overview of the statistics: !overview-statstics-1.png!

We can see that *the performance of HBase 1.4.9 is almost the same as HBase 2.0.4 with patch.v2*, so I think we are heading in the right direction for optimizing HBase 2.0 performance.

Now the question is how to write the patch. IMO, we can move getSerializedSize() (the variant without the tags parameter) and heapSize() up into the Cell interface to eliminate the instanceof checks and class casting. Also, pre-sizing the results ArrayList will help a lot; it does not need to be 1000, we can use Math.min(rows, 512) to avoid costing too much memory for a scan with a huge row limit. I haven't looked at the method inlining in detail yet; will try to do that. [~stack] FYI
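A minimal sketch of the two ideas above, under assumed names (this is not the committed patch): a size accessor on the cell interface itself, so the scan loop needs no instanceof checks or casts, plus a capped pre-sized results list.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch only: a cell-like interface that exposes its own sizes, so a hot
// scan loop can call them directly instead of checking "instanceof" and casting.
interface SizedCell {
  int getSerializedSize();
  long heapSize();
}

final class ScanLoopSketch {
  /** Cap the pre-allocation so a scan asking for a huge limit does not
   *  reserve excessive memory up front. */
  static List<SizedCell> newResultList(int requestedRows) {
    return new ArrayList<>(Math.min(requestedRows, 512));
  }

  /** Accumulate the response size with plain virtual calls, no casts. */
  static long accumulateSize(Iterable<SizedCell> cells) {
    long total = 0;
    for (SizedCell cell : cells) {
      total += cell.getSerializedSize();
    }
    return total;
  }
}
{code}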
[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.
[ https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733059#comment-16733059 ]

Zheng Hu edited comment on HBASE-21657 at 1/4/19 5:36 AM:
--

I made a performance comparison between HBase 2.0.4 without patch.v2 and HBase 2.0.4 with patch.v2:

||Comparison||QPS||FlameGraph||L2 cacheHitRatio||Latency||
|HBase2.0.4 without patch.v2|9979.8 ops/sec|[^hbase2.0.4-ssd-scan-traces.svg]|~95%|[^HBase2.0.4-without-patch-v2.png]|
|HBase2.0.4 with patch.v2|14392.7 ops/sec|[^hbase2.0.4-ssd-scan-traces.2.svg]|~95%|[^HBase2.0.4-with-patch.v2.png]|

So we can see there is a big difference between the two cases: after applying patch.v2, we got roughly a 40% throughput improvement (see the arithmetic check after this comment).

BTW, my testing environment was 5 nodes, each with 12*800G SSD / 24 cores / 128GB memory (50G on-heap + 50G off-heap for each RS, of which 36G was allocated to the BucketCache), and I used the YCSB 100% scan workload (by default, YCSB generates each scan with a limit in [1...1000]):

{code:java}
table=ycsb-test
columnfamily=C
recordcount=1
operationcount=1
workload=com.yahoo.ycsb.workloads.CoreWorkload
fieldlength=100
fieldcount=1
clientbuffering=true
readallfields=true
writeallfields=true
readproportion=0
updateproportion=0
scanproportion=1.0
insertproportion=0
requestdistribution=zipfian
{code}

was (Author: openinx):
I made a performance comparison between HBase 2.0.4 without patch.v2 and HBase 2.0.4 with patch.v2:

||Comparison||QPS||FlameGraph||L2 cacheHitRatio||Latency||
|HBase2.0.4 without patch.v2|9979.8 ops/sec|[^hbase2.0.4-ssd-scan-traces.svg]|~95%|[Graph.1|https://issues.apache.org/jira/secure/attachment/12953712/HBase2.0.4-without-patch-v2.png]|
|HBase2.0.4 with patch.v2|14392.7 ops/sec|[^hbase2.0.4-ssd-scan-traces.2.svg]|~95%|[Graph.2|https://issues.apache.org/jira/secure/attachment/12953711/HBase2.0.4-with-patch.v2.png]|

Later, I'll provide more details about the QPS & latency.

BTW, my testing environment was 5 nodes, each with 12*800G SSD / 24 cores / 128GB memory (50G on-heap + 50G off-heap for each RS, of which 36G was allocated to the BucketCache), and I used the YCSB 100% scan workload (by default, YCSB generates each scan with a limit in [1...1000]):

{code:java}
table=ycsb-test
columnfamily=C
recordcount=1
operationcount=1
workload=com.yahoo.ycsb.workloads.CoreWorkload
fieldlength=100
fieldcount=1
clientbuffering=true
readallfields=true
writeallfields=true
readproportion=0
updateproportion=0
scanproportion=1.0
insertproportion=0
requestdistribution=zipfian
{code}
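The improvement quoted above follows directly from the two measured QPS numbers; the exact ratio works out to about 44%, in line with the "~40%" statement (a trivial check, shown only to make the arithmetic explicit):

{code:java}
// Quick arithmetic check of the reported throughput improvement.
public class ThroughputDelta {
  public static void main(String[] args) {
    double baseline = 9979.8;   // HBase 2.0.4 without patch.v2 (ops/sec)
    double patched = 14392.7;   // HBase 2.0.4 with patch.v2 (ops/sec)
    double improvement = (patched - baseline) / baseline;
    System.out.printf("Throughput improvement: %.1f%%%n", improvement * 100); // ~44.2%
  }
}
{code}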
[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.
[ https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731757#comment-16731757 ]

Zheng Hu edited comment on HBASE-21657 at 1/2/19 3:59 AM:
--

bq. I think this method is only called if we actually return some Cells to the client

That's right.

bq. So I guess the assumption was that when the Cell need to ship over the network to the client anyway, that some CPU won't hurt. No longer true, I guess.

I don't think so: if the bottleneck were the network or the RPC layer, estimatedSerializedSizeOf should not cost so much in the flamegraph; the RPC-related methods would show a higher ratio.

bq. The cells being scanned not of type ExtendedCell?

I've checked the code path and added some logging. All the cells passed to PrivateCellUtil#estimatedSerializedSizeOf were SizeCachedKeyValue* or ByteBufferKeyValue (see HFileReaderImpl#getCell), so all of them should be instanceof ExtendedCell. It may be that the complicated conditionals kept the JVM from inlining the method.

Anyway, I'll provide a new performance report after applying patch.v1, which moves getSerializedSize from ExtendedCell to Cell to avoid the frequent instanceof checks; it's not a production patch, just for verification.

was (Author: openinx):
bq. I think this method is only called if we actually return some Cells to the client

That's right.

bq. So I guess the assumption was that when the Cell need to ship over the network to the client anyway, that some CPU won't hurt. No longer true, I guess.

I don't think so: if the bottleneck were the network or the RPC layer, estimatedSerializedSizeOf should not cost so much in the flamegraph; the RPC-related methods would show a higher ratio.

bq. The cells being scanned not of type ExtendedCell?

I've checked the code path and added some logging. All the cells passed to PrivateCellUtil#estimatedSerializedSizeOf were SizeCachedKeyValue* or ByteBufferKeyValue (see HFileReaderImpl#getCell), so all of them should be instanceof ExtendedCell. The complicated conditionals are likely why the JVM inlining did not work.

Anyway, I'll provide a new performance report after applying patch.v1, which moves getSerializedSize from ExtendedCell to Cell to avoid the frequent instanceof checks; it's not a production patch, just for verification.
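For context, a sketch of the instanceof-plus-cast pattern under suspicion above (assumed interface names; this is not the actual PrivateCellUtil code): the type check and the cast are paid for every cell returned by a scan, even when every cell in practice supports the fast path.

{code:java}
// Sketch only. Every call does a type check and a cast before it can reach
// the cheap size accessor; the proposal is to put the accessor on the base
// interface so the hot path becomes a single virtual call.
final class SizeEstimationSketch {
  interface CellLike { }

  interface ExtendedCellLike extends CellLike {
    int getSerializedSize(boolean withTags);
  }

  static int estimatedSerializedSizeOf(CellLike cell, int fallbackEstimate) {
    if (cell instanceof ExtendedCellLike) {                     // per-cell type check
      return ((ExtendedCellLike) cell).getSerializedSize(true); // plus a cast
    }
    return fallbackEstimate;                                    // slow path
  }
}
{code}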
[jira] [Comment Edited] (HBASE-21657) PrivateCellUtil#estimatedSerializedSizeOf has been the bottleneck in 100% scan case.
[ https://issues.apache.org/jira/browse/HBASE-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16730673#comment-16730673 ]

Lars Hofhansl edited comment on HBASE-21657 at 12/29/18 12:40 PM:
--

HBASE-20459 ... different issue.

was (Author: lhofhansl):
HBASE-20459