[jira] [Comment Edited] (HBASE-17958) Avoid passing unexpected cell to ScanQueryMatcher when optimize SEEK to SKIP

2018-12-12 Thread Lars Hofhansl (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-17958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719318#comment-16719318
 ] 

Lars Hofhansl edited comment on HBASE-17958 at 12/12/18 7:12 PM:
-

Looking at this one again... Since it popped up in the profiler again.
I tried moving the check into StoreFileScanner or HFileScanner, but much of the 
cost unfortunately is spent on the higher level (KeyValueHeap mostly).

I wanted to come back to the discussion about how often we need to check the 
next indexed key.
While it is true that key *may* change during heap.next(), this is just a 
heuristic based on the key we're looking to have an estimate whether seeking or 
skipping would be more effective.
Right now we're always paying the cost of an extra compare per K/V  to guard 
against the rare case when the scanner switches *and* that new new scanner has 
many versions. 

So I propose again moving that compare out of the loop, and only check once, 
it's good enough for a heuristic, and not needed for correctness, and in the 
the case I'm seeing this compare represents 40% of the time spent in 
StoreScanner.next().

In gist: This is a heuristic to try to guess whether SKIP or SEEK is better. It 
only has to be mostly right. I'll file a separate Jira.

[~Apache9], [~zghaobac]


was (Author: lhofhansl):
Looking at this one again... Since it popped up in the profiler again.
I tried moving the check into StoreFileScanner or HFileScanner, but much of the 
cost unfortunately is spent on the higher level (KeyValueHeap mostly).

I wanted to come back to the discussion about how often we need to check the 
next indexed key.
While it is true that key *may* change during heap.next(), this is just a 
heuristic based on the key we're looking to have an estimate whether seeking or 
skipping would be more effective.
Right now we're always paying the cost of an extra compare per K/V  to guard 
against the rare case when the scanner switches *and* that new new scanner has 
many versions. 

So I propose again moving that compare out of the loop, and only check once, 
it's good enough for a heuristic, and not needed for correctness, and in the 
the case I'm seeing this compare represents 40% of the time spent in 
StoreScanner.next().

In gist: This is a heuristic to try to guess whether SKIP or SEEK is better. It 
only has to be mostly right. I'll file a separate Jira.

> Avoid passing unexpected cell to ScanQueryMatcher when optimize SEEK to SKIP
> 
>
> Key: HBASE-17958
> URL: https://issues.apache.org/jira/browse/HBASE-17958
> Project: HBase
>  Issue Type: Bug
>Reporter: Guanghao Zhang
>Assignee: Guanghao Zhang
>Priority: Major
> Fix For: 1.4.0, 2.0.0
>
> Attachments: 0001-add-one-ut-testWithColumnCountGetFilter.patch, 
> 17958-add.txt, HBASE-17958-branch-1.patch, HBASE-17958-branch-1.patch, 
> HBASE-17958-branch-1.patch, HBASE-17958-branch-1.patch, HBASE-17958-v1.patch, 
> HBASE-17958-v2.patch, HBASE-17958-v3.patch, HBASE-17958-v4.patch, 
> HBASE-17958-v5.patch, HBASE-17958-v6.patch, HBASE-17958-v7.patch, 
> HBASE-17958-v7.patch
>
>
> {code}
> ScanQueryMatcher.MatchCode qcode = matcher.match(cell);
> qcode = optimize(qcode, cell);
> {code}
> The optimize method may change the MatchCode from SEEK_NEXT_COL/SEEK_NEXT_ROW 
> to SKIP. But it still pass the next cell to ScanQueryMatcher. It will get 
> wrong result when use some filter, etc. ColumnCountGetFilter. It just count 
> the  columns's number. If pass a same column to this filter, the count result 
> will be wrong. So we should avoid passing cell to ScanQueryMatcher when 
> optimize SEEK to SKIP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-17958) Avoid passing unexpected cell to ScanQueryMatcher when optimize SEEK to SKIP

2018-05-12 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16446908#comment-16446908
 ] 

Lars Hofhansl edited comment on HBASE-17958 at 5/12/18 10:59 PM:
-

Seems we got stuck here. [~vik.karma] did some comparative testing between 0.98 
and 1.x and found a 50% slowdown for cases with many columns. This is a highly 
likely culprit.

 


was (Author: lhofhansl):
Seems we god stuck here. [~vik.karma] did some comparative testing between 0.98 
and 1.x and found a 50% slowdown for cases with many columns. This is a highly 
likely culprit.

 

> Avoid passing unexpected cell to ScanQueryMatcher when optimize SEEK to SKIP
> 
>
> Key: HBASE-17958
> URL: https://issues.apache.org/jira/browse/HBASE-17958
> Project: HBase
>  Issue Type: Bug
>Reporter: Guanghao Zhang
>Assignee: Guanghao Zhang
>Priority: Major
> Fix For: 1.4.0, 2.0.0
>
> Attachments: 0001-add-one-ut-testWithColumnCountGetFilter.patch, 
> 17958-add.txt, HBASE-17958-branch-1.patch, HBASE-17958-branch-1.patch, 
> HBASE-17958-branch-1.patch, HBASE-17958-branch-1.patch, HBASE-17958-v1.patch, 
> HBASE-17958-v2.patch, HBASE-17958-v3.patch, HBASE-17958-v4.patch, 
> HBASE-17958-v5.patch, HBASE-17958-v6.patch, HBASE-17958-v7.patch, 
> HBASE-17958-v7.patch
>
>
> {code}
> ScanQueryMatcher.MatchCode qcode = matcher.match(cell);
> qcode = optimize(qcode, cell);
> {code}
> The optimize method may change the MatchCode from SEEK_NEXT_COL/SEEK_NEXT_ROW 
> to SKIP. But it still pass the next cell to ScanQueryMatcher. It will get 
> wrong result when use some filter, etc. ColumnCountGetFilter. It just count 
> the  columns's number. If pass a same column to this filter, the count result 
> will be wrong. So we should avoid passing cell to ScanQueryMatcher when 
> optimize SEEK to SKIP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-17958) Avoid passing unexpected cell to ScanQueryMatcher when optimize SEEK to SKIP

2017-10-17 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208826#comment-16208826
 ] 

Lars Hofhansl edited comment on HBASE-17958 at 10/18/17 5:40 AM:
-

I can fairly easily eliminate the repeated compare to the next indexed key - 
that does shave off only about 5% of the overall scan time, so not a lot, but 
measurable. I don't think there's any point to check again inside the loop.

The duplicate compares in trySkipToNextRow and trySkipToNextCol are still there 
and the column trackers are still there, though. Those could only be avoided by 
sending the fake cells all the up to the columnTracker, I agree that wasn't 
nice, maybe we can come with other way to eliminate this cost.
HBase is slow in scanning. Perhaps these micro-optimizations aren't worth it.

[~stack]


was (Author: lhofhansl):
I can fairly easily eliminate the repeated compare to the next indexed key - 
that does shave off only about 5% of the overall scan time, so not a lot, but 
measurable. I don't think there's any point to check again inside the loop.

The duplicate compares in trySkipToNextRow and trySkipToNextCol are still there 
and the column trackers are still there, though. Those could only be avoided by 
sending the fake cells all the up to the columnTracker, I agree that wasn't 
nice, maybe we can come with other way to eliminate this cost.
HBase is slow in scanning. Perhaps these micro-optimizations aren't worth it.

> Avoid passing unexpected cell to ScanQueryMatcher when optimize SEEK to SKIP
> 
>
> Key: HBASE-17958
> URL: https://issues.apache.org/jira/browse/HBASE-17958
> Project: HBase
>  Issue Type: Bug
>Reporter: Guanghao Zhang
>Assignee: Guanghao Zhang
> Fix For: 2.0.0, 1.4.0
>
> Attachments: 0001-add-one-ut-testWithColumnCountGetFilter.patch, 
> HBASE-17958-branch-1.patch, HBASE-17958-branch-1.patch, 
> HBASE-17958-branch-1.patch, HBASE-17958-branch-1.patch, HBASE-17958-v1.patch, 
> HBASE-17958-v2.patch, HBASE-17958-v3.patch, HBASE-17958-v4.patch, 
> HBASE-17958-v5.patch, HBASE-17958-v6.patch, HBASE-17958-v7.patch, 
> HBASE-17958-v7.patch
>
>
> {code}
> ScanQueryMatcher.MatchCode qcode = matcher.match(cell);
> qcode = optimize(qcode, cell);
> {code}
> The optimize method may change the MatchCode from SEEK_NEXT_COL/SEEK_NEXT_ROW 
> to SKIP. But it still pass the next cell to ScanQueryMatcher. It will get 
> wrong result when use some filter, etc. ColumnCountGetFilter. It just count 
> the  columns's number. If pass a same column to this filter, the count result 
> will be wrong. So we should avoid passing cell to ScanQueryMatcher when 
> optimize SEEK to SKIP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HBASE-17958) Avoid passing unexpected cell to ScanQueryMatcher when optimize SEEK to SKIP

2017-10-17 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208801#comment-16208801
 ] 

Lars Hofhansl edited comment on HBASE-17958 at 10/18/17 4:41 AM:
-

So I just ran a things through a profiler for some unrelated stuff and now find 
that a whopping *third* of the time for the scan I am doing is spent in 
{{trySkipToNextRow}}. That is not quite what I had intended with my original 
idea. :)

Looking into that function (in branch-1) I see that each time we SKIP 
(heap.next()) we re-check the next indexed key.

It's entirely possible that I am missing something... But it seems to me we 
should check for SEEK outside of the loop once, then either return false, or 
unconditionally continue SKIP'ing along.

[~Apache9], [~zghaobac].


was (Author: lhofhansl):
So I just ran a things through a profiler for some unrelated stuff and now find 
that a whopping *third* of the time for the scan I am doing is spent in 
{{trySkipToNextRow}}. That is certainly what I had intended with my original 
idea.

Looking into that function (in branch-1) I see that each time we SKIP 
(heap.next()) we re-check the next indexed key.

It's entirely possible that I am missing something... But it seems to me we 
should check for SEEK outside of the loop once, then either return false, or 
unconditionally continue SKIP'ing along.

[~Apache9], [~zghaobac].

> Avoid passing unexpected cell to ScanQueryMatcher when optimize SEEK to SKIP
> 
>
> Key: HBASE-17958
> URL: https://issues.apache.org/jira/browse/HBASE-17958
> Project: HBase
>  Issue Type: Bug
>Reporter: Guanghao Zhang
>Assignee: Guanghao Zhang
> Fix For: 2.0.0, 1.4.0
>
> Attachments: 0001-add-one-ut-testWithColumnCountGetFilter.patch, 
> HBASE-17958-branch-1.patch, HBASE-17958-branch-1.patch, 
> HBASE-17958-branch-1.patch, HBASE-17958-branch-1.patch, HBASE-17958-v1.patch, 
> HBASE-17958-v2.patch, HBASE-17958-v3.patch, HBASE-17958-v4.patch, 
> HBASE-17958-v5.patch, HBASE-17958-v6.patch, HBASE-17958-v7.patch, 
> HBASE-17958-v7.patch
>
>
> {code}
> ScanQueryMatcher.MatchCode qcode = matcher.match(cell);
> qcode = optimize(qcode, cell);
> {code}
> The optimize method may change the MatchCode from SEEK_NEXT_COL/SEEK_NEXT_ROW 
> to SKIP. But it still pass the next cell to ScanQueryMatcher. It will get 
> wrong result when use some filter, etc. ColumnCountGetFilter. It just count 
> the  columns's number. If pass a same column to this filter, the count result 
> will be wrong. So we should avoid passing cell to ScanQueryMatcher when 
> optimize SEEK to SKIP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HBASE-17958) Avoid passing unexpected cell to ScanQueryMatcher when optimize SEEK to SKIP

2017-10-17 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208801#comment-16208801
 ] 

Lars Hofhansl edited comment on HBASE-17958 at 10/18/17 4:40 AM:
-

So I just ran a things through a profiler for some unrelated stuff and now find 
that a whopping *third* of the time for the scan I am doing is spent in 
{{trySkipToNextRow}}. That is certainly what I had intended with my original 
idea.

Looking into that function (in branch-1) I see that each time we SKIP 
(heap.next()) we re-check the next indexed key.

It's entirely possible that I am missing something... But it seems to me we 
should check for SEEK outside of the loop once, then either return false, or 
unconditionally continue SKIP'ing along.

[~Apache9], [~zghaobac].


was (Author: lhofhansl):
So I just ran a things through a profiler for some unrelated stuff and now find 
that a whopping *third* of the scan I am doing is spent in 
{{trySkipToNextRow}}. That is certainly what I had intended with my original 
idea.

Looking into that function (in branch-1) I see that each time we SKIP 
(heap.next()) we re-check the next indexed key.

It's entirely possible that I am missing something... But it seems to me we 
should check for SEEK outside of the loop once, then either return false, or 
unconditionally continue SKIP'ing along.

[~Apache9], [~zghaobac].

> Avoid passing unexpected cell to ScanQueryMatcher when optimize SEEK to SKIP
> 
>
> Key: HBASE-17958
> URL: https://issues.apache.org/jira/browse/HBASE-17958
> Project: HBase
>  Issue Type: Bug
>Reporter: Guanghao Zhang
>Assignee: Guanghao Zhang
> Fix For: 2.0.0, 1.4.0
>
> Attachments: 0001-add-one-ut-testWithColumnCountGetFilter.patch, 
> HBASE-17958-branch-1.patch, HBASE-17958-branch-1.patch, 
> HBASE-17958-branch-1.patch, HBASE-17958-branch-1.patch, HBASE-17958-v1.patch, 
> HBASE-17958-v2.patch, HBASE-17958-v3.patch, HBASE-17958-v4.patch, 
> HBASE-17958-v5.patch, HBASE-17958-v6.patch, HBASE-17958-v7.patch, 
> HBASE-17958-v7.patch
>
>
> {code}
> ScanQueryMatcher.MatchCode qcode = matcher.match(cell);
> qcode = optimize(qcode, cell);
> {code}
> The optimize method may change the MatchCode from SEEK_NEXT_COL/SEEK_NEXT_ROW 
> to SKIP. But it still pass the next cell to ScanQueryMatcher. It will get 
> wrong result when use some filter, etc. ColumnCountGetFilter. It just count 
> the  columns's number. If pass a same column to this filter, the count result 
> will be wrong. So we should avoid passing cell to ScanQueryMatcher when 
> optimize SEEK to SKIP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HBASE-17958) Avoid passing unexpected cell to ScanQueryMatcher when optimize SEEK to SKIP

2017-10-17 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208801#comment-16208801
 ] 

Lars Hofhansl edited comment on HBASE-17958 at 10/18/17 4:40 AM:
-

So I just ran a things through a profiler for some unrelated stuff and now find 
that a whopping *third* of the scan I am doing is spent in 
{{trySkipToNextRow}}. That is certainly what I had intended with my original 
idea.

Looking into that function (in branch-1) I see that each time we SKIP 
(heap.next()) we re-check the next indexed key.

It's entirely possible that I am missing something... But it seems to me we 
should check for SEEK outside of the loop once, then either return false, or 
unconditionally continue SKIP'ing along.

[~Apache9], [~zghaobac].


was (Author: lhofhansl):
So I just ran a things through a profiler for some unrelated stuff and now find 
that a whopping *third* of the scan I am doing is spent in 
{{trySkipToNextRow}}. That is certainly what I had intended with my original 
idea.

Looking into that function (in branch-1) I see that each time we SKIP 
(heap.next()) we re-check the next indexed key.

It's entirely possible that I am missing something... But it seems to me we 
should check for SEEK outside of the loop once, then either return false, or 
unconditionally continue SKIP'ing along.


> Avoid passing unexpected cell to ScanQueryMatcher when optimize SEEK to SKIP
> 
>
> Key: HBASE-17958
> URL: https://issues.apache.org/jira/browse/HBASE-17958
> Project: HBase
>  Issue Type: Bug
>Reporter: Guanghao Zhang
>Assignee: Guanghao Zhang
> Fix For: 2.0.0, 1.4.0
>
> Attachments: 0001-add-one-ut-testWithColumnCountGetFilter.patch, 
> HBASE-17958-branch-1.patch, HBASE-17958-branch-1.patch, 
> HBASE-17958-branch-1.patch, HBASE-17958-branch-1.patch, HBASE-17958-v1.patch, 
> HBASE-17958-v2.patch, HBASE-17958-v3.patch, HBASE-17958-v4.patch, 
> HBASE-17958-v5.patch, HBASE-17958-v6.patch, HBASE-17958-v7.patch, 
> HBASE-17958-v7.patch
>
>
> {code}
> ScanQueryMatcher.MatchCode qcode = matcher.match(cell);
> qcode = optimize(qcode, cell);
> {code}
> The optimize method may change the MatchCode from SEEK_NEXT_COL/SEEK_NEXT_ROW 
> to SKIP. But it still pass the next cell to ScanQueryMatcher. It will get 
> wrong result when use some filter, etc. ColumnCountGetFilter. It just count 
> the  columns's number. If pass a same column to this filter, the count result 
> will be wrong. So we should avoid passing cell to ScanQueryMatcher when 
> optimize SEEK to SKIP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)