[
https://issues.apache.org/jira/browse/PHOENIX-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355481#comment-15355481
]
Samarth Jain edited comment on PHOENIX-2724 at 6/29/16 5:03 PM:
----------------------------------------------------------------
So I debugged this by adding a bunch of log lines, and it turns out the culprit
here isn't the size of the stats cache. The default cache size is more than
enough to fit all of the guideposts in memory. The code below in
BaseResultIterators takes up the majority of the time when we have a large
number of guideposts.
{code}
try {
    // Walk every guidepost that falls within the scan range, producing one
    // (possible) scan per guidepost and accumulating the row/byte estimates.
    while (guideIndex < gpsSize
            && (currentGuidePost.compareTo(endKey) <= 0 || endKey.length == 0)) {
        Scan newScan = scanRanges.intersectScan(scan, currentKeyBytes,
                currentGuidePostBytes, keyOffset, false);
        if (newScan != null) {
            ScanUtil.setLocalIndexAttributes(newScan, keyOffset,
                    regionInfo.getStartKey(), regionInfo.getEndKey(),
                    newScan.getStartRow(), newScan.getStopRow());
        }
        if (newScan != null) {
            estimatedRows += gps.getRowCounts().get(guideIndex);
            estimatedSize += gps.getByteCounts().get(guideIndex);
        }
        scans = addNewScan(parallelScans, scans, newScan,
                currentGuidePostBytes, false, regionLocation);
        currentKeyBytes = currentGuidePost.copyBytes();
        // Decode the next prefix-encoded guidepost; this throws EOFException
        // once the last guidepost in the stream has been consumed.
        currentGuidePost = PrefixByteCodec.decode(decoder, input);
        currentGuidePostBytes = currentGuidePost.copyBytes();
        guideIndex++;
    }
} catch (EOFException e) {}
{code}
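Just to put the iteration count in perspective, here is a rough, self-contained baseline sketch (not Phoenix code; the class name and the 32-byte key length are made up for illustration). It mimics only the raw per-guidepost bookkeeping of the loop above (two byte[] copies and one key comparison) for the ~75,130 guideposts shown in the plan further down; whatever such a baseline doesn't account for has to come from the other calls in the loop (intersectScan, setLocalIndexAttributes, addNewScan, the prefix decode, and the boxed List.get lookups).
{code}
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical baseline (not Phoenix code): repeat the raw per-guidepost
// bookkeeping of the loop above (two byte[] copies plus one key comparison)
// for roughly the 75,130 guideposts reported in the explain plan.
public class GuidepostLoopBaseline {
    public static void main(String[] args) {
        final int guidepostCount = 75_130;        // chunk count from the explain plan
        final byte[] guidePostKey = new byte[32]; // assumed key length, purely illustrative
        final byte[] endKey = new byte[32];
        long sink = 0;                            // keeps the JIT from eliding the work
        long start = System.nanoTime();
        for (int i = 0; i < guidepostCount; i++) {
            byte[] currentKeyBytes = Arrays.copyOf(guidePostKey, guidePostKey.length);         // ~copyBytes()
            byte[] currentGuidePostBytes = Arrays.copyOf(guidePostKey, guidePostKey.length);   // ~copyBytes()
            sink += ByteBuffer.wrap(currentGuidePostBytes).compareTo(ByteBuffer.wrap(endKey)); // ~compareTo(endKey)
            sink += currentKeyBytes.length;
        }
        System.out.printf("baseline: %d iterations in %.1f ms (sink=%d)%n",
                guidepostCount, (System.nanoTime() - start) / 1e6, sink);
    }
}
{code}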
Another interesting thing to note is that the first call to explain plan takes
way longer than subsequent ones, and almost all of that time is spent in the
above loop. That time does *not* include the time spent fetching stats from the
server or reading the table metadata. Also of note is that the first explain
plan results in an EOFException which we end up ignoring. [~jamestaylor] -
should we just be swallowing the exception? Does it point to a bug in our stats
collection or prefix encoding/decoding? I am using the same phoenix connection
for every run. The fact that the EOF isn't triggered on subsequent executions
tells me there is possibly some stale state that isn't getting cleared out when
we close phoenix statements and/or result sets.
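For reference, here is a minimal, self-contained sketch of the generic decode-until-EOF idiom (plain DataInputStream and a made-up DecodeUntilEof class, not Phoenix's PrefixByteCodec). When the number of entries in a stream isn't tracked separately, hitting EOF is the only way the reader learns it has consumed the last entry, so the exception is caught and treated as normal loop termination. Whether that is actually the intent here, or whether the stats stream should never run dry before guideIndex reaches gpsSize, is exactly the question above.
{code}
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Phoenix's PrefixByteCodec): decode entries from a
// stream whose length is not known up front. EOFException is the only signal
// that the last entry has been read, so it is caught and swallowed on purpose.
public class DecodeUntilEof {
    public static void main(String[] args) throws IOException {
        byte[] encoded = {0, 1, 0, 2, 0, 3};   // three 2-byte "guideposts"
        DataInputStream input = new DataInputStream(new ByteArrayInputStream(encoded));
        List<Integer> guideposts = new ArrayList<>();
        try {
            while (true) {
                guideposts.add((int) input.readShort());  // throws EOFException after the last entry
            }
        } catch (EOFException endOfStream) {
            // end of the encoded stream: expected in this idiom, not an error
        }
        System.out.println("decoded " + guideposts);      // prints: decoded [1, 2, 3]
    }
}
{code}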
{noformat}
+----------------------------------------------------------------------------------------------------+
|                                                PLAN                                                 |
+----------------------------------------------------------------------------------------------------+
| CLIENT 75130-CHUNK 7672485 ROWS 2775600674 BYTES SERIAL 1-WAY ROUND ROBIN FULL SCAN OVER LINEITEM   |
| SERVER 1 ROW LIMIT                                                                                  |
| CLIENT 1 ROW LIMIT                                                                                  |
+----------------------------------------------------------------------------------------------------+
3 rows selected (32.121 seconds)
{noformat}
Subsequent executions of the explain plan/query compilation take less time,
but are still on the order of seconds.
{noformat}
+----------------------------------------------------------------------------------------------------+
|                                                PLAN                                                 |
+----------------------------------------------------------------------------------------------------+
| CLIENT 75130-CHUNK 7672485 ROWS 2775600674 BYTES SERIAL 1-WAY ROUND ROBIN FULL SCAN OVER LINEITEM   |
| SERVER 1 ROW LIMIT                                                                                  |
| CLIENT 1 ROW LIMIT                                                                                  |
+----------------------------------------------------------------------------------------------------+
3 rows selected (4.729 seconds)
{noformat}
I couldn't identify any single method call in the loop mentioned above that is
taking a long time. It is probably an accumulation of lots of small calls that
each take at most a few milliseconds. Will keep digging.
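For anyone who wants to reproduce the first-run vs. subsequent-run gap, here is a rough measurement sketch against a single connection. The ZK quorum in the JDBC URL is a placeholder, and the EXPLAIN statement is only a guess at the shape of the query behind the plans above (a LIMIT 1 scan over LINEITEM); substitute whatever you are actually running.
{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hedged reproduction sketch: run EXPLAIN several times on one Phoenix
// connection and print the wall clock time per run, so the first compilation
// can be compared against subsequent ones. URL and query are placeholders.
public class ExplainTiming {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {
            for (int run = 1; run <= 3; run++) {
                long start = System.currentTimeMillis();
                try (ResultSet rs = stmt.executeQuery("EXPLAIN SELECT * FROM LINEITEM LIMIT 1")) {
                    while (rs.next()) {
                        rs.getString(1); // consume the plan rows; plan text is in column 1
                    }
                }
                System.out.println("run " + run + ": " + (System.currentTimeMillis() - start) + " ms");
            }
        }
    }
}
{code}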
> Query with large number of guideposts is slower compared to no stats
> --------------------------------------------------------------------
>
> Key: PHOENIX-2724
> URL: https://issues.apache.org/jira/browse/PHOENIX-2724
> Project: Phoenix
> Issue Type: Bug
> Affects Versions: 4.7.0
> Environment: Phoenix 4.7.0-RC4, HBase-0.98.17 on a 8 node cluster
> Reporter: Mujtaba Chohan
> Assignee: Samarth Jain
> Fix For: 4.8.0
>
> Attachments: PHOENIX-2724.patch, PHOENIX-2724_addendum.patch, PHOENIX-2724_v2.patch
>
>
> With a 1MB guidepost width on a ~900GB/500M row table, queries with a short
> scan range get significantly slower.
> Without stats:
> {code}
> select * from T limit 10; // query execution time <100 msec
> {code}
> With stats:
> {code}
> select * from T limit 10; // query execution time >20 seconds
> Explain plan: CLIENT 876085-CHUNK 476569382 ROWS 876060986727 BYTES SERIAL 1-WAY FULL SCAN OVER T SERVER 10 ROW LIMIT CLIENT 10 ROW LIMIT
> {code}