[ https://issues.apache.org/jira/browse/PHOENIX-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200775#comment-15200775 ]
Samarth Jain edited comment on PHOENIX-2724 at 3/18/16 1:32 AM:
----------------------------------------------------------------

I created a table with 300 million rows and 330K+ guideposts. I did some micro-benchmarking to see where we are spending time and this is what I have:

select * from testExplainPlanTime limit 10;

Time spent in computing 330045 BaseResultIterators.getParallelScans(): 122 ms
Included in the above time is the time spent in ScanRanges.intersectScan(): 82 ms
In SerialIterators.java, time spent by a single thread in creating 330045 iterators: 1589 ms
Total time spent in above tasks = 1589 + 122 = 1711 ms
Overall query time = 1809 ms

So it turns out the single biggest culprit is this piece of code in SerialIterators.java:
{code}
@Override
public PeekingResultIterator call() throws Exception {
    long startTime = System.currentTimeMillis();
    List<PeekingResultIterator> concatIterators = Lists.newArrayListWithExpectedSize(scans.size());
    for (final Scan scan : scans) {
        TableResultIterator scanner = new TableResultIterator(mutationState, tableRef, scan, context.getReadMetricsQueue().allotMetric(SCAN_BYTES, tableName), renewLeaseThreshold);
        conn.addIterator(scanner);
        concatIterators.add(iteratorFactory.newIterator(context, scanner, scan, tableName));
    }
    PeekingResultIterator concatIterator = ConcatResultIterator.newIterator(concatIterators);
    allIterators.add(concatIterator);
    System.out.println("Serial iterators - time taken to create " + scans.size() + " iterators : " + (System.currentTimeMillis() - startTime));
    return concatIterator;
}
{code}
Looping over 330K+ scans and creating iterators out of them takes up much of the query time.

was (Author: samarthjain):
I created a table with 300 million rows and 330K+ guideposts.
I did some micro-benchmarking to see where we are spending time and this is what I have:

select * from testExplainPlanTime limit 10;

Time spent in computing 698818 BaseResultIterators.getParallelScans(): 1858 ms
In SerialIterators.java, time spent by the thread in creating 698818 iterators: 3644
Total time taken:

Time spent in computing 330045 BaseResultIterators.getParallelScans(): 122 ms
Included in the above time is the time spent in ScanRanges.intersectScan(): 82 ms
In SerialIterators.java, time spent by a single thread in creating 330045 iterators: 1589
Total time spent in above tasks = 1589 + 122 = 1711 ms
Overall query time = 1809 ms

So it turns out the single biggest culprit is this piece of code in SerialIterators.java:
{code}
@Override
public PeekingResultIterator call() throws Exception {
    long startTime = System.currentTimeMillis();
    List<PeekingResultIterator> concatIterators = Lists.newArrayListWithExpectedSize(scans.size());
    for (final Scan scan : scans) {
        TableResultIterator scanner = new TableResultIterator(mutationState, tableRef, scan, context.getReadMetricsQueue().allotMetric(SCAN_BYTES, tableName), renewLeaseThreshold);
        conn.addIterator(scanner);
        concatIterators.add(iteratorFactory.newIterator(context, scanner, scan, tableName));
    }
    PeekingResultIterator concatIterator = ConcatResultIterator.newIterator(concatIterators);
    allIterators.add(concatIterator);
    System.out.println("Serial iterators - time taken to create " + scans.size() + " iterators : " + (System.currentTimeMillis() - startTime));
    return concatIterator;
}
{code}
Looping over 330K+ scans and creating iterators out of them takes up much of the query time.
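
To illustrate why eager construction hurts a LIMIT query, here is a minimal standalone sketch of the alternative: concatenating iterators that are only created when the previous one is exhausted. It is not Phoenix code and not necessarily how this issue was or will be fixed. LazyConcatIterator, openedCount() and openedForLimit() are hypothetical names, and plain java.util iterators over integers stand in for TableResultIterator and real scans.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

/** Hypothetical illustration only; none of these names exist in Phoenix. */
final class LazyConcatIterator<T> implements Iterator<T> {
    // Each supplier stands in for "create the iterator for one scan".
    private final Iterator<Supplier<Iterator<T>>> sources;
    private Iterator<T> current = null;
    private int opened = 0; // number of underlying iterators actually created

    LazyConcatIterator(List<Supplier<Iterator<T>>> sources) {
        this.sources = sources.iterator();
    }

    @Override
    public boolean hasNext() {
        // Open the next source only once the current one is exhausted.
        while (current == null || !current.hasNext()) {
            if (!sources.hasNext()) {
                return false;
            }
            current = sources.next().get();
            opened++;
        }
        return true;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    int openedCount() {
        return opened;
    }

    /**
     * Demo helper (an assumption for this sketch): builds scanCount fake
     * "scans" of rowsPerScan rows each, reads at most limit rows, and
     * returns how many underlying iterators had to be created.
     */
    static int openedForLimit(int scanCount, int rowsPerScan, int limit) {
        List<Supplier<Iterator<Integer>>> sources = new ArrayList<>(scanCount);
        for (int i = 0; i < scanCount; i++) {
            final int base = i * rowsPerScan;
            sources.add(() -> {
                List<Integer> rows = new ArrayList<>(rowsPerScan);
                for (int r = 0; r < rowsPerScan; r++) {
                    rows.add(base + r);
                }
                return rows.iterator();
            });
        }
        LazyConcatIterator<Integer> it = new LazyConcatIterator<>(sources);
        for (int n = 0; n < limit && it.hasNext(); n++) {
            it.next();
        }
        return it.openedCount();
    }
}
```

With 330045 fake scans of 4 rows each and a limit of 10, openedForLimit(330045, 4, 10) creates 3 underlying iterators instead of 330045. The sketch still allocates one small supplier per scan, so it only defers the per-scan construction cost rather than removing the loop; whether that is enough in Phoenix would need to be measured.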
> Query with large number of guideposts is slower compared to no stats
> --------------------------------------------------------------------
>
>                 Key: PHOENIX-2724
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2724
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.7.0
>         Environment: Phoenix 4.7.0-RC4, HBase-0.98.17 on an 8-node cluster
>            Reporter: Mujtaba Chohan
>            Assignee: Samarth Jain
>             Fix For: 4.8.0
>
>
> With a 1MB guidepost width on a ~900GB/500M-row table, queries with a short scan range get significantly slower.
> Without stats:
> {code}
> select * from T limit 10; // query execution time <100 msec
> {code}
> With stats:
> {code}
> select * from T limit 10; // query execution time >20 seconds
> Explain plan:
> CLIENT 876085-CHUNK 476569382 ROWS 876060986727 BYTES SERIAL 1-WAY FULL SCAN OVER T
>     SERVER 10 ROW LIMIT
> CLIENT 10 ROW LIMIT
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)