[jira] [Commented] (PHOENIX-1779) Parallelize fetching of next batch of records for scans corresponding to queries with no order by

Samarth Jain (JIRA) Wed, 15 Apr 2015 23:39:07 -0700

    [ 
https://issues.apache.org/jira/browse/PHOENIX-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497624#comment-14497624
 ]


Samarth Jain commented on PHOENIX-1779:
---------------------------------------

bq. When is this else case ever executed in getIterators(), as you initialize 
this in the constructor? Seems like you just need the if statement:
{code}
+    private List<RoundRobinIteratorState> getIterators() throws SQLException {
+        if (closed) { return Collections.emptyList(); }
+        if (openIterators.size() > 0 && openIterators.size() == numScannersCacheExhausted) {
+            /*
+             * All the scanners have exhausted their cache. Submit the scanners back to the pool so that they can fetch
+             * the next batch of records in parallel.
+             */
+            initOpenIterators(fetchNextBatch());
+        } else if (openIterators.size() == 0 && resultIterators != null) {
+            List<PeekingResultIterator> iterators = resultIterators.getIterators();
+            initOpenIterators(iterators);
+        }
+        return openIterators;
+    }
+
{code}

We have two constructors:
{code}
public RoundRobinResultIterator(ResultIterators iterators, QueryPlan plan) {
        this.resultIterators = iterators;
        this.plan = plan;
        this.threshold = getThreshold();
}

public RoundRobinResultIterator(List<PeekingResultIterator> iterators, QueryPlan plan) {
        this.resultIterators = null;
        this.plan = plan;
        this.threshold = getThreshold();
        initOpenIterators(iterators);
}
{code}

The first one is called from ScanPlan and the second one from 
PhoenixRecordReader. The first constructor does not call initOpenIterators(), 
so openIterators is still empty and resultIterators is non-null the first time 
getIterators() runs. That is when the else block executes. The idea (borrowed 
from ConcatResultIterator) is to call resultIterators.getIterators() only when 
it is actually needed.
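
A minimal sketch of that lazy-initialization pattern, for illustration only. 
LazyIteratorHolder and its String "iterators" are hypothetical stand-ins, not 
Phoenix classes; the Supplier plays the role of ResultIterators, and the two 
constructors mirror the ScanPlan and PhoenixRecordReader paths described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for RoundRobinResultIterator; not Phoenix code. */
public class LazyIteratorHolder {
    // Null when the iterators were handed over eagerly.
    private final Supplier<List<String>> source;
    private final List<String> open = new ArrayList<>();

    /** Lazy path (analogous to the ScanPlan constructor): nothing is fetched yet. */
    public LazyIteratorHolder(Supplier<List<String>> source) {
        this.source = source;
    }

    /** Eager path (analogous to the PhoenixRecordReader constructor). */
    public LazyIteratorHolder(List<String> iterators) {
        this.source = null;
        open.addAll(iterators);
    }

    /** Asks the source for iterators only if none are open and a source exists. */
    public List<String> getIterators() {
        if (open.isEmpty() && source != null) {
            open.addAll(source.get());
        }
        return open;
    }
}
```

With the lazy constructor, the supplier is not invoked until the first 
getIterators() call; with the eager constructor, that branch is never taken 
because the source is null.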

bq. I feel like the currentIterator() logic could be simplified a bit.

Let me see what I can do here. It would probably help to inline the code within 
next() itself, as you suggested; that indirection isn't helping. I also like 
the suggestion of moving Tuple into RoundRobinIteratorState.

> Parallelize fetching of next batch of records for scans corresponding to 
> queries with no order by 
> --------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-1779
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1779
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Samarth Jain
>            Assignee: Samarth Jain
>         Attachments: PHOENIX-1779.patch, PHOENIX-1779_v2.patch, wip.patch, 
> wip3.patch, wipwithsplits.patch
>
>
> Today in Phoenix we parallelize only the first execution of scans, i.e. we 
> load only the first batch of records, up to the scan's cache size, in 
> parallel. Loading of subsequent batches of records in scanners is essentially 
> serial. This could be improved, especially for queries that do not need any 
> kind of merge sort on the client, including the ones with no order by clause. 
> This could also potentially improve the performance of UPSERT SELECT 
> statements that load data from one table and insert into another. One such 
> use case is creating immutable indexes for tables that already have data. 
> It could also potentially improve the performance of our MapReduce solution 
> for bulk loading data by improving the speed of the loading/mapping phase. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
