[jira] [Commented] (PHOENIX-539) Implement parallel scanner that does not spool to disk

Gabriel Reid (JIRA) Tue, 08 Jul 2014 06:08:21 -0700

    [ 
https://issues.apache.org/jira/browse/PHOENIX-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054914#comment-14054914
 ]


Gabriel Reid commented on PHOENIX-539:
--------------------------------------

{quote}The lease timeout is a different issue, I believe. It's caused primarily 
if you're doing a group by or order by on too big a chunk of data. The client 
in that case doesn't hear back from the server for a long time b/c it's busy 
trying to sort/group. I believe the best solution for that is to improve the 
parallelization such that smaller chunks are operated on so that the client 
always hears back before the timeout occurs.{quote}

The issue I had in mind with the potential lease timeout was that there 
could be too much time between accessing each scanner if you're doing some kind 
of processing on the records while iterating over the ResultSet (rather than 
simply streaming the rows). For example, consider code like this:
{code}
ResultSet rs = stmt.executeQuery("SELECT * FROM mytable");
while (rs.next()) {
    // Do something that takes a few milliseconds
    doSomethingExpensive(rs.getInt(1));
}
{code}

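If each scanner is buffering 1000 rows at a time, and there are 10 parallel 
scanners, then roughly 10,000 rows are consumed between successive accesses to 
any one scanner. Assuming the default 60-second scanner lease timeout, the 
{{doSomethingExpensive()}} method can't take more than 6 milliseconds per call. 
Six milliseconds is obviously a long time, but if the number of 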
scanners or size of the buffer increases by an order of magnitude, this will 
drop by an order of magnitude. This might not be something that we need to 
worry about -- it was actually my assumption that something like this was part 
of the reason that the whole spooling thing was done in the first place.
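
The arithmetic behind that figure can be sketched as follows. This is only a 
rough model, not code from the patch: it assumes a 60-second scanner lease 
timeout (the HBase default, as far as I know) and that the client drains the 
buffers of all scanners before it returns to any one of them. The class and 
method names are made up for illustration.

```java
public final class LeaseBudget {

    /**
     * Upper bound, in milliseconds, on client-side processing time per row
     * before a scanner lease could expire under the simplified model above.
     */
    public static double maxMillisPerRow(long leaseTimeoutMillis,
                                         int scanners,
                                         int rowsPerBuffer) {
        if (scanners <= 0 || rowsPerBuffer <= 0) {
            throw new IllegalArgumentException("scanners and rowsPerBuffer must be positive");
        }
        // Rows the client works through between two visits to the same scanner.
        long rowsBetweenVisits = (long) scanners * rowsPerBuffer;
        return (double) leaseTimeoutMillis / rowsBetweenVisits;
    }

    public static void main(String[] args) {
        long assumedLeaseMillis = 60_000L; // assumption: 60-second lease timeout

        // 10 scanners x 1000 buffered rows: the 6 ms case discussed above.
        System.out.println(maxMillisPerRow(assumedLeaseMillis, 10, 1000));

        // Ten times as many scanners: the budget shrinks by the same factor.
        System.out.println(maxMillisPerRow(assumedLeaseMillis, 100, 1000));
    }
}
```

With a larger lease timeout the budget grows proportionally, so raising the 
timeout is one possible workaround if this ever turns out to be a problem in 
practice.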

About the GROUP BY not using the ChunkedResultIterator, I believe this is 
already the case. I'm pretty sure that the only case where the 
ChunkedResultIterator can be used is via a ScanPlan, and (if I'm not mistaken) 
due to GROUP BYs being executed via an AggregatePlan, I think it's OK there. 
In any case, all the integration tests pass with the current patch. If you know 
of any situations where this might not be the case (i.e. GROUP BY not using an 
AggregatePlan), let me know and I'll add some tests for that.

I'm going to be (mostly) offline for the coming 7 days -- do you think it's 
worth committing this now, or better to wait and consider going for the 
approach that [~lhofhansl] outlined? In any case, if/when I commit this I'll 
certainly add the JIRA ticket for not clearing out the hash cache so that this 
could work for hash joins too.


> Implement parallel scanner that does not spool to disk
> ------------------------------------------------------
>
>                 Key: PHOENIX-539
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-539
>             Project: Phoenix
>          Issue Type: Task
>            Reporter: James Taylor
>            Assignee: larsh
>         Attachments: PHOENIX-539.1.patch, PHOENIX-539.patch
>
>
> In scenarios where a LIMIT is not present on a non aggregate query that will 
> return a lot of results, Phoenix spools the results to disk. This is less 
> than ideal in these situations. @larsh has created a very good and relatively 
> simple implementation that is queue based to replace this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
