[
https://issues.apache.org/jira/browse/PHOENIX-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054914#comment-14054914
]
Gabriel Reid commented on PHOENIX-539:
--------------------------------------
{quote}The lease timeout is a different issue, I believe. It's caused primarily
if you're doing a group by or order by on too big a chunk of data. The client
in that case doesn't hear back from the server for a long time b/c it's busy
trying to sort/group. I believe the best solution for that is to improve the
parallelization such that smaller chunks are operated on so that the client
always hears back before the timeout occurs.{quote}
The issue I had in mind with the potential lease timeout was that there
could be too much time between accessing each scanner if you're doing some kind
of processing on the records while iterating over the ResultSet (rather than
simply streaming the rows). For example, consider code like this:
{code}
ResultSet rs = stmt.executeQuery("SELECT * FROM mytable");
while (rs.next()) {
    // Do something that takes a few milliseconds per row
    doSomethingExpensive(rs.getInt(1));
}
{code}
If each scanner is buffering 1000 rows at a time, and there are 10 parallel
scanners, then 10,000 rows are processed between visits to any one scanner.
Assuming the default scanner lease timeout of 60 seconds, the
{{doSomethingExpensive()}} method can't take more than 6 milliseconds per
call. Six milliseconds is obviously a long time, but if the number of
scanners or the size of the buffer increases by an order of magnitude, that
budget drops by an order of magnitude too. This might not be something that we need to
worry about -- it was actually my assumption that something like this was part
of the reason that the whole spooling thing was done in the first place.
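The 6 millisecond figure above can be reproduced with a quick calculation. This is only a rough sketch: it assumes a 60-second scanner lease timeout and that a given scanner is revisited only after the buffered rows of all scanners have been consumed. The class and method names are made up for illustration and are not part of Phoenix.

```java
// Hypothetical helper, not part of Phoenix: estimates how long each row may
// take to process before a scanner lease would expire.
class LeaseBudget {
    // Assumption: a scanner is revisited only after the buffered rows of all
    // scanners have been consumed, so the gap between visits is roughly
    // scanners * rowsPerScanner * (processing time per row).
    static double maxMillisPerRow(long leaseTimeoutMs, int scanners, int rowsPerScanner) {
        return (double) leaseTimeoutMs / ((long) scanners * rowsPerScanner);
    }

    public static void main(String[] args) {
        long leaseTimeoutMs = 60_000; // assumed scanner lease timeout (60 s)
        // 10 scanners buffering 1000 rows each
        System.out.println(maxMillisPerRow(leaseTimeoutMs, 10, 1000));
        // an order of magnitude more scanners
        System.out.println(maxMillisPerRow(leaseTimeoutMs, 100, 1000));
    }
}
```

With 10 scanners and 1000 buffered rows each, the budget comes out at 6 ms per row; with 100 scanners it shrinks to 0.6 ms.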
About the GROUP BY not using the ChunkedResultIterator, I believe this is
already the case. I'm pretty sure that the only case where the
ChunkedResultIterator can be used is via a ScanPlan, and (if I'm not mistaken)
due to GROUP BYs being executed via an AggregatePlan, I think it's OK there.
In any case, all the integration tests pass with the current patch. If you know
of any situations where this might not be the case (i.e. GROUP BY not using an
AggregatePlan), let me know and I'll add some tests for that.
I'm going to be (mostly) offline for the coming 7 days -- do you think it's
worth committing this now, or better to wait and consider going for the
approach that [~lhofhansl] outlined? In any case, if/when I commit this I'll
certainly add the JIRA ticket for not clearing out the hash cache so that this
could work for hash joins too.
> Implement parallel scanner that does not spool to disk
> ------------------------------------------------------
>
> Key: PHOENIX-539
> URL: https://issues.apache.org/jira/browse/PHOENIX-539
> Project: Phoenix
> Issue Type: Task
> Reporter: James Taylor
> Assignee: larsh
> Attachments: PHOENIX-539.1.patch, PHOENIX-539.patch
>
>
> In scenarios where a LIMIT is not present on a non-aggregate query that will
> return a lot of results, Phoenix spools the results to disk. This is less
> than ideal in these situations. @larsh has created a very good and relatively
> simple implementation that is queue based to replace this.