[ https://issues.apache.org/jira/browse/IMPALA-4268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604884#comment-16604884 ]
ASF subversion and git services commented on IMPALA-4268:
---------------------------------------------------------

Commit b288a6af2eda9631b2bad91896ae4bfd2a3fdf30 in impala's branch refs/heads/master from [~tarmstr...@cloudera.com]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=b288a6a ]

IMPALA-7477: Batch-oriented query set construction

Rework the row-by-row construction of query result sets in PlanRootSink
so that it materialises an output column at a time. Make some minor
optimisations like preallocating output vectors and initialising
strings more efficiently.

My intent is both to make this faster and to make the QueryResultSet
interface better before IMPALA-4268 does a bunch of surgery on this
part of the code.

Testing:
Ran core tests.

Perf:
Downloaded tpch_parquet.orders via JDBC driver.
Before: 3.01s, After: 2.57s.
Downloaded l_orderkey from tpch_parquet.lineitem.
Before: 1.21s, After: 1.08s.

Change-Id: Ibc87a84c34935d0d5841c7f5528eb802527fa809
Reviewed-on: http://gerrit.cloudera.org:8080/11297
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> buffer more than a batch of rows at coordinator
> -----------------------------------------------
>
>                 Key: IMPALA-4268
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4268
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.8.0
>            Reporter: Henry Robinson
>            Priority: Major
>              Labels: resource-management
>         Attachments: rows-produced-histogram.png
>
>
> In IMPALA-2905, we are introducing a {{PlanRootSink}} that handles the
> production of output rows at the root of a plan.
> The implementation in IMPALA-2905 has the plan execute in a separate thread
> to the consumer, which calls {{GetNext()}} to retrieve the rows.
> However, the sender thread will block until {{GetNext()}} is called, so that
> there are no complications about memory usage and ownership due to having
> several batches in flight at one time.
> However, this also leads to many context switches, as each {{GetNext()}} call
> yields to the sender to produce the rows. If the sender were to fill a buffer
> asynchronously, the consumer could pull out of that buffer without taking a
> context switch in many cases (and the extra buffering might smooth out any
> performance spikes due to client delays, which currently directly affect plan
> execution).
> The tricky part is managing the mismatch between the size of the row batches
> processed in {{Send()}} and the size of the fetch result asked for by the
> client. The sender materializes output rows in a {{QueryResultSet}} that is
> owned by the coordinator. That is not, currently, a splittable object -
> instead it contains the actual RPC response struct that will hit the wire
> when the RPC completes. An asynchronous sender cannot know the batch size,
> which may change on every fetch call. So the {{GetNext()}} implementation
> would need to be able to split out the {{QueryResultSet}} to match the
> correct fetch size, and handle stitching together other {{QueryResultSets}} -
> without doing extra copies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
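
For illustration only, here is a minimal sketch of the split-and-stitch
behaviour the issue description asks for. It is not Impala code: ResultBuffer,
Row, Batch and buffered_rows() are hypothetical names, a row is reduced to a
std::string, and the real QueryResultSet wraps an RPC response struct rather
than a vector. Synchronisation between the sender and consumer threads,
blocking and memory limits are all left out so that only the size mismatch
between Send() batches and the client's fetch size is shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins: one materialised output row and one sender batch.
using Row = std::string;
using Batch = std::vector<Row>;

// Hypothetical buffer sitting between the sender and the consumer.
// Not thread-safe: locking and back-pressure are deliberately omitted.
class ResultBuffer {
 public:
  // Sender side: enqueue a whole batch. The batch is moved into the queue,
  // so its rows are not copied.
  void Send(Batch batch) {
    if (batch.empty()) return;
    buffered_rows_ += batch.size();
    batches_.push_back(std::move(batch));
  }

  // Consumer side: return up to max_rows rows. A large batch is split
  // across calls, and small batches are stitched together in one call.
  Batch GetNext(std::size_t max_rows) {
    assert(max_rows > 0);
    Batch out;
    out.reserve(std::min(max_rows, buffered_rows_));
    while (out.size() < max_rows && !batches_.empty()) {
      Batch& front = batches_.front();
      const std::size_t take =
          std::min(max_rows - out.size(), front.size() - front_offset_);
      for (std::size_t i = 0; i < take; ++i) {
        // Move each row out of the buffered batch instead of copying it.
        out.push_back(std::move(front[front_offset_ + i]));
      }
      front_offset_ += take;
      if (front_offset_ == front.size()) {
        // The front batch is fully consumed; drop it and start the next.
        batches_.pop_front();
        front_offset_ = 0;
      }
    }
    buffered_rows_ -= out.size();
    return out;
  }

  // Number of rows currently waiting to be fetched.
  std::size_t buffered_rows() const { return buffered_rows_; }

 private:
  std::deque<Batch> batches_;      // Batches in arrival order.
  std::size_t front_offset_ = 0;   // Rows already taken from the front batch.
  std::size_t buffered_rows_ = 0;  // Total rows not yet fetched.
};
```

With a batch of three rows followed by a batch of two, a first GetNext(2)
splits the first batch, and a second GetNext(2) stitches its remaining row
together with the first row of the second batch. Moving rows one at a time is
only an approximation of "without doing extra copies"; a real implementation
working on the RPC response struct would need a different ownership scheme.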