paul-rogers opened a new pull request #2075:
URL: https://github.com/apache/drill/pull/2075


   # [DRILL-7730](https://issues.apache.org/jira/browse/DRILL-7730): Improve web query efficiency
   
   ## Description
   
   Drill provides a REST API to run queries: `http://<host>:8047/query` and 
`/query.json`. This PR improves the memory efficiency of these queries.
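
For readers unfamiliar with the endpoint, here is a minimal sketch of how a client might build such a request in Java. The JSON body shape (`queryType`, `query`) and the sample table follow the Drill REST documentation as I recall it, not anything in this PR, and `RestQuerySketch`, `jsonBody` and `buildRequest` are hypothetical names. The request is built but not sent, so the sketch runs without a Drillbit.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class RestQuerySketch {

  // Hypothetical helper: escape backslashes and double quotes so the SQL text
  // fits inside a JSON string. Control characters are not handled.
  static String jsonBody(String sql) {
    String escaped = sql.replace("\\", "\\\\").replace("\"", "\\\"");
    return "{\"queryType\": \"SQL\", \"query\": \"" + escaped + "\"}";
  }

  // Build (but do not send) a POST to the /query.json endpoint on port 8047.
  static HttpRequest buildRequest(String host, String sql) {
    return HttpRequest.newBuilder(URI.create("http://" + host + ":8047/query.json"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(jsonBody(sql)))
        .build();
  }

  public static void main(String[] args) {
    HttpRequest request =
        buildRequest("localhost", "SELECT * FROM cp.`employee.json` LIMIT 5");
    System.out.println(request.method() + " " + request.uri());
    // Against a live Drillbit one would then call, for example:
    // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
  }
}
```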
   
   Drill runs queries as a DAG of operators, rooted on the "Screen" operator. 
The Screen operator takes each output batch of the query and hands it over to a 
`UserClientConnection` object. In the original design, each 
`UserClientConnection` corresponded to an RPC connection. So, the Screen 
operator converted the vectors in the outgoing batch into a 
`QueryWritableBatch` which is an ordered list of buffers ready to send via 
Netty.
   
   When the REST API was added, the simplest thing was to add a new 
REST-specific version of `UserClientConnection`, called `WebUserConnection`. 
Rather than sending our list of buffers off to the network, the web version 
converts the buffers back into a set of value vectors using the same 
deserialization code used in the Drill client. However, that deserialization 
code needs the data in the form of a single large buffer. So, the REST code 
copies the entire batch from the list of buffers into one large direct memory 
buffer. Then it converts that back into vectors.
   
   Clearly, all this work simply gets us back where we started: the Screen 
operator has a batch of vectors, the `WebUserConnection` recreates them, 
consuming lots of memory and CPU in the process. All of this work occurs in the 
query thread (not the REST request thread), making the query more costly than 
necessary.
   
   So, the major part of this PR is to avoid the copy: allow the REST code to 
work with the batch given to Screen.
   
   This is done by creating a new level of indirection, the `QueryDataPackage` 
class. Now, Screen simply wraps the outgoing batch of vectors in a data package 
and hands that off to the `UserClientConnection`. The RPC version calls a 
method which does the conversion from vectors into a list of buffers. But, the 
REST version calls a different method which returns the original batch of 
vectors. Voila, no more copying and no more extra direct memory overhead.
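
The indirection described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `Batch`, `DataPackage`, `RpcConnection` and `WebConnection` are hypothetical stand-ins for the vector batch, `QueryDataPackage` and the two `UserClientConnection` implementations, and the `int[]`-to-bytes serialization only stands in for the real vector-to-buffer conversion.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DataPackageSketch {

  // Stand-in for a batch of value vectors: one int[] per column.
  static final class Batch {
    final List<int[]> columns;
    Batch(List<int[]> columns) { this.columns = columns; }
  }

  // Hypothetical package: wraps the batch and lets each consumer pick its form.
  static final class DataPackage {
    private final Batch batch;
    DataPackage(Batch batch) { this.batch = batch; }

    // RPC path: convert each column into a buffer ready for the wire.
    List<byte[]> toBuffers() {
      List<byte[]> buffers = new ArrayList<>();
      for (int[] column : batch.columns) {
        ByteBuffer buf = ByteBuffer.allocate(column.length * Integer.BYTES);
        for (int value : column) {
          buf.putInt(value);
        }
        buffers.add(buf.array());
      }
      return buffers;
    }

    // REST path: return the original batch; nothing is converted or copied.
    Batch batch() { return batch; }
  }

  interface Connection { void send(DataPackage pkg); }

  // Asks the package for buffers, as the RPC connection would.
  static final class RpcConnection implements Connection {
    int bytesSent;
    public void send(DataPackage pkg) {
      for (byte[] buffer : pkg.toBuffers()) {
        bytesSent += buffer.length;
      }
    }
  }

  // Asks the package for the batch itself, as the REST connection would.
  static final class WebConnection implements Connection {
    Batch lastBatch;
    public void send(DataPackage pkg) { lastBatch = pkg.batch(); }
  }

  public static void main(String[] args) {
    Batch batch = new Batch(Arrays.asList(new int[] {1, 2, 3}, new int[] {4, 5, 6}));
    DataPackage pkg = new DataPackage(batch);
    WebConnection web = new WebConnection();
    web.send(pkg);
    System.out.println("REST sees the same batch object: " + (web.lastBatch == batch));
  }
}
```

The point of the design is that the producer (Screen) no longer decides the wire format; each connection type asks the package for the representation it needs.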
   
   The `WebUserConnection` uses the vectors to create three on-heap structures: 
a list of column names, a list of column types, and a list of maps of rows. The 
rows are particularly inefficient and will be addressed in a separate PR. As it 
turns out, the code that handled the column and metadata list had a bug: every 
incoming batch of data would append another copy to the in-memory list, 
resulting in many redundant objects. That bug is fixed in this PR.
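
To illustrate the kind of bug described above, here is a small sketch that contrasts appending column metadata on every batch with recording it once. It is an assumption about the shape of the problem, not the actual Drill code or the actual fix: `ResultCollector`, `addBatch` and the `recordSchemaOnce` flag are hypothetical, and schema changes between batches are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnListSketch {

  // Hypothetical collector for the on-heap column and type lists.
  static final class ResultCollector {
    final List<String> columns = new ArrayList<>();
    final List<String> types = new ArrayList<>();
    int rowCount;
    private final boolean recordSchemaOnce;

    ResultCollector(boolean recordSchemaOnce) {
      this.recordSchemaOnce = recordSchemaOnce;
    }

    void addBatch(String[] names, String[] batchTypes, int rows) {
      // Buggy mode appends the schema for every batch; fixed mode appends it
      // only when nothing has been recorded yet.
      if (!recordSchemaOnce || columns.isEmpty()) {
        columns.addAll(Arrays.asList(names));
        types.addAll(Arrays.asList(batchTypes));
      }
      rowCount += rows;
    }
  }

  public static void main(String[] args) {
    String[] names = {"id", "name"};
    String[] types = {"INT", "VARCHAR"};
    ResultCollector buggy = new ResultCollector(false);
    ResultCollector fixed = new ResultCollector(true);
    for (int i = 0; i < 3; i++) {
      buggy.addBatch(names, types, 100);
      fixed.addBatch(names, types, 100);
    }
    System.out.println("buggy columns: " + buggy.columns.size());
    System.out.println("fixed columns: " + fixed.columns.size());
  }
}
```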
   
   The work to understand all this resulted in a "grand tour" of parts of Drill. 
Much code cleanup resulted. Also, `WebUserConnection` is split into two classes 
as part of the next phase (removing the on-heap buffered results).
   
   ## Documentation
   
   N/A: the user-visible behavior of Drill is unchanged (though REST queries 
might be a bit faster).
   
   ## Testing
   
   Reran all unit tests. Though, to be fair, the test suite includes basically 
no tests of the REST API. The test run instead ensured that nothing was broken 
in the main RPC pathway.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

