paul-rogers opened a new pull request #2075: URL: https://github.com/apache/drill/pull/2075
# [DRILL-7730](https://issues.apache.org/jira/browse/DRILL-7730): Improve web query efficiency

## Description

Drill provides a REST API to run queries: `http://<host>:8047/query` and `/query.json`. This PR improves the memory efficiency of these queries.

Drill runs queries as a DAG of operators, rooted at the "Screen" operator. The Screen operator takes each output batch of the query and hands it over to a `UserClientConnection` object. In the original design, `UserClientConnection` corresponded to an RPC connection. So, the Screen operator converted the vectors in the outgoing batch into a `QueryWritableBatch`, which is an ordered list of buffers ready to send via Netty.

When the REST API was added, the simplest thing was to add a new REST-specific version of `UserClientConnection`, called `WebUserConnection`. Rather than sending the list of buffers off to the network, the web version converts the buffers back into a set of value vectors using the same deserialization code used in the Drill client. However, that deserialization code needs the data in the form of a single large buffer. So, the REST code copies the entire batch from the list of buffers into one large direct memory buffer. Then it converts that back into vectors.

Clearly, all this work simply gets us back where we started: the Screen operator has a batch of vectors, and the `WebUserConnection` recreates them, consuming lots of memory and CPU in the process. All of this work occurs in the query thread (not the REST request thread), making the query more costly than necessary.

So, the major part of this PR is to avoid the copy: allow the REST code to work with the batch given to Screen. This is done by creating a new level of indirection, the `QueryDataPackage` class. Now, Screen simply wraps the outgoing batch of vectors in a data package and hands that off to the `UserClientConnection`. The RPC version calls a method which does the conversion from vectors into a list of buffers.
But, the REST version calls a different method which returns the original batch of vectors. Voila, no more copying and no more extra direct memory overhead.

The `WebUserConnection` uses the vectors to create three on-heap structures: a list of column names, a list of column types, and a list of maps of rows. The rows are particularly inefficient and will be addressed in a separate PR.

As it turns out, the code that handled the column and metadata list had a bug: every incoming batch of data would append another copy to the in-memory list, resulting in many redundant objects. That bug is fixed in this PR.

The work to understand all this resulted in a "grand tour" of parts of Drill. Much code cleanup resulted. Also, `WebUserConnection` is split into two classes as part of the next phase (removing the on-heap buffered results).

## Documentation

N/A: the user-visible behavior of Drill is unchanged (though REST queries might be a bit faster).

## Testing

Reran all unit tests. Though, to be fair, the test suite includes basically no tests of the REST API. The test run instead ensured that nothing was broken in the main RPC pathway.
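
## Illustrative sketch

The indirection described in the Description can be pictured roughly as follows. This is a minimal sketch, not Drill's actual code: `Batch`, `DataPackage`, `Connection`, `RpcConnection` and `WebConnection` are hypothetical, simplified stand-ins for the real vector batch, `QueryDataPackage`, `UserClientConnection` and its two implementations, and plain Java arrays stand in for value vectors and Netty buffers. The "record the schema once" check is only an assumption about how the duplicate-metadata bug could be avoided; the PR text does not say how the real fix works.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-ins only; none of these are Drill's real classes. */
public class DataPackageSketch {

  /** Stand-in for a batch of value vectors: one String[] per column. */
  public static final class Batch {
    public final List<String> columnNames;
    public final List<String[]> columns;

    public Batch(List<String> columnNames, List<String[]> columns) {
      this.columnNames = columnNames;
      this.columns = columns;
    }

    public int rowCount() {
      return columns.isEmpty() ? 0 : columns.get(0).length;
    }
  }

  /** The indirection: Screen wraps the batch, the connection picks a view. */
  public static final class DataPackage {
    private final Batch batch;

    public DataPackage(Batch batch) {
      this.batch = batch;
    }

    /** REST view: hand back the original batch, with no copy. */
    public Batch batch() {
      return batch;
    }

    /** RPC view: serialize each column into its own buffer. */
    public List<byte[]> toWireBuffers() {
      List<byte[]> buffers = new ArrayList<>();
      for (String[] column : batch.columns) {
        buffers.add(String.join("\n", column).getBytes(StandardCharsets.UTF_8));
      }
      return buffers;
    }
  }

  public interface Connection {
    void sendData(DataPackage data);
  }

  /** RPC path: only this connection pays for serialization. */
  public static final class RpcConnection implements Connection {
    public final List<byte[]> sent = new ArrayList<>();

    @Override
    public void sendData(DataPackage data) {
      sent.addAll(data.toWireBuffers());
    }
  }

  /** REST path: reads the batch in place. */
  public static final class WebConnection implements Connection {
    public final List<String> columnNames = new ArrayList<>();
    public Batch lastBatch;
    public int rowCount;

    @Override
    public void sendData(DataPackage data) {
      Batch batch = data.batch();
      // Assumed fix: record column names once, not once per batch.
      if (columnNames.isEmpty()) {
        columnNames.addAll(batch.columnNames);
      }
      lastBatch = batch;
      rowCount += batch.rowCount();
    }
  }

  public static void main(String[] args) {
    Batch batch = new Batch(
        Arrays.asList("id", "name"),
        Arrays.asList(new String[] {"1", "2"}, new String[] {"a", "b"}));

    WebConnection web = new WebConnection();
    web.sendData(new DataPackage(batch));
    web.sendData(new DataPackage(batch));
    System.out.println("columns recorded: " + web.columnNames.size());
    System.out.println("same batch object: " + (web.lastBatch == batch));

    RpcConnection rpc = new RpcConnection();
    rpc.sendData(new DataPackage(batch));
    System.out.println("buffers sent: " + rpc.sent.size());
  }
}
```

The point of the shape is that the producer (Screen, in the PR's terms) no longer decides the output format. Each connection asks the package for the view it needs, so the REST path never serializes and deserializes the batch.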