[
https://issues.apache.org/jira/browse/DRILL-7733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268995#comment-17268995
]
ASF GitHub Bot commented on DRILL-7733:
---------------------------------------
paul-rogers opened a new pull request #2149:
URL: https://github.com/apache/drill/pull/2149
# [DRILL-7733](https://issues.apache.org/jira/browse/DRILL-7733): Use
streaming for REST JSON queries
## Description
Modifies the REST API to stream JSON query results rather than buffering the
entire result set in memory, as was previously required. Buffering limited the
size of the result set that a REST query could return: large queries would run
out of memory. With the streaming solution, data is fed directly from the query
result to a JSON encoder and then back to the HTTP client with no buffering.
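The difference between buffering and streaming can be illustrated with a minimal sketch. This is not the code in this PR: `StreamingJsonSketch`, `streamRows`, the string-only column values and the `{"rows":[...]}` layout are all assumptions made for illustration, and Drill's real response carries more fields. The point is only that each batch is encoded and flushed as it arrives, so memory use is bounded by one batch rather than by the full result set.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical names, not Drill's classes or exact wire format. */
class StreamingJsonSketch {

  /** Encodes and flushes one batch at a time instead of collecting all rows first. */
  static void streamRows(Iterator<List<Map<String, String>>> batches, Writer out) {
    try {
      out.write("{\"rows\":[");
      boolean first = true;
      while (batches.hasNext()) {
        for (Map<String, String> row : batches.next()) {
          if (!first) {
            out.write(',');
          }
          first = false;
          out.write(toJson(row));
        }
        out.flush(); // hand this batch to the client before fetching the next one
      }
      out.write("]}");
      out.flush();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Serializes a single row as a JSON object with string values. */
  static String toJson(Map<String, String> row) {
    StringBuilder sb = new StringBuilder("{");
    boolean first = true;
    for (Map.Entry<String, String> col : row.entrySet()) {
      if (!first) {
        sb.append(',');
      }
      first = false;
      sb.append(quote(col.getKey())).append(':').append(quote(col.getValue()));
    }
    return sb.append('}').toString();
  }

  /** Minimal JSON string escaping: quotes, backslashes and control characters. */
  static String quote(String s) {
    StringBuilder sb = new StringBuilder("\"");
    for (char c : s.toCharArray()) {
      if (c == '"' || c == '\\') {
        sb.append('\\').append(c);
      } else if (c < 0x20) {
        sb.append(String.format("\\u%04x", (int) c));
      } else {
        sb.append(c);
      }
    }
    return sb.append('"').toString();
  }
}
```

In a servlet-style handler the `Writer` would wrap the HTTP response stream; here any `Writer` works, which keeps the sketch testable.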
Note that Drill has historically put the result schema *after* the data. The
reasoning was likely that the query schema can change many times during a query
run (with different fragments returning batches with differing schemas). The
schema-at-end model allows the schemas to be merged.
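To make the merge idea concrete, here is a rough sketch of folding per-batch schemas into one final schema that can only be emitted once all batches have been seen. Everything in it is hypothetical: `SchemaMergeSketch`, the name-to-type-name map, and in particular the rule that conflicting types collapse to `"UNION"` are assumptions for illustration, not a description of how Drill actually reconciles schemas.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical schema representation and conflict rule. */
class SchemaMergeSketch {

  /**
   * Merges batch schemas (column name -> type name), keeping columns in
   * first-seen order. A column seen with two different types is marked
   * "UNION" (an assumed policy, chosen only to keep the example small).
   */
  static Map<String, String> merge(List<Map<String, String>> batchSchemas) {
    Map<String, String> merged = new LinkedHashMap<>();
    for (Map<String, String> schema : batchSchemas) {
      for (Map.Entry<String, String> col : schema.entrySet()) {
        merged.merge(col.getKey(), col.getValue(),
            (oldType, newType) -> oldType.equals(newType) ? oldType : "UNION");
      }
    }
    return merged;
  }
}
```

Because the result depends on every batch, a schema computed this way is only known at the end of the query, which is why it has traditionally followed the data.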
However, with streaming, the schema-at-end model forces the client to buffer
the entire result set if the client needs the schema. A good improvement would
be to send the (first batch) schema *before* the data. Drill would somehow have
to deal with schema changes. As it turns out, ODBC and JDBC clients send the
schema before data and thus suffer from the same schema-change problem
described here. We've avoided having to address the ODBC/JDBC issue, so maybe
it won't be a problem in practice for the REST API if we send the first batch
schema before data. In any event, that would be a (simple) separate enhancement.
Refactors the existing JSON writer to work with the result set mechanism
which is then used as the implementation for streaming.
Refactors the internals of the REST API to allow for traditional "batch"
responses and the new streaming responses.
Revises the date/time methods for the row set API to use Java classes rather
than Joda. This change is required to integrate properly with the JSON writer.
The Joda Period class remains as there is no Java equivalent. Most of the
changed files, in fact, are for this date/time change.
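As background on the Period remark, assuming "Java classes" here means the `java.time` package: a Joda `Period` can carry both date fields (years, months, days) and time fields (hours, minutes, seconds) in one object, while `java.time` splits these between `Period` (date-based) and `Duration` (time-based). The sketch below shows that an interval such as "1 month, 2 days, 3 hours" needs two `java.time` values. `IntervalSketch` and `addInterval` are hypothetical names, not part of the row set API.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.Period;

/** Illustrative only: shows the date/time split in java.time intervals. */
class IntervalSketch {

  /** Applies the date-based part first, then the time-based part. */
  static LocalDateTime addInterval(LocalDateTime start, Period datePart, Duration timePart) {
    return start.plus(datePart).plus(timePart);
  }
}
```

For example, `addInterval(LocalDateTime.of(2021, 1, 1, 0, 0), Period.of(0, 1, 2), Duration.ofHours(3))` yields `2021-02-03T03:00`.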
A recent PR added get/set float methods to the row set API. This change was
redundant and added a large volume of code to avoid a single-instruction cast
and so is questionable. However, since we made it, we need to make it work.
This PR fixes a few holes found during this work.
## Documentation
The streaming form of JSON output is used only for REST queries:
`query.json`. It is not used for HTML. The change is invisible to the user
except that there is no longer a limit to the size of query results that the
REST API can return.
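For readers who want to try the endpoint, the following is a small client-side sketch using the JDK's `java.net.http` API (Java 11 or later). The host name, the default port 8047 and the `{"queryType":"SQL","query":...}` request body follow Drill's documented REST usage as far as we know, but treat them as assumptions and check them against your own deployment; `RestQuerySketch` and `buildQuery` are hypothetical names.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Illustrative only: builds a POST request for the query.json endpoint. */
class RestQuerySketch {

  static HttpRequest buildQuery(String host, String sql) {
    // Escape backslashes and double quotes so the SQL is a valid JSON string.
    String escaped = sql.replace("\\", "\\\\").replace("\"", "\\\"");
    String body = "{\"queryType\":\"SQL\",\"query\":\"" + escaped + "\"}";
    return HttpRequest.newBuilder(URI.create("http://" + host + ":8047/query.json"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
  }
}
```

Sending the request with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofInputStream())` lets the client read the response incrementally instead of buffering it, which is the natural counterpart to a streaming server.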
The Joda-to-Java time implementation change should be transparent to users
except in one very specific case: if users have created a provided schema that
includes a date/time format string. Such strings must be updated to Java
date/time format. Provided schema is, however, an obscure feature so it is
unlikely that any users are affected.
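One example of why a format string may need updating, assuming the Java side uses `java.time.format.DateTimeFormatter`: some pattern letters mean different things in the two libraries. In Joda, `Y` is year-of-era, whereas in `DateTimeFormatter` `Y` is the week-based year and `y` is year-of-era. A pattern such as `YYYY-MM-dd` therefore still parses as a valid pattern but can print the wrong year near a year boundary. `PatternSketch` is a hypothetical helper; the locale is pinned so the week definition is fixed.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Illustrative only: compares 'y' and 'Y' under java.time. */
class PatternSketch {

  static String format(String pattern, LocalDate date) {
    return DateTimeFormatter.ofPattern(pattern, Locale.US).format(date);
  }
}
```

With `LocalDate.of(2019, 12, 30)`, the pattern `yyyy-MM-dd` gives `2019-12-30`, while `YYYY-MM-dd` gives `2020-12-30` because that day falls in the first week of week-based year 2020.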
## Testing
Most changes are for the Joda replacement. All tests were rerun and updated
as needed. Drill previously had no unit tests for the REST API. This PR adds a
few simple tests, along with instructions for quickly using them to run ad-hoc
tests.
> Use streaming for REST JSON queries
> -----------------------------------
>
> Key: DRILL-7733
> URL: https://issues.apache.org/jira/browse/DRILL-7733
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.17.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
> Fix For: 1.19.0
>
>
> Several users on the user and dev mail lists have complained about the memory
> overhead when running a REST JSON query: {{http://node:8047/query.json}}.
> The current implementation buffers the entire result set in memory, then lets
> Jersey/Jetty convert the results to JSON. The result is very heavy heap use
> for larger query result sets.
> This ticket requests a change to use streaming. As each batch arrives at the
> Screen operator, convert that batch to JSON and directly stream the results
> to the client network connection, much as is done for the native client
> connection.
> For backward compatibility, the form of the JSON must be the same as the
> current API.