gianm commented on issue #12262:
URL: https://github.com/apache/druid/issues/12262#issuecomment-1162615382
An update: we're planning to start opening PRs for the batch-query
version of the feature in the near future. This batch-query version runs via
the indexing service, and is suitable for ingestion and data-management queries
(INSERT and REPLACE, as described in #11929). Each PR is meant to cover one new
area: they're split up this way to aid review and discussion. Here's the
planned sequence.
1. A more reliable RPC client for task -> task and task -> Overlord
communication. This one is important for ensuring that the communication
between the servers involved in a query is really robust.
2. A binary format for transferring data between servers, and for spilling
data to disk, that is more efficient than the Smile format we currently use.
This one is important because of the large volume of data transferred between
servers. This would be in core. At first it would only be used by the query
task (see point 3), but I do think it would make sense to use it for more
things over time, like Historical -> Broker communication and spilling for
GroupBy queries.
3. An indexing service task that can run Druid queries ("query task"), that
can accept a target datasource describing where to put the results, and that
can handle "external" typed datasources. This would be able to run the kinds of
queries generated by the functionality in #11929 (i.e., the kind currently
being generated in CalciteInsertDmlTest and CalciteReplaceDmlTest). This would
likely be in an extension.
4. SQL bindings for the query task. This would involve adding an endpoint
that creates a query task rather than running a query using the
Broker-and-Historical-based stack. This change would enable an INSERT or
REPLACE query to actually execute as a query task. This would likely be in that
same extension.
5. Web console support for the new SQL syntax and endpoint.
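To make step 1 concrete: the core of a "more reliable" RPC client is usually a retry loop with capped exponential backoff around each call. The sketch below is illustrative only; the class and method names are made up for this comment, not Druid's actual client API. It just shows the retry behavior that lets task -> task and task -> Overlord calls survive transient failures:

```java
import java.util.concurrent.Callable;

public class RetryingCall {
  // Minimal sketch, assuming a generic retryable call; not Druid's real API.
  // Retries the call up to maxAttempts times with capped exponential backoff.
  public static <T> T callWithRetries(Callable<T> call, int maxAttempts) throws Exception {
    long backoffMs = 100;
    for (int attempt = 1; ; attempt++) {
      try {
        return call.call();
      } catch (Exception e) {
        if (attempt >= maxAttempts) {
          throw e; // out of attempts: propagate the last failure
        }
        Thread.sleep(backoffMs);
        backoffMs = Math.min(backoffMs * 2, 5_000); // cap the backoff at 5s
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Simulate a flaky endpoint that fails twice, then succeeds.
    int[] calls = {0};
    String result = callWithRetries(() -> {
      if (++calls[0] < 3) {
        throw new RuntimeException("connection reset");
      }
      return "ok";
    }, 5);
    System.out.println(result + " after " + calls[0] + " attempts"); // prints "ok after 3 attempts"
  }
}
```

A real client would also need to distinguish retryable failures (connection reset, service redeploying) from permanent ones, and to re-resolve server locations between attempts.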
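For step 2, the efficiency win over a general-purpose format like Smile comes from a fixed, positional layout: readers locate values by arithmetic rather than parsing per-value field tags. Here's a deliberately tiny sketch of that idea, assuming a single column of longs; the real format would carry multiple typed columns, row counts, and so on, and none of these names are Druid's:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LongFrame {
  // Illustrative sketch only: a fixed-width binary "frame" of long values.
  // Every value occupies exactly Long.BYTES bytes, so readers need no parsing.
  public static ByteBuffer write(long[] values) {
    ByteBuffer buf = ByteBuffer.allocate(Long.BYTES * values.length)
                               .order(ByteOrder.LITTLE_ENDIAN);
    for (long v : values) {
      buf.putLong(v);
    }
    buf.flip();
    return buf;
  }

  // Random access by row index is pure offset arithmetic.
  public static long read(ByteBuffer frame, int row) {
    return frame.getLong(row * Long.BYTES);
  }

  public static void main(String[] args) {
    ByteBuffer frame = write(new long[]{10, 20, 30});
    System.out.println(read(frame, 2)); // prints 30
  }
}
```

The same fixed layout is what makes spilling attractive: a frame can be written to disk and memory-mapped back without a decode step.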
After (3) the functionality will be usable for ingestion and data
management, but kind of obscure, because you'll need to embed a native Druid
query into an indexing service task. After (4) it will be easier to use, since
you can write SQL. After (5) it will be VERY easy to use! Then we can focus on
making it more robust and more performant, adding features, and improving ease
of use even further.
I mentioned that we've been prototyping some of the pieces of a multi-stage
query at Imply. This sequence of PRs represents a contribution of everything
we've prototyped so far. Of course, even after doing all of this, the
functionality wouldn't be well suited for low-latency queries. The indexing
service task system is not really designed for that. But there is a path
towards making low-latency queries happen in a way that shares substantial code
with this initial, task-oriented functionality. We'll be able to walk that path
in the Apache Druid codebase using these contributions as a base.
Very much looking forward to feedback on this work, and to integrating it into Druid!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]