gianm commented on issue #12262:
URL: https://github.com/apache/druid/issues/12262#issuecomment-1162615382
An update: we're planning to start opening PRs for the batch-query
version of the feature in the near future. This batch-query version runs via
the indexing service, and is suitable for ingestion and data-management queries
(INSERT and REPLACE, as described in #11929). Each PR is meant to cover one new
area: they're split up this way to aid review and discussion. Here's the
planned sequence.
1. A more reliable RPC client for task -> task and task -> Overlord
communication. This one is important for ensuring that the communication
between the servers involved in a query is really robust.
2. A binary format for transferring data between servers, and for spilling
data to disk, that is more efficient than the Smile format we currently use.
This one is important because of the large volume of data transferred between
servers. This would be in core. At first it would only be used by the query
task (see point 3), but I do think it would make sense to use it for more
things over time, like Historical -> Broker communication and spilling for
GroupBy queries.
3. An indexing service task that can run Druid queries ("query task"), that
can accept a target datasource describing where to put the results, and that
can handle "external" typed datasources. This would be able to run the kinds of
queries generated by the functionality in #11929 (i.e., the kind currently
being generated in CalciteInsertDmlTest and CalciteReplaceDmlTest). This would
likely be in an extension.
4. SQL bindings for the query task. This would involve adding an endpoint
that creates a query task rather than running a query using the
Broker-and-Historical-based stack. This change would enable an INSERT or
REPLACE query to actually execute as a query task. This would likely be in that
same extension.
5. Web console support for the new SQL syntax and endpoint.
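To make step 1 concrete: the core of a "more reliable" RPC client is usually a retry loop with capped exponential backoff around each call. The sketch below is illustrative only; the class and method names are made up for this comment, not Druid's actual client API. It just shows the retry behavior that lets task -> task and task -> Overlord calls survive transient failures:

```java
import java.util.concurrent.Callable;

public class RetryingCall {
  // Minimal sketch, assuming a generic retryable call; not Druid's real API.
  // Retries the call up to maxAttempts times with capped exponential backoff.
  public static <T> T callWithRetries(Callable<T> call, int maxAttempts) throws Exception {
    long backoffMs = 100;
    for (int attempt = 1; ; attempt++) {
      try {
        return call.call();
      } catch (Exception e) {
        if (attempt >= maxAttempts) {
          throw e; // out of attempts: propagate the last failure
        }
        Thread.sleep(backoffMs);
        backoffMs = Math.min(backoffMs * 2, 5_000); // cap the backoff at 5s
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Simulate a flaky endpoint that fails twice, then succeeds.
    int[] calls = {0};
    String result = callWithRetries(() -> {
      if (++calls[0] < 3) {
        throw new RuntimeException("connection reset");
      }
      return "ok";
    }, 5);
    System.out.println(result + " after " + calls[0] + " attempts"); // prints "ok after 3 attempts"
  }
}
```

A real client would also need to distinguish retryable failures (connection reset, service redeploying) from permanent ones, and to re-resolve server locations between attempts.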
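For step 2, the efficiency win over a general-purpose format like Smile comes from a fixed, positional layout: readers locate values by arithmetic rather than parsing per-value field tags. Here's a deliberately tiny sketch of that idea, assuming a single column of longs; the real format would carry multiple typed columns, row counts, and so on, and none of these names are Druid's:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LongFrame {
  // Illustrative sketch only: a fixed-width binary "frame" of long values.
  // Every value occupies exactly Long.BYTES bytes, so readers need no parsing.
  public static ByteBuffer write(long[] values) {
    ByteBuffer buf = ByteBuffer.allocate(Long.BYTES * values.length)
                               .order(ByteOrder.LITTLE_ENDIAN);
    for (long v : values) {
      buf.putLong(v);
    }
    buf.flip();
    return buf;
  }

  // Random access by row index is pure offset arithmetic.
  public static long read(ByteBuffer frame, int row) {
    return frame.getLong(row * Long.BYTES);
  }

  public static void main(String[] args) {
    ByteBuffer frame = write(new long[]{10, 20, 30});
    System.out.println(read(frame, 2)); // prints 30
  }
}
```

The same fixed layout is what makes spilling attractive: a frame can be written to disk and memory-mapped back without a decode step.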
After (3) the functionality will be usable for ingestion and data
management, but kind of obscure, because you'll need to embed a native Druid
query into an indexing service task. After (4) it will be easier to use, since
you can write SQL. After (5) it will be VERY easy to use! Then we can focus on
making it more robust and more performant, adding features, and improving ease
of use even further.
I mentioned that we've been prototyping some of the pieces of a multi-stage
query at Imply. This sequence of PRs represents a contribution of everything
we've prototyped so far. Of course, even after doing all of this, the
functionality wouldn't be well suited for low-latency queries. The indexing
service task system is not really designed for that. But there is a path
towards making low-latency queries happen in a way that shares substantial code
with this initial, task-oriented functionality. We'll be able to walk that path
in the Apache Druid codebase using these contributions as a base.
Very much looking forward to feedback on this work, and to integrating it into Druid!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]