gianm opened a new issue #8728: Initial join support
URL: https://github.com/apache/incubator-druid/issues/8728
 
 
   ## Motivation
   
   Druid aims to be a powerful analytical database, and join support is a very 
common request from the user community. Druid does support some related 
features today:
   
   - [Lookups](https://druid.apache.org/docs/latest/querying/lookups.html) 
enable certain star-schema use cases, and are similar to joins in that regard. 
But they are limited in that they can only support LEFT joins of 
fact-to-dimension (not right/inner/outer) and that they cannot support more 
than one value per key.
   - Druid SQL supports 
[semijoins](https://druid.apache.org/docs/latest/querying/sql#query-execution) 
through both the `WHERE x IN (SELECT ...)` syntax and traditional JOIN syntax. 
But it only supports one subquery per SQL query, and does not support negation 
("NOT IN").
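For illustration, the existing semijoin support can express a query like the 
following (table and column names hypothetical), but the equivalent "NOT IN" 
form would be rejected:

```
SELECT page, COUNT(*)
FROM wikipedia
WHERE user_id IN (SELECT user_id FROM top_users)
GROUP BY page
```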
   
   Real JOIN support would be more powerful, enabling even more star-schema and 
subquery-based use cases:
   
   - More flexibility for join types (LEFT/RIGHT/INNER/FULL OUTER).
   - More flexibility for join conditions.
   - Joining against more than one subquery at once.
   - Easier integration with third-party applications that generate JOIN 
queries for star schemas.
   - Open the door for joining two datasources together.
   
   ## Proposed changes
   
   The idea is to add a "join" datasource, expose it through SQL, and add 
machinery to brokers and data servers to allow hash-based equijoins of **zero 
or one "table" datasource** and **any number of "lookup", "inline", or "query" 
datasources**. As a side effect, these proposed changes will add the ability to 
query lookups directly, using a datasource of type "lookup".
   
   I think this subset of functionality is a good start, because it adds 
meaningful new capabilities and helps unify existing ones. Things that are not 
in scope for this initial proposal include: joining two "table" datasources to 
each other, non-equijoins, non-hash-based joins. I think these should all be 
implemented at some point, as future work. 
   
   There are four main areas to this proposal:
   
   1. SQL: add rules to plan joins, adding "lookup" schema.
   2. Native queries: add "join", "lookup", and "inline" datasources.
   3. Broker: must be able to translate anything that data servers can’t do 
into something they can do (e.g. evaluate a subquery and replace it with an 
"inline" datasource). Must also be able to fully evaluate queries that only use 
local datasources such as "lookup" and "inline".
   4. Data servers (historical, etc): must be able to evaluate joins of _one 
table_ onto _any number of lookup or inline datasources, or queries on top of 
those_.
   
   The next few sections expand on these areas.
   
   ### SQL
   
   1. Add lookups to SQL in a "lookup" schema.
   2. Allow any equijoins that involve zero or one normal datasource and any 
number of lookups or subqueries, through a new join-planning system.
   3. Remove semi-join specific planning code, since that should now be handled 
by the generic join planning code in (2) above.
   
   An example SQL query might be:
   
   ```
   SELECT products.value AS product_name, SUM(sales.revenue)
   FROM sales
   LEFT JOIN lookup.products ON sales.product_id = products.key
   GROUP BY products.value
   ```
   
   This query takes advantage of the fact that unqualified tables like sales 
are assumed to be normal datasources. Lookups are referenced as part of the 
lookup schema, like lookup.products.
   
   Multiple joins can be specified per SQL query. We will need to guide 
Calcite’s cost-based optimizer towards reordering them optimally. This may 
require more statistics than we currently possess, so adding these and 
improving join plans may end up being an area of future work.
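For illustration, a single SQL query could then join against both a lookup 
and a subquery at once (table names hypothetical):

```
SELECT products.value AS product_name, SUM(sales.revenue)
FROM sales
LEFT JOIN lookup.products ON sales.product_id = products.key
INNER JOIN (SELECT DISTINCT store_id FROM promotions) promo
  ON sales.store_id = promo.store_id
GROUP BY products.value
```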
   
   ### Native queries
   
   1. Add a "join" datasource type that represents a join of two other 
datasources. It’d include a left datasource, right datasource, condition (which 
I am thinking will be restricted to equality at first), and type (left, right, 
full outer, inner).
   2. Add "lookup" and "inline" datasources to provide things to join onto. 
These can be specified as inputs to a join, or they can be directly queried (a 
new capability for lookups!).
   3. Allow joining onto "query" datasources as well. To make this work, we’ll 
need to add a notion of a "standard translation" of results from certain query 
types into flat schemas that we can offer column selectors on top of. There may 
be more than one way to do this, since certain query types (notably, topN and 
scan) return nested results in some cases. We could do this by adding a new 
QueryToolChest method.
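As a sketch of the direct-query capability from (2), a scan query over a 
lookup might look like the following (the exact field names are illustrative, 
not final):

```
{
  "queryType": "scan",
  "dataSource": {
    "type": "lookup",
    "lookupName": "products"
  },
  "intervals": ["0000-01-01/3000-01-01"],
  "columns": ["key", "value"]
}
```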
   
   The rows coming out of a join datasource would be the result of the join. 
Any query type could use a join datasource without being aware of the fact that 
joins exist.
   
   Join datasources can be nested within each other. Unlike SQL, native query 
evaluation will not reorder joins. It will execute them in the order that the 
join tree is provided.
   
   We probably will not allow joining onto "table" datasources, except as the 
extreme left-hand side of a join.
   
   In order to protect against column name ambiguity (what if the left and 
right side have a column of the same name?), I propose adding a "rightPrefix" 
parameter to the join datasource. This would be prefixed to every column name 
coming from the right side, and should be chosen by the caller to be something 
that won’t conflict with any left-side columns. Druid SQL will choose one 
automatically when planning a SQL join.
   
   The join datasource used by the earlier SQL query, above, would be:
   
   ```
   "dataSource": {
     "type": "join",
     "left": "sales",
     "right": {
       "type": "lookup",
       "lookupName": "products"
     },
     "rightPrefix": "foobar",
     "joinType": "left",
     "condition": {
       "leftColumn": "product_id",
    "rightColumn": "key"
     }
   }
   ```
   
   ### Broker behavior
   
   The following technique should be implemented by CachingClusteredClient. The 
idea is to either evaluate the query locally, or else get it into a format that 
data servers can handle (see next section).
   
   1. Analyze the datasource to find all tables and subqueries on tables.
   2. Validate that there is at most one table datasource that is not wrapped 
in a subquery and, if there is one, that it is in the leftmost leaf position. 
If this validation fails, return an error.
   3. Evaluate all subqueries on tables ("query" datasource with "table" 
child), and transform the master datasource by replacing those subquery 
datasources with "inline" datasources. Do not do this on "table" datasources 
that are not wrapped in a query. Keep a running counter of how many rows have 
been materialized in this way, and return an error if it grows too large.
   4. If there is a table in the leftmost leaf position, send the query down to 
data servers without further modifications beyond those done in step (3). Use 
the single table datasource to determine which data servers and which segments 
to query.
   5. If there are no table datasources remaining after the transformations in 
(3), evaluate the query locally on the Broker: create a virtual Segment on 
top of the local data, and run the appropriate query engine on that Segment.
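The subquery-inlining step (3) can be sketched as follows. This is an 
illustrative Python sketch, not actual Broker code; the `run_subquery` helper 
and the specific row limit are assumptions:

```python
# Illustrative sketch of Broker steps (1)-(3): materialize subqueries on
# tables as "inline" datasources, enforcing a row limit. `run_subquery` is
# a hypothetical helper that evaluates a native query and returns its rows.

MAX_INLINE_ROWS = 100_000  # assumed materialization limit

def inline_subqueries(datasource, run_subquery, counter):
    """Recursively replace "query"-on-"table" datasources with "inline" ones."""
    kind = datasource["type"]
    if kind == "query" and datasource["query"]["dataSource"]["type"] == "table":
        rows = run_subquery(datasource["query"])
        counter[0] += len(rows)
        if counter[0] > MAX_INLINE_ROWS:
            raise RuntimeError("too many rows materialized for inlining")
        return {"type": "inline", "rows": rows}
    if kind == "join":
        return dict(
            datasource,
            left=inline_subqueries(datasource["left"], run_subquery, counter),
            right=inline_subqueries(datasource["right"], run_subquery, counter),
        )
    # "table", "lookup", and "inline" leaves pass through untouched.
    return datasource
```

A join whose right side is a subquery on a table comes back with that branch 
replaced by an inline datasource, while a bare table on the left is left 
alone for step (4).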
   
   ### Data server behavior
   
   The following technique should be implemented by ServerManager (historical) 
and SinkQuerySegmentWalker (indexer) when a join datasource is encountered.
   
   1. Analyze the join datasource to find the "primary" datasource: the 
leftmost leaf datasource. This must be a table datasource, and will be the base 
for the join. (Note that we are assuming here that the Broker would only send 
down the query if there were exactly one table.) Then enumerate all other leaf 
datasources in depth-first pre-order; join clauses will be applied in this 
order.
   2. Create a hash table for each non-primary leaf datasource, unless it 
already exists (e.g. no need to create a hash table for a lookup, but we might 
create one for a query on top of a lookup). If this is impossible for any 
datasource, like if one of them is a regular "table" datasource, return an 
error.
   3. Create a virtual Segment for each Segment of the primary datasource that 
wraps up all the joining work, and returns a StorageAdapter + 
ColumnSelectorFactory representing the results of the join. Pass this to query 
engines as normal.
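Steps (2) and (3) amount to a hash equijoin. A minimal Python sketch of the 
technique follows (illustrative only; the real implementation would operate 
on column selectors over Segments, not dicts):

```python
from collections import defaultdict

def hash_left_equijoin(left_rows, right_rows, left_col, right_col, right_prefix):
    """Build a hash table on the right side, then stream left rows through it,
    prefixing right-side column names to avoid ambiguity (LEFT join shown)."""
    table = defaultdict(list)
    right_cols = set()
    for row in right_rows:
        table[row[right_col]].append(row)
        right_cols.update(row)
    for lrow in left_rows:
        # A left row with no match still produces one output row, with nulls
        # for every right-side column.
        for rrow in table.get(lrow[left_col], [None]):
            joined = dict(lrow)
            for col in right_cols:
                joined[right_prefix + col] = rrow[col] if rrow else None
            yield joined
```

For the sales/products example above, each sales row would gain prefixed "key" 
and "value" columns from the matching lookup entry, or nulls if nothing 
matches.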
   
   ## Rationale
   
   Some other alternatives considered were:
   
   - A "join" query type. @jihoonson did work towards this in 
https://github.com/apache/incubator-druid/pull/4118, but the patch was not 
merged. Reconsidering this design now, I think a "join" datasource would work 
better than a query type, because it composes more nicely. We can expose the 
results of a join datasource to query engines the same way as we expose a table 
(via Segment + StorageAdapter). It is naturally lazy (no need to compute 
columns the query won’t use) and will work with all existing query engines 
without a need to modify them.
   - Joining onto broadcast datasources instead of lookups. I think we should 
do both over time, but chose to start with lookups because their usage is more 
prevalent in Druid today. I am hoping to allow people to use this popular 
feature through a more familiar (and powerful!) SQL syntax.
   
   ## Operational impact
   
   None expected.
   
   ## Test plan
   
   Add lots of unit tests in the druid-sql and druid-processing modules.
   
   ## Future work
   
   Out of scope for this proposal, but would be nice in the future:
   
   - Adding support for non-equijoins.
   - Adding support for joining onto broadcast datasources, not just lookups.
   - Adding support for joining two distributed table datasources. (This one 
gets complex quickly and there are a lot of sub-cases we want to consider.)
   
