[GitHub] [druid] gianm opened a new pull request #9111: Add HashJoinSegment, a virtual segment for joins.

GitBox Mon, 30 Dec 2019 08:20:16 -0800

gianm opened a new pull request #9111: Add HashJoinSegment, a virtual segment for joins.
URL: https://github.com/apache/druid/pull/9111
 
 
   An initial step towards #8728. This patch adds enough functionality to implement a joining
   cursor on top of a normal datasource. It does not include enough to actually do a query. For
   that, future patches will need to wire this low-level functionality into the query language.
   
   The main files in this patch:
   
   - HashJoinSegment: The virtual join Segment described in #8728.
   - HashJoinSegmentStorageAdapter: Storage adapter for that segment; "makeCursors" is the
     interesting part.
   - HashJoinEngine: Contains JoinColumnSelectorFactory and JoinCursor, which together implement
     the row-by-row logic of a join.
   - LookupJoinable: Allows joining onto lookups.
   - IndexedTableJoinable: A more flexible Joinable that can have multiple columns in general,
     including multiple key columns. I expect this will be used for joining onto subquery results
     in the future. It may even be used as a sort of super-lookup.
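
   To make "the row-by-row logic of a join" concrete, here is a minimal sketch of a generic
   hash join: build an index over the right-hand rows, then probe it once per left-hand row.
   It is not the code in this patch. HashJoinSketch, buildIndex, join, and the map-based row
   representation are all hypothetical names chosen for illustration, and the "j." prefix in
   the usage note is an assumed example value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: none of these names are Druid classes.
class HashJoinSketch {
  // Build phase: index right-hand rows by the value of their key column.
  static Map<Object, List<Map<String, Object>>> buildIndex(
      List<Map<String, Object>> rightRows, String keyColumn) {
    Map<Object, List<Map<String, Object>>> index = new HashMap<>();
    for (Map<String, Object> row : rightRows) {
      Object key = row.get(keyColumn);
      if (key != null) { // in this sketch, null keys never match anything
        index.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
      }
    }
    return index;
  }

  // Probe phase: walk left-hand rows one at a time and emit one joined row per match.
  // Right-hand columns are exposed under the given prefix. When leftOuter is true,
  // unmatched left-hand rows are emitted without any right-hand columns.
  static List<Map<String, Object>> join(
      List<Map<String, Object>> leftRows, String leftKey,
      Map<Object, List<Map<String, Object>>> index, String prefix, boolean leftOuter) {
    List<Map<String, Object>> out = new ArrayList<>();
    for (Map<String, Object> left : leftRows) {
      Object key = left.get(leftKey);
      List<Map<String, Object>> matches = key == null ? null : index.get(key);
      if (matches == null) {
        if (leftOuter) {
          out.add(new HashMap<>(left));
        }
        continue;
      }
      for (Map<String, Object> right : matches) {
        Map<String, Object> joined = new HashMap<>(left);
        right.forEach((col, val) -> joined.put(prefix + col, val));
        out.add(joined);
      }
    }
    return out;
  }
}
```

   For example, joining left rows keyed on "country" against an index built on column "k" with
   prefix "j." yields rows whose right-hand value column "v" appears as "j.v". A real cursor
   would stream rows lazily rather than materialize a list.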
   
   Some supporting elements:
   
   - Added a "withDimension" method to DimensionSpec so prefixed dimensions can be rewritten to
     remove their prefixes.
   - Added "canIterate" and "iterable" to LookupExtractor, necessary for right and full joins
     on lookups. It will also be useful for direct queries on lookups in the future.
   - Removed the "getSegmentIdentifier" method from StorageAdapter. It was not being used.
   - Moved RowBasedColumnSelectorFactory out of the groupBy engine, reflecting the fact that it
     is already used by other, non-groupBy code. Also, split out the RowAdapter interface, which
     is now used by RowBasedIndexedTable as well.
   - Renamed VectorColumnStrategizer to VectorColumnProcessorFactory (see below).
   - Added a "ColumnProcessors" utility class and "ColumnProcessorFactory" interface that is
     currently only used to make join condition matchers in IndexedTableJoinMatcher. It wasn't
     strictly necessary, but I think it's designed better than ColumnSelectorStrategyFactory,
     and could replace it in the future. It's similar in design to VectorColumnProcessorFactory.
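
   The prefix rewriting mentioned for "withDimension" can be pictured with a small sketch. This
   is an assumption about the general idea, not the DimensionSpec API: PrefixSketch and unprefix
   are hypothetical names, and "j." is an assumed example prefix. The helper decides whether a
   column name belongs to the joined side and, if so, returns it with the prefix removed.

```java
import java.util.Optional;

// Hypothetical helper, not part of Druid.
class PrefixSketch {
  // Returns the unprefixed column name if columnName starts with a non-empty prefix
  // and has at least one character after it; otherwise returns empty, meaning the
  // column is treated as belonging to the left-hand (base) side.
  static Optional<String> unprefix(String columnName, String prefix) {
    if (!prefix.isEmpty()
        && columnName.startsWith(prefix)
        && columnName.length() > prefix.length()) {
      return Optional.of(columnName.substring(prefix.length()));
    }
    return Optional.empty();
  }
}
```

   With prefix "j.", the name "j.v" maps to "v" on the joined side, while "country" is left
   alone and read from the base side.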
   
   Next steps:
   
   - Implement the rest of "data server behavior" from
     https://github.com/apache/druid/issues/8728 (this patch is number 3, the virtual join
     Segment).
   - Implement "broker behavior" from https://github.com/apache/druid/issues/8728.
   - Implement SQL planning.
   - Various performance optimizations: filter push-down, deferred lookupName during condition
     matching / row retrieval, vectorized joins.
   - Fix handling of right-joins; see comment in HashJoinEngine: "Warning! The way this engine
     handles 'righty' joins is flawed: it generates the 'remainder' rows per-segment, but this
     should really be done globally. This should be improved in the future."
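
   One reading of the right-join warning above, sketched under simplifying assumptions: if each
   segment independently emits the right-hand keys it failed to match, a key matched in one
   segment but not another is still emitted as "unmatched", and a key matched nowhere is
   emitted once per segment. RemainderSketch and its methods are hypothetical names; segments
   are reduced to lists of left-hand key values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration, not Druid code.
class RemainderSketch {
  // Per-segment: every segment separately emits the right-hand keys it did not match.
  static List<String> perSegmentRemainder(List<List<String>> segments, Set<String> rightKeys) {
    List<String> remainder = new ArrayList<>();
    for (List<String> segment : segments) {
      Set<String> unmatched = new TreeSet<>(rightKeys); // sorted for stable output
      unmatched.removeAll(segment);
      remainder.addAll(unmatched);
    }
    return remainder;
  }

  // Global: a right-hand key is in the remainder only if no segment matched it.
  static List<String> globalRemainder(List<List<String>> segments, Set<String> rightKeys) {
    Set<String> unmatched = new TreeSet<>(rightKeys);
    for (List<String> segment : segments) {
      unmatched.removeAll(segment);
    }
    return new ArrayList<>(unmatched);
  }
}
```

   With segments [US, FR] and [US, DE] and right-hand keys {US, FR, DE, JP}, the per-segment
   version returns [DE, JP, FR, JP], while the global version returns only [JP].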


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

