gianm commented on code in PR #14237:
URL: https://github.com/apache/druid/pull/14237#discussion_r1190215126
##########
processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinable.java:
##########
@@ -105,7 +106,7 @@ public ColumnValuesWithUniqueFlag
getNonNullColumnValues(String columnName, fina
try (final IndexedTable.Reader reader =
table.columnReader(columnPosition)) {
// Sorted set to encourage "in" filters that result from this method to
do dictionary lookups in order.
// The hopes are that this will improve locality and therefore improve
performance.
- final Set<String> allValues = createValuesSet();
+ final Set<String> allValues = createOrderedValuesSet();
Review Comment:
FYI, InDimFilter wants a sorted set internally, so if you give it a
non-sorted set, it'll need to copy it, which slows things down. So we really do
want to keep this being a sorted set.
Also, the comment is inaccurate, since InDimFilter always uses sorted sets
internally, since #12517. (See ValuesSet.) Could you update it? Something like
"Use a SortedSet so InDimFilter doesn't need to create its own".
##########
processing/src/main/java/org/apache/druid/segment/join/JoinableFactoryWrapper.java:
##########
@@ -178,15 +183,37 @@ static JoinClauseToFilterConversion convertJoinToFilter(
}
numValues -= columnValuesWithUniqueFlag.getColumnValues().size();
- filters.add(Filters.toFilter(new InDimFilter(leftColumn,
columnValuesWithUniqueFlag.getColumnValues())));
+ // For each column value which are received in order we increment the
index
+ // and add it in the appropriate index in the map
+ int c = 0;
+ for (String val : columnValuesWithUniqueFlag.getColumnValues()) {
Review Comment:
The logic here is inefficient since it creates `filterMap` full of
`SelectorDimFilter` even if it ends up not being used (in the case where
`clause.getCondition().getEquiConditions().size() == 1`). We need to be careful
about this stuff, since the filters can be quite large.
Also I am not sure if it's correct. It looks like this is doing a zip
between all
`clause.getJoinable().getNonNullColumnValues(condition.getRightColumn(),
numValues)` for each equi-join condition. However, because
`getNonNullColumnValues` returns _unique_ values, the different column values
might not line up. I think if you add a test where the column values aren't all
unique, you'll see this effect.
IMO, it's OK to fix this bug by limiting the join-to-filter optimization to
the case where there is a single equi-join. It's not clear that the
optimization is actually helpful when there's more than one equi-join, since we
have to use a bunch of selectors with a bunch of boolean operations, rather
than an "in" filter. So, I'd suggest doing that.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]