[GitHub] [druid] gianm commented on a diff in pull request #14237: Updates to filter processing for inner query in Joins

via GitHub Wed, 10 May 2023 10:49:09 -0700


gianm commented on code in PR #14237:
URL: https://github.com/apache/druid/pull/14237#discussion_r1190215126



##########
processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinable.java:
##########
@@ -105,7 +106,7 @@ public ColumnValuesWithUniqueFlag 
getNonNullColumnValues(String columnName, fina
     try (final IndexedTable.Reader reader = 
table.columnReader(columnPosition)) {
       // Sorted set to encourage "in" filters that result from this method to 
do dictionary lookups in order.
       // The hopes are that this will improve locality and therefore improve 
performance.
-      final Set<String> allValues = createValuesSet();
+      final Set<String> allValues = createOrderedValuesSet();

Review Comment:
   FYI, InDimFilter wants a sorted set internally, so if you give it a 
non-sorted set, it'll need to copy it, which slows things down. So we really do 
want to keep this being a sorted set.
   
   Also, the comment is inaccurate, since InDimFilter always uses sorted sets 
internally, since #12517. (See ValuesSet.) Could you update it? Something like 
"Use a SortedSet so InDimFilter doesn't need to create its own".



##########
processing/src/main/java/org/apache/druid/segment/join/JoinableFactoryWrapper.java:
##########
@@ -178,15 +183,37 @@ static JoinClauseToFilterConversion convertJoinToFilter(
         }
 
         numValues -= columnValuesWithUniqueFlag.getColumnValues().size();
-        filters.add(Filters.toFilter(new InDimFilter(leftColumn, 
columnValuesWithUniqueFlag.getColumnValues())));
+        // For each column value which are received in order we increment the 
index
+        // and add it in the appropriate index in the map
+        int c = 0;
+        for (String val : columnValuesWithUniqueFlag.getColumnValues()) {

Review Comment:
   The logic here is inefficient since it creates `filterMap` full of 
`SelectorDimFilter` even if it ends up not being used (in the case where 
`clause.getCondition().getEquiConditions().size() == 1`). We need to be careful 
about this stuff, since the filters can be quite large.
   
   Also I am not sure if it's correct. It looks like this is doing a zip 
between all 
`clause.getJoinable().getNonNullColumnValues(condition.getRightColumn(), 
numValues)` for each equi-join condition. However, because 
`getNonNullColumnValues` returns _unique_ values, the different column values 
might not line up. I think if you add a test where the column values aren't all 
unique, you'll see this effect.
   
   IMO, it's OK to fix this bug by limiting the join-to-filter optimization to 
the case where there is a single equi-join. It's not clear that the 
optimization is actually helpful when there's more than one equi-join, since we 
have to use a bunch of selectors with a bunch of boolean operations, rather 
than an "in" filter. So, I'd suggest doing that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] gianm commented on a diff in pull request #14237: Updates to filter processing for inner query in Joins

Reply via email to