gianm commented on issue #7904: SQL planning slow with huge IN filters
URL: 
https://github.com/apache/incubator-druid/issues/7904#issuecomment-509011366
 
 
   I did some looking into this and found not just one, but a few different 
root causes.
   
   1. Calcite has an O(N^2) OR simplification step in `RexSimplify.simplifyOr`: 
https://issues.apache.org/jira/browse/CALCITE-3178. This was the biggest 
contributor, and a hacky workaround of simply disabling this step cut the 
planning time down to ~8s.
   2. Druid's `Expressions.toSimpleLeafFilter` unconditionally attempts to 
check every "leaf" filter (non-boolean) to see if it's a time floor filter, for 
purposes of potentially converting `FLOOR(__time TO DAY) = X` into a range 
filter. This imposes a few seconds of overhead, which could be eliminated by 
only checking leaf filters that look like they _might_ be time floors (their 
operator matches `FLOOR`, `TIME_FLOOR`, or `CAST(x AS DATE)`).
   3. Druid's `CombineAndSimplifyBounds` builds a `TreeRangeSet<BoundValue>` 
out of every equality or bound leaf filter, to see if they can be simplified. 
Algorithmic complexity looks fine from a quick glance at the method, but there 
is overhead imposed by the fact that `BoundValue.compareTo`, which is called 
internally and often by the TreeRangeSet, always compares bounds as Strings. 
For numeric bounds involves a lot of wasteful converting of numbers and strings 
back and forth. This is only an issue when the bounds are numeric, but, in the 
case of my test query, they were (it was a long-typed column).
   4. Calcite's parser also seems to take a while (1–2s) to read through the 
SQL. Possibly #6974 would help with that, if it allows constructs like `col IN 
(?)` where `?` is an array.
   
   Profiling indicates that if all of the above were addressed, query planning 
time should come down to more acceptable levels.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to