Ivan Vergiliev created SPARK-27105:
--------------------------------------

             Summary: Prevent exponential complexity in ORC `createFilter`  
                 Key: SPARK-27105
                 URL: https://issues.apache.org/jira/browse/SPARK-27105
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Ivan Vergiliev


`OrcFilters.createFilters` currently has complexity that's exponential in the 
height of the filter tree. There are multiple places in Spark that try to 
prevent the generation of skewed trees so as to not trigger this behaviour, for 
example:
- `org.apache.spark.sql.catalyst.parser.AstBuilder.visitLogicalBinary` combines 
a number of binary logical expressions into a balanced tree.
- https://github.com/apache/spark/pull/22313 introduced a change to 
`OrcFilters` to create a balanced tree instead of a skewed tree.

However, the underlying exponential behaviour can still be triggered by code 
paths that don't go through any of the tree balancing methods. For example, if 
one generates a tree of `Column`s directly in user code, there's nothing in 
Spark that automatically balances that tree and, hence, skewed trees hit the 
exponential behaviour. We have hit this in production with jobs mysteriously 
taking hours on the Spark driver with no worker activity, with as few as ~30 OR 
filters.

I have a fix locally that makes the underlying logic have linear complexity 
instead of exponential complexity. With this fix, the code can handle thousands 
of filters in milliseconds. I'll send a PR with the fix soon.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to