[ https://issues.apache.org/jira/browse/SPARK-27105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan reassigned SPARK-27105:
-----------------------------------

    Assignee: Ivan Vergiliev

> Prevent exponential complexity in ORC `createFilter`
> -----------------------------------------------------
>
>                 Key: SPARK-27105
>                 URL: https://issues.apache.org/jira/browse/SPARK-27105
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Ivan Vergiliev
>            Assignee: Ivan Vergiliev
>            Priority: Major
>              Labels: performance
>
> `OrcFilters.createFilter` currently has complexity that's exponential in the
> height of the filter tree. There are multiple places in Spark that try to
> prevent the generation of skewed trees so as not to trigger this behaviour,
> for example:
> - `org.apache.spark.sql.catalyst.parser.AstBuilder.visitLogicalBinary`
> combines a number of binary logical expressions into a balanced tree.
> - https://github.com/apache/spark/pull/22313 introduced a change to
> `OrcFilters` to create a balanced tree instead of a skewed tree.
> However, the underlying exponential behaviour can still be triggered by code
> paths that don't go through any of the tree-balancing methods. For example,
> if one builds a tree of `Column`s directly in user code, nothing in Spark
> automatically balances that tree, so skewed trees still hit the exponential
> behaviour. We have hit this in production with jobs mysteriously taking hours
> on the Spark driver with no worker activity, triggered by as few as ~30 OR
> filters.
> I have a fix locally that makes the underlying logic linear instead of
> exponential. With this fix, the code handles thousands of filters in
> milliseconds. I'll send a PR with the fix soon.