wankunde opened a new pull request #32514: URL: https://github.com/apache/spark/pull/32514
### What changes were proposed in this pull request?

This PR tries to improve `InferFiltersFromConstraints` performance by avoiding generating too many constraints. For example:

```scala
test("Expression explosion when analyze test") {
  RuleExecutor.resetMetrics()
  Seq((1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
    .toDF("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n")
    .write.saveAsTable("test")
  val df = spark.table("test")
  val df2 = df.filter("a+b+c+d+e+f+g+h+i+j+k+l+m+n > 100")
  val df3 = df2.select('a as 'a1, 'b as 'b1, 'c as 'c1, 'd as 'd1,
    'e as 'e1, 'f as 'f1, 'g as 'g1, 'h as 'h1, 'i as 'i1, 'j as 'j1,
    'k as 'k1, 'l as 'l1, 'm as 'm1, 'n as 'n1)
  val df4 = df3.join(df2, df3("a1") === df2("a"))
  df4.explain(true)
  logWarning(RuleExecutor.dumpTimeSpent())
}
```

### Why are the changes needed?

To improve `InferFiltersFromConstraints` performance.

Before this PR:

```
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 1187
Total time: 5.022786805 seconds

Rule                                                                 Effective Time / Total Time   Effective Runs / Total Runs
org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints  4528820409 / 4529498144       1 / 2
org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog         0 / 38907142                  0 / 13
Combined[org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$InConversion, org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings, org.apache.spark.sql.catalyst.analysis.DecimalPrecision, org.apache.spark.sql.catalyst.analysis.TypeCoercion$BooleanEquality, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$FunctionArgumentConversion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$ConcatCoercion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$MapZipWithCoercion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$EltCoercion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$CaseWhenCoercion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$IfCoercion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$StackCoercion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$Division, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$IntegralDivision, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$ImplicitTypeCasts, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$DateTimeOperations, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$WindowFrameCoercion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$StringLiteralCoercion]  0 / 30035714  0 / 13
org.apache.spark.sql.execution.datasources.SchemaPruning             0 / 20202429                  0 / 2
org.apache.spark.sql.execution.datasources.PreprocessTableCreation   0 / 15898208                  0 / 8
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences    7497131 / 15098789            2 / 13
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations     11633805 / 13755605           1 / 13
```

After this PR:

```
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 1187
Total time: 0.559125361 seconds

Rule                                                                 Effective Time / Total Time   Effective Runs / Total Runs
org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints  44387973 / 45044872           1 / 2
org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog         0 / 40652311                  0 / 13
Combined[org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$InConversion, org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings, org.apache.spark.sql.catalyst.analysis.DecimalPrecision, org.apache.spark.sql.catalyst.analysis.TypeCoercion$BooleanEquality, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$FunctionArgumentConversion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$ConcatCoercion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$MapZipWithCoercion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$EltCoercion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$CaseWhenCoercion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$IfCoercion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$StackCoercion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$Division, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$IntegralDivision, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$ImplicitTypeCasts, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$DateTimeOperations, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$WindowFrameCoercion, org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$StringLiteralCoercion]  0 / 30068620  0 / 13
org.apache.spark.sql.execution.datasources.SchemaPruning             0 / 20810353                  0 / 2
org.apache.spark.sql.execution.datasources.PreprocessTableCreation   0 / 19485336                  0 / 8
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences    8476540 / 16209891            2 / 13
org.apache.spark.sql.execution.datasources.FindDataSourceTable       10826285 / 14306609           1 / 13
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations     11935867 / 14163328           1 / 13
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
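As background for why such constraint explosion can get expensive: once the projection introduces an alias for every column, each attribute referenced by the filter predicate has two equivalent forms (e.g. `a` and `a1`), so exhaustively substituting equivalents into the predicate can produce up to 2^14 = 16384 variants of a single constraint. The sketch below is a hypothetical, standalone illustration of that combinatorics only; the `ConstraintExplosion` object and its helpers are invented for this example and are not Catalyst code.

```scala
// A back-of-the-envelope model of constraint substitution, not Spark code.
object ConstraintExplosion {
  // Attribute names referenced by the example filter: "a" through "n".
  val attrs: Seq[String] = ('a' to 'n').map(_.toString)

  // Each attribute is equivalent to exactly one alias, e.g. "a" -> "a1",
  // mirroring the `'a as 'a1` projection in the test above.
  val aliases: Map[String, String] = attrs.map(a => a -> (a + "1")).toMap

  // Enumerate every substitution of the attribute list: at each position,
  // either keep the original name or swap in its alias. The result size
  // doubles per attribute, i.e. 2^n sequences for n attributes.
  def substitutions(names: Seq[String]): Seq[Seq[String]] =
    names.foldLeft(Seq(Seq.empty[String])) { (acc, n) =>
      acc.flatMap(prefix => Seq(prefix :+ n, prefix :+ aliases(n)))
    }

  def main(args: Array[String]): Unit = {
    val variants = substitutions(attrs)
    // 2 choices for each of 14 attributes: 2^14 = 16384 predicate variants.
    println(s"candidate constraints for one predicate: ${variants.size}")
  }
}
```

This is only a model of the growth rate, but it matches the shape of the problem: the optimizer's time is dominated by enumerating (and then simplifying away) an exponential family of equivalent predicates, which is exactly what this PR avoids generating in the first place.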