[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14044 Thank you for review and merging, @liancheng , @cloud-fan , @hvanhovell , and @naliazheli ! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14044 Merged to master. Thanks!
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14044 LGTM
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14044 Merged build finished. Test PASSed.
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14044 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61744/
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14044

**[Test build #61744 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61744/consoleFull)** for PR 14044 at commit [`45eb28a`](https://github.com/apache/spark/commit/45eb28af51203a97c22c8b9022cb38ac0451d401).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14044 LGTM pending Jenkins.
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14044

I have now updated the title and description of the PR/JIRA. The only patch in this PR is the following one-word change.

```
-new Dataset[Row](sparkSession, logicalPlan, RowEncoder(qe.analyzed.schema))
+new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
```

Thank you all for the fast review and advice. In the first commit, I thought it was important to remove all the repeated logic. Now only the minimum meaningful code change remains.
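Why a one-word change can halve the analysis time may be easier to see in a toy model. The thread traces the duplicated work to `sparkSession.sessionState.executePlan(logicalPlan)`, which builds a fresh query execution for a plan that has already been analyzed once. The sketch below is only an illustration under that assumption: `ToyQueryExecution`, `ToyDataset`, `Plan`, and `analyzerRuns` are hypothetical stand-ins, not Spark classes, and the lazy `analyzed` field is an assumed simplification.

```scala
// Toy model only: every name here is hypothetical and none of it is Spark's code.
object ReanalysisSketch {
  // Counts how often the simulated analyzer runs.
  private var analyzerRuns = 0

  final case class Plan(text: String)

  // Stand-in for a query execution that analyzes its plan lazily, at most once.
  final class ToyQueryExecution(val logical: Plan) {
    lazy val analyzed: Plan = {
      analyzerRuns += 1
      Plan(s"analyzed(${logical.text})")
    }
  }

  // Stand-in for a dataset built from an existing execution object.
  final class ToyDataset(val qe: ToyQueryExecution) {
    // Building from a bare plan wraps it in a NEW execution object,
    // so any analysis done on an earlier execution is not reused.
    def this(plan: Plan) = this(new ToyQueryExecution(plan))
    def schema: String = qe.analyzed.text
  }

  // "Before": analyze via qe, then hand the raw plan to the dataset.
  def fromPlan(plan: Plan): ToyDataset = {
    val qe = new ToyQueryExecution(plan)
    require(qe.analyzed.text.nonEmpty) // first analysis
    val ds = new ToyDataset(plan)      // creates a second execution object
    require(ds.schema.nonEmpty)        // second analysis of the same plan
    ds
  }

  // "After": hand the already analyzed execution to the dataset.
  def fromExecution(plan: Plan): ToyDataset = {
    val qe = new ToyQueryExecution(plan)
    require(qe.analyzed.text.nonEmpty) // first and only analysis
    val ds = new ToyDataset(qe)
    require(ds.schema.nonEmpty)        // reuses the cached result
    ds
  }

  // Resets the counter, builds one dataset, and reports the analyzer runs.
  def countRuns(build: Plan => ToyDataset): Int = {
    analyzerRuns = 0
    build(Plan("SELECT ..."))
    analyzerRuns
  }

  def main(args: Array[String]): Unit = {
    println(s"passing the plan:      ${countRuns(fromPlan)} analyzer runs")
    println(s"passing the execution: ${countRuns(fromExecution)} analyzer runs")
  }
}
```

In this model the saving comes entirely from reusing the cached result, which matches the observation later in the thread that the second `qe.assertAnalyzed()` call itself is not the expensive part.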
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14044 **[Test build #61744 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61744/consoleFull)** for PR 14044 at commit [`45eb28a`](https://github.com/apache/spark/commit/45eb28af51203a97c22c8b9022cb38ac0451d401).
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14044

Hi, @cloud-fan , @hvanhovell , @liancheng . Following @cloud-fan 's advice, I made the change below, and it turns out that the difference is not noticeable.

```
-new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema), skipAnalysis = true)
+new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
```

Exactly as you all said, the second call of `qe.assertAnalyzed()` is not the root cause. The only difference lies in `sparkSession.sessionState.executePlan(logicalPlan)`. I'll update the PR.
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14044 Thank you for the review, @liancheng . I'm sure that the performance of the Analyzer needs to be improved. But in any case, the cost of the analyzer cannot be zero, so we should skip the redundant analysis. IMO, that idea sounds orthogonal to this PR, so I asked @hvanhovell to make a PR for that.
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14044 Agree with @hvanhovell. Analysis should never take so long for such a simple query. We should avoid duplicated analysis work, but fixing the performance issue(s) within the analyzer seems likely to be more fruitful.
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14044 Yep. I agree. Could you make a PR for that? I think we also have some optimization points about that.
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/14044 `LogicalPlan.resolve(...)` uses a linear search to resolve a column. This is pretty bad if you are trying to look up 4000 columns 4 times (filter, project, aggregate, sort): 4000 * (4000 / 2) * 4 = 32,000,000 lookups.
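The estimate above can be checked with a small counting sketch. It is a toy model, not Spark's resolver: `linearLookup` and `totalLinearComparisons` are hypothetical helpers that scan a plain sequence of column names, and the hash-map variant is just one possible alternative, not something proposed in this thread.

```scala
// Toy model only: hypothetical helpers, not LogicalPlan.resolve.
object ResolveCostSketch {
  // Scans the attributes in order; returns (index, comparisons made).
  def linearLookup(attrs: IndexedSeq[String], name: String): (Int, Long) = {
    var comparisons = 0L
    var i = 0
    while (i < attrs.length) {
      comparisons += 1
      if (attrs(i) == name) return (i, comparisons)
      i += 1
    }
    (-1, comparisons)
  }

  // Resolves each of the n columns once per operator using linear search
  // and returns the total number of name comparisons.
  def totalLinearComparisons(n: Int, operators: Int): Long = {
    val attrs = (0 until n).map(i => s"c$i")
    var total = 0L
    for (_ <- 0 until operators; name <- attrs) {
      total += linearLookup(attrs, name)._2
    }
    total
  }

  // With a prebuilt name -> index map, each resolution is a single probe.
  def totalHashProbes(n: Int, operators: Int): Long = {
    val index = (0 until n).map(i => s"c$i" -> i).toMap
    var probes = 0L
    for (_ <- 0 until operators; i <- 0 until n) {
      if (index.contains(s"c$i")) probes += 1
    }
    probes
  }

  def main(args: Array[String]): Unit = {
    // 4 operators: filter, project, aggregate, sort
    println(s"linear comparisons: ${totalLinearComparisons(4000, 4)}")
    println(s"hash probes:        ${totalHashProbes(4000, 4)}")
  }
}
```

Counting exactly, finding column i costs i + 1 comparisons, so the linear total is 4 * 4000 * 4001 / 2 = 32,008,000, which agrees with the rounded 32,000,000 figure; the map-based variant needs 16,000 probes for the same work.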
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14044 Interesting result. We definitely need to take a look at `ResolveReferences`-related stuff.
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14044 Oh, thank you for the advice about `dumpTimeSpent`. I hadn't looked at it that way. These days I'm investigating how large queries behave, so this analysis is very helpful for me. Thank you so much, @hvanhovell .
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/14044

@dongjoon-hyun my point is that analysis should not be taking 12 seconds at all. You can see how much time is spent in a rule if you add the following lines of code to your example:

```scala
import org.apache.spark.sql.catalyst.rules.RuleExecutor
println(RuleExecutor.dumpTimeSpent)
```

This yields the following result (timing in ns):

```
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences 18784486408
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions 505619796
org.apache.spark.sql.catalyst.analysis.TypeCoercion$PropagateTypes 195027905
org.apache.spark.sql.catalyst.analysis.Analyzer$FixNullability 118882430
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveMissingReferences 74401505
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics 40068476
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer 32929965
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator 30524660
org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts 30453770
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions 28383135
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame 26168955
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowOrder 25736499
org.apache.spark.sql.catalyst.analysis.TimeWindowing 24807670
org.apache.spark.sql.catalyst.analysis.DecimalPrecision 24000260
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery 21653219
org.apache.spark.sql.catalyst.analysis.TypeCoercion$InConversion 20830229
org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings 19183636
org.apache.spark.sql.catalyst.analysis.TypeCoercion$FunctionArgumentConversion 17849664
org.apache.spark.sql.catalyst.analysis.TypeCoercion$BooleanEquality 15186886
org.apache.spark.sql.catalyst.analysis.TypeCoercion$IfCoercion 13994296
org.apache.spark.sql.catalyst.analysis.TypeCoercion$Division 13929023
org.apache.spark.sql.catalyst.analysis.TypeCoercion$DateTimeOperations 13468710
org.apache.spark.sql.catalyst.analysis.CleanupAliases 13210810
org.apache.spark.sql.catalyst.analysis.TypeCoercion$StringToIntegralCasts 13191046
org.apache.spark.sql.catalyst.analysis.Analyzer$PullOutNondeterministic 11310837
org.apache.spark.sql.catalyst.analysis.Analyzer$HandleNullInputsForUDF 10712897
org.apache.spark.sql.catalyst.analysis.TypeCoercion$CaseWhenCoercion 10589030
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases 7172334
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions 5994564
org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution 5914136
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy 5303578
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin 4060244
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot 3174805
org.apache.spark.sql.catalyst.analysis.EliminateUnions 2787433
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate 2731683
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations 2624228
org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates 2417768
org.apache.spark.sql.catalyst.analysis.Analyzer$WindowsSubstitution 2368503
org.apache.spark.sql.execution.datasources.PreprocessTableInsertion 2126155
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNewInstance 2059795
org.apache.spark.sql.execution.datasources.DataSourceAnalysis 1944978
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast 1912039
org.apache.spark.sql.execution.datasources.ResolveDataSource 1896232
org.apache.spark.sql.catalyst.analysis.TypeCoercion$WidenSetOperationTypes 1623414
org.apache.spark.sql.execution.datasources.FindDataSourceTable 1623004
```

I think we should take a look at `ResolveReferences`. I do think your PR has merit; we really shouldn't be analyzing the same plan twice.
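To read a dump like this more easily, the nanosecond totals can be converted to seconds and ranked. The following is a minimal sketch that assumes the dump is plain text with one `ruleName nanoseconds` pair per line, as quoted in this message; `RuleTimingSketch` and its methods are hypothetical helpers and not part of Spark.

```scala
// Hypothetical post-processing helper; assumes "ruleName nanos" per line.
object RuleTimingSketch {
  // Parses the dump text, skipping any line that is not "name number".
  def parse(dump: String): Seq[(String, Long)] =
    dump.split("\n").toSeq.flatMap { line =>
      line.trim.split("\\s+") match {
        case Array(rule, nanos) if nanos.nonEmpty && nanos.forall(_.isDigit) =>
          Some(rule -> nanos.toLong)
        case _ => None
      }
    }

  // Converts nanoseconds to seconds.
  def toSeconds(nanos: Long): Double = nanos / 1e9

  // Fraction of the total recorded time spent in the given rule.
  def shareOfTotal(timings: Seq[(String, Long)], rule: String): Double = {
    val total = timings.map(_._2).sum.toDouble
    timings.find(_._1 == rule).map(_._2 / total).getOrElse(0.0)
  }

  def main(args: Array[String]): Unit = {
    // First three entries copied from the dump quoted in this message.
    val sample =
      """org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences 18784486408
        |org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions 505619796
        |org.apache.spark.sql.catalyst.analysis.TypeCoercion$PropagateTypes 195027905""".stripMargin
    // Print rules by descending time, in seconds.
    parse(sample).sortBy(-_._2).foreach { case (rule, ns) =>
      println(f"${toSeconds(ns)}%10.3f s  $rule")
    }
  }
}
```

Read this way, `ResolveReferences` accounts for about 18.8 seconds, while the next rule in the dump takes about 0.5 seconds.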
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14044 Thank you for the review, @hvanhovell . BTW, a single analysis takes over 12 seconds. Elapsed time: 25.787751452s --> Elapsed time: 12.364812255s. I executed `time(sql(query))` twice because of the SQL parser and other overhead.
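The `time(...)` helper used for these measurements is not shown in the thread. A minimal sketch of what such a helper could look like follows; `TimingSketch`, `timed`, and `time` are assumed names, and only the `Elapsed time: ...s` output shape is taken from the message above.

```scala
// Hypothetical timing helper; the thread does not show the real one.
object TimingSketch {
  // Runs `body` once and returns its result with the elapsed wall time in seconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    val seconds = (System.nanoTime() - start) / 1e9
    (result, seconds)
  }

  // Same as `timed`, but prints the elapsed time and returns only the result.
  def time[A](body: => A): A = {
    val (result, seconds) = timed(body)
    println(s"Elapsed time: ${seconds}s")
    result
  }

  def main(args: Array[String]): Unit = {
    // Placeholder workload; in the thread the timed expression is sql(query).
    val sum = time((1L to 1000000L).sum)
    println(sum)
  }
}
```

Because `body` is a by-name parameter, the expression is evaluated inside the helper, so the measurement covers the call itself rather than a value computed beforehand.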
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/14044 Any idea what causes the regression? 5 seconds seems way too long for analysis...
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14044 Thank you for review, @naliazheli .
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user naliazheli commented on the issue: https://github.com/apache/spark/pull/14044 LGTM
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14044 Merged build finished. Test PASSed.
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14044 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61714/
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14044

**[Test build #61714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61714/consoleFull)** for PR 14044 at commit [`1402a9d`](https://github.com/apache/spark/commit/1402a9d21cdce66158858560902571a9d91ac2fa).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14044 cc @cloud-fan , too.
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14044 Hi, @liancheng and @rxin . Could you review this PR? This code path was introduced during the Dataset/DataFrame merging.
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14044 **[Test build #61714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61714/consoleFull)** for PR 14044 at commit [`1402a9d`](https://github.com/apache/spark/commit/1402a9d21cdce66158858560902571a9d91ac2fa).