Re: [PR] [SPARK-44662] Perf improvement in BroadcastHashJoin queries with stream side join key on non partition columns [spark]
github-actions[bot] closed pull request #45190: [SPARK-44662] Perf improvement in BroadcastHashJoin queries with stream side join key on non partition columns URL: https://github.com/apache/spark/pull/45190 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44662] Perf improvement in BroadcastHashJoin queries with stream side join key on non partition columns [spark]
github-actions[bot] commented on PR #45190: URL: https://github.com/apache/spark/pull/45190#issuecomment-2144078242 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[PR] [SPARK-44662] Perf improvement in BroadcastHashJoin queries with stream side join key on non partition columns [spark]
ahshahid opened a new pull request, #45190: URL: https://github.com/apache/spark/pull/45190

What changes were proposed in this pull request?

Along the lines of DPP (dynamic partition pruning), which helps DataSourceV2 relations when the join key is a partition column, the same concept can be extended to the case where the join key is not a partition column. In this PR, the keys available in BroadcastHashJoinExec are pushed down to the DataSourceV2 scans in the form of a SortedSet.

For non-partition columns, data sources such as Iceberg keep min/max column stats at the manifest level, and formats such as Parquet keep min/max stats at several levels of their data structures. The pushed-down SortedSet can be used for range-based pruning both at the driver level (manifest files) and at the executor level (while reading chunks, row groups, etc. in Parquet).

If the data is stored in columnar batch format, individual rows cannot be filtered out at the DataSource level, even though the keys are available. At the scan level (ColumnarToRowExec), however, it is still possible to filter out as many rows as possible if the query involves nested joins, which reduces the number of rows to be joined at the higher join levels.

A presentation outlining the idea: [Broadcast Keys pushdown](https://docs.google.com/presentation/d/165Rx7i00TmAKnDJpSQLfrcrW-ShrzPy5/edit?usp=drive_link)

SPIP: [SPIP-44662](https://issues.apache.org/jira/browse/SPARK-44662)

Why are the changes needed?

There is scope for improving the performance of Inner and Left Semi join queries that use BroadcastHashJoin.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Ran the TPC-DS suite using Iceberg as the data source. Converted many of the existing Spark query tests to also run using Iceberg as the data source. More unit tests will be added.
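The range-pruning idea in the description can be illustrated with a minimal Python sketch. It is not the PR's implementation: `FileStats`, `may_contain_any_key`, and `prune_files` are hypothetical names, integer keys are assumed, and a sorted list with binary search stands in for the SortedSet. A file (or row group) is kept only if at least one broadcast key falls inside its min/max range.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file (or per-row-group) min/max stats for the join column."""
    path: str
    min_key: int
    max_key: int


def may_contain_any_key(sorted_keys, lo, hi):
    """Return True if at least one key k in sorted_keys satisfies lo <= k <= hi."""
    # Index of the smallest key that is >= lo (binary search on the sorted keys).
    i = bisect_left(sorted_keys, lo)
    return i < len(sorted_keys) and sorted_keys[i] <= hi


def prune_files(files, broadcast_keys):
    """Keep only the files whose [min_key, max_key] range can hold a broadcast key."""
    sorted_keys = sorted(set(broadcast_keys))  # stand-in for the pushed-down SortedSet
    return [f for f in files
            if may_contain_any_key(sorted_keys, f.min_key, f.max_key)]


files = [
    FileStats("part-0", 1, 100),
    FileStats("part-1", 101, 200),
    FileStats("part-2", 201, 300),
]
kept = prune_files(files, [250, 7, 250])
print([f.path for f in kept])  # ['part-0', 'part-2']
```

Keeping the keys sorted is what makes each range check a single binary search rather than a scan over all keys, which matters when the same check runs once per manifest entry or row group.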
Re: [PR] [SPARK-44662] Perf improvement in BroadcastHashJoin queries with stream side join key on non partition columns [spark]
ahshahid closed pull request #43373: [SPARK-44662] Perf improvement in BroadcastHashJoin queries with stream side join key on non partition columns URL: https://github.com/apache/spark/pull/43373
Re: [PR] [SPARK-44662] Perf improvement in BroadcastHashJoin queries with stream side join key on non partition columns [spark]
ahshahid commented on PR #43373: URL: https://github.com/apache/spark/pull/43373#issuecomment-1955564018 I will be closing this PR and creating a new one, as I have renamed the branch on which this PR was created.
Re: [PR] [SPARK-44662] Perf improvement in BroadcastHashJoin queries with stream side join key on non partition columns [spark]
github-actions[bot] closed pull request #42350: [SPARK-44662] Perf improvement in BroadcastHashJoin queries with stream side join key on non partition columns URL: https://github.com/apache/spark/pull/42350
Re: [PR] [SPARK-44662] Perf improvement in BroadcastHashJoin queries with stream side join key on non partition columns [spark]
github-actions[bot] commented on PR #42350: URL: https://github.com/apache/spark/pull/42350#issuecomment-1817695897 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[PR] [SPARK-44662] Perf improvement in BroadcastHashJoin queries with stream side join key on non partition columns [spark]
ahshahid opened a new pull request, #43373: URL: https://github.com/apache/spark/pull/43373

What changes were proposed in this pull request?

Along the lines of DPP (dynamic partition pruning), which helps DataSourceV2 relations when the join key is a partition column, the same concept can be extended to the case where the join key is not a partition column. In this PR, the keys available in BroadcastHashJoinExec are pushed down to the DataSourceV2 scans in the form of a SortedSet.

For non-partition columns, data sources such as Iceberg keep min/max column stats at the manifest level, and formats such as Parquet keep min/max stats at several levels of their data structures. The pushed-down SortedSet can be used for range-based pruning both at the driver level (manifest files) and at the executor level (while reading chunks, row groups, etc. in Parquet).

If the data is stored in columnar batch format, individual rows cannot be filtered out at the DataSource level, even though the keys are available. At the scan level (ColumnarToRowExec), however, it is still possible to filter out as many rows as possible if the query involves nested joins, which reduces the number of rows to be joined at the higher join levels.

A presentation outlining the idea: [Broadcast Keys pushdown](https://docs.google.com/presentation/d/165Rx7i00TmAKnDJpSQLfrcrW-ShrzPy5/edit?usp=drive_link)

SPIP: [SPIP-44662](https://issues.apache.org/jira/browse/SPARK-44662)

Why are the changes needed?

There is scope for improving the performance of Inner and Left Semi join queries that use BroadcastHashJoin.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Ran the TPC-DS suite using Iceberg as the data source. Converted many of the existing Spark query tests to also run using Iceberg as the data source. More unit tests will be added.
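The scan-level filtering mentioned for columnar data can be pictured with a small Python sketch. It only illustrates the idea and is not Spark's columnar-to-row code: the batch is modeled as a plain dict of equal-length column lists, and `columnar_to_rows` is a hypothetical helper. Rows whose join key is absent from the broadcast keys are dropped while the batch is turned into rows, so they never reach the joins higher in the plan.

```python
def columnar_to_rows(batch, key_column, broadcast_keys):
    """Yield rows from a column-oriented batch, skipping rows whose join key
    is not among the broadcast keys.

    batch          -- dict mapping column name to a list of values (assumed layout)
    key_column     -- name of the join-key column in the batch
    broadcast_keys -- set of keys taken from the broadcast side of the join
    """
    column_names = list(batch)
    key_values = batch[key_column]
    for i in range(len(key_values)):
        # Membership test happens before the row is materialized.
        if key_values[i] in broadcast_keys:
            yield {name: batch[name][i] for name in column_names}


batch = {"item_id": [1, 2, 3, 4], "qty": [10, 20, 30, 40]}
rows = list(columnar_to_rows(batch, "item_id", frozenset({2, 4, 9})))
print(rows)  # [{'item_id': 2, 'qty': 20}, {'item_id': 4, 'qty': 40}]
```

This applies only to Inner and Left Semi joins, the cases named in the description, where a stream-side row without a matching key cannot contribute to the result.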