[ https://issues.apache.org/jira/browse/SPARK-31137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
S Daniel Zafar updated SPARK-31137:
-----------------------------------
    Comment: was deleted

(was: Moving this to Databricks internal board.)

> Opportunity to simplify execution plan when passing empty dataframes to
> subtract()
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-31137
>                 URL: https://issues.apache.org/jira/browse/SPARK-31137
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 2.4.5
>            Reporter: S Daniel Zafar
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The execution plan is the same whether an empty or a non-empty DataFrame is
> passed to PySpark's subtract() call.
> {code:java}
> df.subtract(regDf){code}
> yields the same physical plan as:
> {code:java}
> df.subtract(emptyDf){code}
> Since the operation (EXCEPT DISTINCT in Spark SQL) requires a sort on both
> DataFrames, special-casing an empty argument could yield a significant
> speed-up: if the incoming DataFrame is empty, there is nothing to subtract,
> so no processing against it should happen.
>
> Should be a quick fix for a seasoned committer.
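
To make the requested simplification concrete, here is a minimal sketch of the rewrite as a rule over a toy logical plan. It does not use Spark or Catalyst: the classes `Relation`, `Distinct`, and `ExceptDistinct` and the functions `simplify` and `execute` are hypothetical names invented for this illustration. It relies only on the semantics of EXCEPT DISTINCT: subtracting an empty input leaves the distinct rows of the left side, and subtracting from an empty input yields an empty result. It also assumes emptiness is known at planning time, which in practice would hold only for relations that are statically empty.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Toy logical plan nodes (hypothetical, not Spark's Catalyst classes).
@dataclass(frozen=True)
class Relation:
    name: str
    rows: Tuple[tuple, ...]


@dataclass(frozen=True)
class Distinct:
    child: "Plan"


@dataclass(frozen=True)
class ExceptDistinct:
    left: "Plan"
    right: "Plan"


Plan = Union[Relation, Distinct, ExceptDistinct]


def is_empty(plan: Plan) -> bool:
    """True only when the plan is a relation statically known to have no rows."""
    return isinstance(plan, Relation) and not plan.rows


def simplify(plan: Plan) -> Plan:
    """Rewrite EXCEPT DISTINCT nodes whose inputs are known to be empty."""
    if isinstance(plan, ExceptDistinct):
        left, right = simplify(plan.left), simplify(plan.right)
        if is_empty(left):
            return left            # nothing to subtract from
        if is_empty(right):
            return Distinct(left)  # nothing to subtract, only deduplicate
        return ExceptDistinct(left, right)
    if isinstance(plan, Distinct):
        child = simplify(plan.child)
        return child if is_empty(child) else Distinct(child)
    return plan


def execute(plan: Plan) -> list:
    """Naive reference evaluator, used to check that the rewrite keeps results."""
    if isinstance(plan, Relation):
        return list(plan.rows)
    if isinstance(plan, Distinct):
        return list(dict.fromkeys(execute(plan.child)))
    right_rows = set(execute(plan.right))
    return [r for r in dict.fromkeys(execute(plan.left)) if r not in right_rows]


df = Relation("df", ((1,), (2,), (2,), (3,)))
reg_df = Relation("regDf", ((2,),))
empty_df = Relation("emptyDf", ())

print(simplify(ExceptDistinct(df, reg_df)))    # unchanged: both inputs are needed
print(simplify(ExceptDistinct(df, empty_df)))  # reduced to Distinct(df)
```

In this sketch the right-hand input disappears from the plan entirely when it is empty, so only the deduplication of `df` remains. Whether and where an equivalent rule fits into Spark's optimizer is a question for the committers.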