S Daniel Zafar created SPARK-31137:
--------------------------------------

             Summary: Opportunity to simplify execution plan when passing empty dataframes to subtract()
                 Key: SPARK-31137
                 URL: https://issues.apache.org/jira/browse/SPARK-31137
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 2.4.5
            Reporter: S Daniel Zafar

The execution plan does not change depending on whether an empty or a non-empty DataFrame is passed to PySpark's subtract() call.
{code:java}
df.subtract(regDf)
{code}
yields the same physical plan as:
{code:java}
df.subtract(emptyDf)
{code}
Since the operation (EXCEPT DISTINCT in Spark SQL) requires a sort on both DataFrames, simplifying the plan for this case could yield significant speed-ups, because no processing of the incoming DataFrame should be needed when it is empty.

Should be a quick fix for a seasoned committer.
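
Until the optimizer handles this case, the same effect can be approximated on the caller's side. The following is only a minimal sketch, not the change requested above: subtract_skipping_empty is a hypothetical helper name, and it relies on the assumption that EXCEPT DISTINCT against an empty input reduces to the distinct rows of the left side. Note that the emptiness check is not free, since take(1) itself triggers a Spark job.

```python
def subtract_skipping_empty(df, other):
    """Hypothetical helper: like df.subtract(other), but short-circuits
    when `other` has no rows.

    `df` and `other` are expected to be PySpark DataFrames; only their
    take(), distinct() and subtract() methods are used.
    """
    # take(1) fetches at most one row, so it serves as an emptiness probe.
    # It still runs a Spark job, so this only pays off when the avoided
    # EXCEPT DISTINCT plan is more expensive than the probe.
    if not other.take(1):
        # Nothing to subtract: EXCEPT DISTINCT leaves the distinct rows of df.
        return df.distinct()
    return df.subtract(other)
```

With the DataFrames from the description, subtract_skipping_empty(df, emptyDf) would plan only a distinct() over df, while subtract_skipping_empty(df, regDf) falls through to the regular subtract() call.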