S Daniel Zafar created SPARK-31137:
--------------------------------------

             Summary: Opportunity to simplify execution plan when passing empty 
dataframes to subtract()
                 Key: SPARK-31137
                 URL: https://issues.apache.org/jira/browse/SPARK-31137
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 2.4.5
            Reporter: S Daniel Zafar


Execution plans are similar whether an empty or a non-empty DataFrame is passed 
to PySpark's subtract() call.
{code:java}
df.subtract(regDf){code}
yields the same physical plan as:
{code:java}
df.subtract(emptyDf){code}
Since the operation (EXCEPT DISTINCT in Spark SQL) requires a sort on both 
DataFrames, simplifying the plan for the empty case could yield significant 
speed-ups: if the incoming DataFrame is empty, no processing should happen.
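
Until the optimizer handles this case, a user-side workaround might look like the sketch below. The helper name subtract_skip_empty is hypothetical and not part of the PySpark API; it relies only on the existing DataFrame methods head(), distinct() and subtract(). It assumes that checking for emptiness with head(1) costs less than the full subtract plan, which may not hold if the second DataFrame is itself expensive to compute.

```python
def subtract_skip_empty(df, other):
    """Hypothetical helper: EXCEPT DISTINCT that bypasses subtract() when `other` is empty."""
    # head(1) returns a list with at most one row. Note that this check
    # triggers a Spark job of its own, so it is not free.
    if len(other.head(1)) == 0:
        # subtract() has EXCEPT DISTINCT semantics, so subtracting nothing
        # still deduplicates; distinct() keeps the result equivalent.
        return df.distinct()
    # Non-empty case: fall back to the regular subtract plan.
    return df.subtract(other)
```

With the names from the report, subtract_skip_empty(df, emptyDf) would plan only a distinct() on df, while subtract_skip_empty(df, regDf) behaves like df.subtract(regDf).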

 

Should be a quick fix for a seasoned committer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

