Stephen Offer created SPARK-44336:
-------------------------------------

             Summary: Add Python inbuilt functions to DataFrame for ease of use 
for Python developers
                 Key: SPARK-44336
                 URL: https://issues.apache.org/jira/browse/SPARK-44336
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
    Affects Versions: 3.4.1
            Reporter: Stephen Offer


Python developers are used to Python's built-in operators and functions, but 
PySpark does not support any of the most commonly used ones for DataFrames. 
PySpark already has this functionality for columns but not for the DataFrame 
itself. Adding this support for DataFrames would simplify some parts of 
development. For example:


{code:python}
if df == df1: ...   # DataFrame equality
if df != df2: ...   # DataFrame inequality

df_large = df * 100 # Quickly make a larger DataFrame through a union of copies
                    # Very useful for performance testing

df_sub = df1 - df2  # Simple DataFrame subtraction
                    # Equivalent to df1.subtract(df2)

df4 = df + df1      # Equivalent to df.union(df1)

len(df)             # Equivalent to df.count()

for row in df:      # Equivalent to `for row in df.collect():`
    some_work(row)

if "company_name" in df: ...  # Check if item is in the DataFrame
{code}
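
To make the proposed mapping concrete, here is a minimal sketch of how these operators could delegate to methods that DataFrame already has (union, subtract, count, collect, columns). It is an illustration under assumptions, not a proposed implementation: PyFrame is a hypothetical wrapper name, FakeFrame is a hypothetical in-memory stand-in so the sketch runs without a Spark session, the `in` check is assumed to test column names, and the equality shown (same columns, same rows regardless of order) is only one possible definition and is not taken from the PR mentioned below.

```python
from collections import Counter
from functools import reduce


class FakeFrame:
    """Hypothetical stand-in exposing only the DataFrame methods used below."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self._rows = [tuple(r) for r in rows]

    def union(self, other):
        # Like DataFrame.union: concatenates rows and keeps duplicates.
        return FakeFrame(self.columns, self._rows + other._rows)

    def subtract(self, other):
        # Like DataFrame.subtract: distinct rows of self that are not in other.
        excluded = set(other._rows)
        kept = [r for r in dict.fromkeys(self._rows) if r not in excluded]
        return FakeFrame(self.columns, kept)

    def count(self):
        return len(self._rows)

    def collect(self):
        return list(self._rows)


class PyFrame:
    """Hypothetical wrapper mapping Python operators onto DataFrame methods."""

    def __init__(self, df):
        self.df = df

    def __add__(self, other):       # df + df1  ->  df.union(df1)
        return PyFrame(self.df.union(other.df))

    def __sub__(self, other):       # df1 - df2  ->  df1.subtract(df2)
        return PyFrame(self.df.subtract(other.df))

    def __mul__(self, n):           # df * n  ->  union of n copies
        if not isinstance(n, int) or n < 1:
            raise ValueError("n must be a positive integer")
        return PyFrame(reduce(lambda acc, _: acc.union(self.df),
                              range(n - 1), self.df))

    def __len__(self):              # len(df)  ->  df.count()
        return self.df.count()

    def __iter__(self):             # for row in df  ->  df.collect()
        return iter(self.df.collect())

    def __contains__(self, name):   # assumption: membership means column name
        return name in self.df.columns

    def __eq__(self, other):        # assumption: order-insensitive row equality
        if not isinstance(other, PyFrame):
            return NotImplemented
        return (self.df.columns == other.df.columns
                and Counter(self.df.collect()) == Counter(other.df.collect()))

    __hash__ = None  # defining __eq__ makes instances unhashable
```

Because the wrapper only calls union, subtract, count, collect and columns, the same class should also work around a real PySpark DataFrame. Note that `__len__`, `__iter__` and `__eq__` each trigger a Spark action, and `__iter__` and `__eq__` collect every row to the driver, so they are only practical for small DataFrames.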
 

There is an ongoing effort to add a DataFrame equality function in PR 41833; 
I have also built my own.


These are suggestions; which functions to add to or remove from this list is 
open for discussion.



