[jira] [Created] (SPARK-47998) pandas-on-spark DataFrame.concat will not join a Pandas dataframe and raises a misleading error
Philip Kahn created SPARK-47998:
-----------------------------------

             Summary: pandas-on-spark DataFrame.concat will not join a Pandas dataframe and raises a misleading error
                 Key: SPARK-47998
                 URL: https://issues.apache.org/jira/browse/SPARK-47998
             Project: Spark
          Issue Type: Bug
          Components: Pandas API on Spark
    Affects Versions: 3.4.3
            Reporter: Philip Kahn


The `concat` method has a strict type check that raises a misleading error:

!image-2024-04-25-11-33-29-208.png!

Note that the error reports the type of `objs` (the containing list) rather than of each element `obj`, so passing a list of mixed objects produces a message saying it cannot concatenate objects of type `list`, instead of naming the element types that actually failed.

Additionally, this check strictly requires pandas-on-Spark Series and DataFrames. Since both constructors will happily accept a native Pandas object, something like

objs = [DataFrame(x) if isinstance(x, pd.DataFrame) else Series(x) if isinstance(x, pd.Series) else x for x in objs]

would trivially make those cases work, and would also prevent a different strange error reporting that a dataframe wasn't valid in a dataframe concatenation.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
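As a sketch of the coercion proposed above: the dispatch logic can be tested without a Spark session by substituting hypothetical stand-ins for `pyspark.pandas.DataFrame` / `pyspark.pandas.Series` (the `ps_dataframe` / `ps_series` names below are placeholders, not real API).

```python
import pandas as pd

# Hypothetical stand-ins for ps.DataFrame / ps.Series: any callables that
# accept a native pandas object. Here they just tag what they received, so
# the dispatch logic is testable without a running Spark session.
def ps_dataframe(obj):
    return ("ps.DataFrame", obj)

def ps_series(obj):
    return ("ps.Series", obj)

def coerce_objs(objs):
    """Coerce native pandas objects before concat's strict type check."""
    return [
        ps_dataframe(x) if isinstance(x, pd.DataFrame)
        else ps_series(x) if isinstance(x, pd.Series)
        else x  # anything else falls through to the existing type check
        for x in objs
    ]

mixed = [pd.DataFrame({"a": [1, 2]}), pd.Series([3, 4]), "not concatenable"]
coerced = coerce_objs(mixed)
```

Note the `DataFrame` check must come first only for clarity; a pandas `DataFrame` is not a `Series` subclass, so the two `isinstance` branches cannot collide.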
[jira] [Created] (SPARK-47997) Pandas-on-Spark incompletely implements DataFrame.drop
Philip Kahn created SPARK-47997:
-----------------------------------

             Summary: Pandas-on-Spark incompletely implements DataFrame.drop
                 Key: SPARK-47997
                 URL: https://issues.apache.org/jira/browse/SPARK-47997
             Project: Spark
          Issue Type: Bug
          Components: Pandas API on Spark
    Affects Versions: 3.4.3
            Reporter: Philip Kahn


Since Pandas v1.0, `drop` has supported the `errors` kwarg: https://pandas.pydata.org/pandas-docs/version/1.0/reference/api/pandas.DataFrame.drop.html

Pandas-on-Spark does not implement it. This is especially glaring because the PySpark `drop` is a no-op on absent columns, behaving like `errors='ignore'`, so _extra_ work had to be done to implement the raising behaviour.
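For reference, this is the plain-pandas behaviour the ticket asks pandas-on-Spark to mirror (a sketch against pandas itself, since that is the API being emulated):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

# errors='ignore': absent labels are silently skipped -- the same no-op
# semantics the PySpark drop already has.
out = df.drop(columns=["b", "missing"], errors="ignore")
print(list(out.columns))  # -> ['a']

# errors='raise' (the pandas default): an absent label raises KeyError.
try:
    df.drop(columns=["missing"])
    raised = False
except KeyError:
    raised = True
```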
[jira] [Created] (SPARK-47996) Pandas-on-Spark incompletely implements merge methods
Philip Kahn created SPARK-47996:
-----------------------------------

             Summary: Pandas-on-Spark incompletely implements merge methods
                 Key: SPARK-47996
                 URL: https://issues.apache.org/jira/browse/SPARK-47996
             Project: Spark
          Issue Type: Bug
          Components: Pandas API on Spark
    Affects Versions: 3.4.3
            Reporter: Philip Kahn


Since Pandas >= 1.2 (current: 2.2), the `how` parameter of `merge` has supported the "cross" method: https://pandas.pydata.org/pandas-docs/version/1.2/reference/api/pandas.DataFrame.merge.html

Pandas-on-Spark does not implement it, which breaks API compatibility.
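The missing behaviour, shown in plain pandas (requires pandas >= 1.2): `how="cross"` produces the Cartesian product of the two frames, with no join keys.

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2]})
right = pd.DataFrame({"y": ["a", "b"]})

# how='cross' pairs every left row with every right row: 2 x 2 = 4 rows.
crossed = left.merge(right, how="cross")
```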
[jira] [Created] (SPARK-39142) Type overloads in `pandas_udf`
Philip Kahn created SPARK-39142:
-----------------------------------

             Summary: Type overloads in `pandas_udf`
                 Key: SPARK-39142
                 URL: https://issues.apache.org/jira/browse/SPARK-39142
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.2.1
            Reporter: Philip Kahn


It seems that the `returnType` in the type overloads for `pandas_udf` never specifies a generic for PySpark SQL types, nor explicitly lists those types: https://github.com/apache/spark/blob/f84018a4810867afa84658fec76494aaae6d57fc/python/pyspark/sql/pandas/functions.pyi

This results in static type checkers flagging the types of the decorated functions (and their parameters) as incorrect; see https://github.com/microsoft/pylance-release/issues/2789 for an example. For someone familiar with the code base, this should be a very fast patch.
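A minimal sketch of the overload shape the stubs could take. `DataType` and the `pandas_udf` below are hypothetical stand-ins (not the real pyspark implementations); the point is that `returnType` should accept either a SQL `DataType` instance or a DDL string while the decorator preserves the decorated function's own signature for type checkers.

```python
from typing import Callable, TypeVar, Union, overload

class DataType:
    """Stand-in for pyspark.sql.types.DataType (hypothetical)."""

F = TypeVar("F", bound=Callable)

# Two overloads so checkers accept both a DataType instance and a DDL
# string for returnType, and keep F (the wrapped function's type) intact.
@overload
def pandas_udf(returnType: DataType) -> Callable[[F], F]: ...
@overload
def pandas_udf(returnType: str) -> Callable[[F], F]: ...

def pandas_udf(returnType: Union[DataType, str]) -> Callable[[F], F]:
    # Runtime behaviour is irrelevant to the stub question; pass through.
    def wrap(func: F) -> F:
        return func
    return wrap

@pandas_udf("long")
def plus_one(v):
    return v + 1
```

With overloads like these, a checker sees `plus_one` as its original callable type rather than an opaque decorated object.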
[jira] [Created] (SPARK-30239) [Python] Creating a dataframe with Pandas rather than Numpy datatypes fails
Philip Kahn created SPARK-30239:
-----------------------------------

             Summary: [Python] Creating a dataframe with Pandas rather than Numpy datatypes fails
                 Key: SPARK-30239
                 URL: https://issues.apache.org/jira/browse/SPARK-30239
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.3
         Environment: DataBricks: 48.00 GB | 24 Cores | DBR 6.0 | Spark 2.4.3 | Scala 2.11
            Reporter: Philip Kahn


It's possible to work with DataFrames in Pandas and shuffle them back over to Spark dataframes for processing; however, using Pandas extension datatypes like {{Int64}} ( https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html ) throws an error (that long / float can't be converted). This is internally because {{np.nan}} is a float, and {{pd.Int64Dtype()}} allows only integers plus the single float value {{np.nan}}.

The current workaround is to keep the columns as floats and, after conversion to the Spark DataFrame, recast the column as {{LongType()}}. For example:

{{sdfC = spark.createDataFrame(kgridCLinked)}}
{{sdfC = sdfC.withColumn("gridID", sdfC["gridID"].cast(LongType()))}}

However, this is awkward and redundant.
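The pandas half of that workaround can be sketched without a Spark session: downcast the nullable {{Int64}} column to {{float64}} (missing values become {{NaN}}) before calling `spark.createDataFrame`, then cast back to `LongType()` on the Spark side as shown in the ticket. The `gridID` frame below is illustrative data, not from the report.

```python
import pandas as pd

# A pandas nullable-integer column: integers plus a missing value.
pdf = pd.DataFrame({"gridID": pd.array([1, 2, None], dtype="Int64")})

# Downcast to a plain NumPy float64 column; pd.NA becomes NaN. A frame in
# this form is accepted by spark.createDataFrame, after which the column
# can be cast back to LongType() as in the workaround above.
pdf["gridID"] = pdf["gridID"].astype("float64")
```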