[jira] [Created] (SPARK-47998) pandas-on-spark DataFrame.concat will not join a Pandas dataframe and raises a misleading error

2024-04-25 Thread Philip Kahn (Jira)
Philip Kahn created SPARK-47998:
---

 Summary: pandas-on-spark DataFrame.concat will not join a Pandas 
dataframe and raises a misleading error
 Key: SPARK-47998
 URL: https://issues.apache.org/jira/browse/SPARK-47998
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark
Affects Versions: 3.4.3
Reporter: Philip Kahn


The `concat` method has a strict type check that raises a misleading error:

(screenshot attached to the Jira issue: the misleading error message raised by `concat`)
Note that the type in the error is that of `objs` (the whole container) rather than the offending `obj` inside it, so passing a list of mixed objects produces a message saying it cannot concatenate objects of type list, rather than naming the internal types that actually failed.

 

Additionally, this strictly checks for pandas-on-Spark Series and DataFrames; since both constructors will happily convert a plain Pandas object, something like

objs = [DataFrame(x) if isinstance(x, pd.DataFrame) else Series(x) if isinstance(x, pd.Series) else x for x in objs]

would trivially make those cases work, and would prevent a different strange error reporting that a dataframe wasn't valid in a dataframe concatenation.
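Spelled out, a minimal sketch of that coercion (assuming the conventional `import pandas as pd` and `import pyspark.pandas as ps` aliases; `_coerce_pandas` is a hypothetical helper name, not existing API):

import pandas as pd
import pyspark.pandas as ps

def _coerce_pandas(objs):
    # Promote plain Pandas objects to their pandas-on-Spark
    # equivalents; anything else passes through unchanged, so the
    # existing type check can still reject genuinely invalid inputs.
    return [
        ps.DataFrame(x) if isinstance(x, pd.DataFrame)
        else ps.Series(x) if isinstance(x, pd.Series)
        else x
        for x in objs
    ]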






[jira] [Created] (SPARK-47997) Pandas-on-Spark incompletely implements DataFrame.drop

2024-04-25 Thread Philip Kahn (Jira)
Philip Kahn created SPARK-47997:
---

 Summary: Pandas-on-Spark incompletely implements DataFrame.drop
 Key: SPARK-47997
 URL: https://issues.apache.org/jira/browse/SPARK-47997
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark
Affects Versions: 3.4.3
Reporter: Philip Kahn


For Pandas v1.0+, `drop` supports the `errors` kwarg:

[https://pandas.pydata.org/pandas-docs/version/1.0/reference/api/pandas.DataFrame.drop.html]

 

Pandas-on-Spark does not implement it. This is especially glaring because the PySpark `drop` is a no-op on absent columns, behaving like `errors='ignore'`, so _extra_ work had to be done to implement the raising behaviour.
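A minimal compatibility shim under that reading (`drop_compat` is a hypothetical name, not proposed API, and it handles column labels only):

import pyspark.pandas as ps

def drop_compat(df: ps.DataFrame, labels, errors="raise"):
    # Emulate the Pandas `errors` kwarg: with errors='ignore',
    # silently skip labels that aren't present instead of raising.
    labels = [labels] if isinstance(labels, str) else list(labels)
    if errors == "ignore":
        labels = [c for c in labels if c in df.columns]
    return df.drop(columns=labels)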






[jira] [Created] (SPARK-47996) Pandas-on-Spark incompletely implements merge methods

2024-04-25 Thread Philip Kahn (Jira)
Philip Kahn created SPARK-47996:
---

 Summary: Pandas-on-Spark incompletely implements merge methods
 Key: SPARK-47996
 URL: https://issues.apache.org/jira/browse/SPARK-47996
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark
Affects Versions: 3.4.3
Reporter: Philip Kahn


For Pandas >= 1.2 ( 
[https://pandas.pydata.org/pandas-docs/version/1.2/reference/api/pandas.DataFrame.merge.html]
 ; current = 2.2), the `how` parameter of `merge` supports the "cross" method, which is absent from Pandas-on-Spark.

 

This breaks API compatibility.
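Until then, a cross merge can be approximated by joining on a constant key, as in this sketch (the data and the temporary `_cross_key` column name are made up):

import pyspark.pandas as ps

left = ps.DataFrame({"a": [1, 2]})
right = ps.DataFrame({"b": ["x", "y"]})

# In Pandas this would be: left.merge(right, how="cross")
key = "_cross_key"
crossed = (
    left.assign(**{key: 0})
    .merge(right.assign(**{key: 0}), on=key)
    .drop(columns=key)
)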






[jira] [Created] (SPARK-39142) Type overloads in `pandas_udf`

2022-05-10 Thread Philip Kahn (Jira)
Philip Kahn created SPARK-39142:
---

 Summary: Type overloads in `pandas_udf` 
 Key: SPARK-39142
 URL: https://issues.apache.org/jira/browse/SPARK-39142
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.1
Reporter: Philip Kahn


It seems that the type overloads for `pandas_udf` never specify a generic over PySpark SQL types for `returnType`, nor explicitly list those types:

 

[https://github.com/apache/spark/blob/f84018a4810867afa84658fec76494aaae6d57fc/python/pyspark/sql/pandas/functions.pyi]

 

This results in static type checkers flagging the types of the decorated 
functions (and their parameters) as incorrect; see 
[https://github.com/microsoft/pylance-release/issues/2789] for an example.
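An illustrative repro (my own minimal example, not taken from the linked issue): the decorator call below is valid at runtime, yet a checker relying on the stub overloads can flag the decorated function's type.

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import LongType

@pandas_udf(LongType())  # a DataType instance as returnType
def double(s: pd.Series) -> pd.Series:
    # Runs fine under PySpark; the complaint comes only from the
    # static overloads in functions.pyi.
    return s * 2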

 

For someone familiar with the code base, this should be a very fast patch.






[jira] [Created] (SPARK-30239) [Python] Creating a dataframe with Pandas rather than Numpy datatypes fails

2019-12-12 Thread Philip Kahn (Jira)
Philip Kahn created SPARK-30239:
---

 Summary: [Python] Creating a dataframe with Pandas rather than 
Numpy datatypes fails
 Key: SPARK-30239
 URL: https://issues.apache.org/jira/browse/SPARK-30239
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.3
 Environment: DataBricks: 48.00 GB | 24 Cores | DBR 6.0 | Spark 2.4.3 | 
Scala 2.11
Reporter: Philip Kahn


It's possible to work with DataFrames in Pandas and shuffle them back over to 
Spark dataframes for processing; however, using Pandas extension datatypes like 
{{Int64}} ( 
[https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html] ) 
throws an error (that long / float can't be converted).

This is internally because {{np.nan}} is a float, and {{pd.Int64Dtype()}} 
allows only integers except for the single float value {{np.nan}}.

 

The current workaround is to keep the columns as floats and, after conversion 
to the Spark DataFrame, to recast the column as {{LongType()}}. For example:

 

{{sdfC = spark.createDataFrame(kgridCLinked)}}

{{sdfC = sdfC.withColumn("gridID", sdfC["gridID"].cast(LongType()))}}

 

However, this is awkward and redundant.
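A self-contained sketch of that workaround (with made-up data; assumes an active SparkSession named `spark`):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

# The nullable Int64 column would fail in createDataFrame, so
# round-trip it through float (where NaN is legal) and recast on
# the Spark side.
pdf = pd.DataFrame({"gridID": pd.array([1, 2, None], dtype="Int64")})
pdf["gridID"] = pdf["gridID"].astype(float)

sdf = spark.createDataFrame(pdf)
sdf = sdf.withColumn("gridID", sdf["gridID"].cast(LongType()))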


