[ https://issues.apache.org/jira/browse/SPARK-30239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-30239.
----------------------------------
    Resolution: Incomplete

Resolving due to no feedback from the reporter.

> Creating a dataframe with Pandas rather than Numpy datatypes fails
> ------------------------------------------------------------------
>
>                 Key: SPARK-30239
>                 URL: https://issues.apache.org/jira/browse/SPARK-30239
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>         Environment: DataBricks: 48.00 GB | 24 Cores | DBR 6.0 | Spark 2.4.3 | Scala 2.11
>            Reporter: Philip Kahn
>            Priority: Minor
>
> It's possible to work with DataFrames in Pandas and shuffle them back over to
> Spark dataframes for processing; however, using Pandas extended datatypes
> like {{Int64}} (
> [https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html] )
> throws an error (that long / float can't be converted).
> Internally, this is because {{np.nan}} is a float, and {{pd.Int64Dtype()}}
> allows only integers except for the single float value {{np.nan}}.
>
> The current workaround for this is to use the columns as floats and, after
> conversion to the Spark DataFrame, to recast the column as {{LongType()}}.
> For example:
>
> {{sdfC = spark.createDataFrame(kgridCLinked)}}
> {{sdfC = sdfC.withColumn("gridID", sdfC["gridID"].cast(LongType()))}}
>
> However, this is awkward and redundant.
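
For readers who run into the same error, the workaround described in the issue might be sketched roughly as follows. This is only an illustration under stated assumptions, not a fix from the Spark project: the helper names {{nullable_ints_to_float}} and {{to_spark_with_long}} are hypothetical, {{spark}} is assumed to be an existing SparkSession, and the NaN-to-null guard is an addition that does not appear in the original report.

```python
import pandas as pd


def nullable_ints_to_float(pdf: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which pandas nullable-integer columns (e.g. Int64)
    are converted to plain float64, so missing values become NaN."""
    out = pdf.copy()
    for name in out.columns:
        dtype = out[name].dtype
        is_extension = isinstance(dtype, pd.api.extensions.ExtensionDtype)
        if is_extension and pd.api.types.is_integer_dtype(dtype):
            out[name] = out[name].astype("float64")
    return out


def to_spark_with_long(spark, pdf: pd.DataFrame, column: str):
    """Hypothetical helper following the workaround in the issue:
    send the column over as float, then cast it back to LongType."""
    # Imported lazily so the pandas-only helper above works without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType

    sdf = spark.createDataFrame(nullable_ints_to_float(pdf))
    # Assumption (not in the report): map NaN to null before casting,
    # so missing values do not end up as ordinary integers.
    as_null = F.when(F.isnan(sdf[column]), F.lit(None)).otherwise(sdf[column])
    return sdf.withColumn(column, as_null.cast(LongType()))


# pandas-only demonstration; the column name follows the issue's example.
example = pd.DataFrame({"gridID": pd.array([1, 2, None], dtype="Int64")})
converted = nullable_ints_to_float(example)
```

With a live session this would be called as {{to_spark_with_long(spark, kgridCLinked, "gridID")}}. Note that routing integers through float64 can lose precision for values beyond 2**53, which is one more reason the workaround is awkward.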