Nic Crane created SPARK-35814:
---------------------------------

             Summary: Mismatched types when creating a new column when using 
Arrow
                 Key: SPARK-35814
                 URL: https://issues.apache.org/jira/browse/SPARK-35814
             Project: Spark
          Issue Type: Bug
          Components: R
    Affects Versions: 3.1.2
            Reporter: Nic Crane


I was looking into a question on StackOverflow where a user had an error due to 
a mismatch of field types  
(https://stackoverflow.com/questions/68032084/how-to-fix-error-in-readbin-using-arrow-package-with-sparkr/).
  I've made a reproducible example below.

 
{code:java}
library(SparkR)

library(arrow)

SparkR::sparkR.session(sparkConfig = 
list(spark.sql.execution.arrow.sparkr.enabled = "true"))

readr::write_csv(iris, "iris.csv")dfSchema <- structType(
  structField("Sepal.Length", "int"),
  structField("Sepal.Width", "double"),
  structField("Petal.Length", "double"),
  structField("Petal.Width", "double"),
  structField("Species", "string")
)

spark_df <- SparkR::read.df(path="iris.csv", source="csv", schema=dfSchema)

# Apply an R native function to each partition.
returnSchema <-  structType(structField("Sep_add_one", "int"))

df <- SparkR::dapply(spark_df, function(rdf){
  data.frame(rdf$Sepal.Length + 1)
  }, returnSchema
)

collect(df)
{code}
It results in this error:
{code:java}
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : 
  invalid 'n' argument
{code}
I've never used SparkR before, so could be wrong, but I *think* it may be due 
to the use of this line in DataFrame.R:
{code:java}
arrowTable <- arrow::read_ipc_stream(readRaw(conn))
{code}
readRaw assumes that the the value is an integer; however when returnSchema is 
being created, the code creates a double (the code runs without error if it's 
updated to `rdf$Sepal.Length + 1L`)

I wasn't sure if perhaps readTypedObject needs to be used or if maybe some type 
checking might be useful?

Apologies for the partial explanation - not entirely sure of the cause.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to