Yicong Huang created SPARK-55056:
------------------------------------

             Summary: toPandas() crashes with SIGSEGV on nested empty arrays
                 Key: SPARK-55056
                 URL: https://issues.apache.org/jira/browse/SPARK-55056
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Yicong Huang


{{toPandas()}} crashes with SIGSEGV when a DataFrame contains nested array 
types (depth >= 3) with an empty outer array.

{code:python}
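from pyspark.sql import Row
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Assumes an active SparkSession named `spark`.
schema = StructType([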
    StructField("data", ArrayType(StructType([
        StructField("arr", ArrayType(StructType([
            StructField("inner", ArrayType(StringType()))
        ])))
    ])))
])
df = spark.createDataFrame([Row(data=[])], schema=schema)
df.toPandas()  # SIGSEGV
{code}

The Arrow format requires a ListArray's offset buffer to have N+1 entries. Even when 
N=0, the buffer must contain {{\[0\]}}. When the outer array is empty, the nested 
{{ArrayWriter}}s are never invoked, so their {{count}} stays 0. 
{{getBufferSizeFor(0)}} then returns 0, and the offset buffer is omitted during IPC 
serialization, which violates the Arrow spec.
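
To make the N+1 rule concrete, here is a minimal pure-Python sketch of how list offsets are laid out. It does not use Arrow itself. The helper names ({{list_offsets}}, {{offsets_buffer}}, {{buggy_buffer_size}}, {{required_buffer_size}}) are hypothetical, and {{buggy_buffer_size}} only models the "returns 0 for a count of 0" behavior described above, assuming 4-byte int32 offsets.

```python
import struct

OFFSET_WIDTH = 4  # assumed: bytes per int32 offset in a regular (non-large) list


def list_offsets(lists):
    """Return the N+1 offsets that describe N variable-length lists."""
    offsets = [0]
    for values in lists:
        offsets.append(offsets[-1] + len(values))
    return offsets


def offsets_buffer(lists):
    """Pack the offsets as little-endian int32 values."""
    offsets = list_offsets(lists)
    return struct.pack(f"<{len(offsets)}i", *offsets)


def buggy_buffer_size(value_count):
    """Hypothetical model of the size computation described in this issue."""
    if value_count == 0:
        return 0  # the offset buffer ends up omitted
    return (value_count + 1) * OFFSET_WIDTH


def required_buffer_size(value_count):
    """Size implied by the N+1 rule, including the N=0 case."""
    return (value_count + 1) * OFFSET_WIDTH
```

With these helpers, {{list_offsets([])}} yields a single offset of 0, so the packed buffer is 4 bytes long, whereas {{buggy_buffer_size(0)}} reports 0 bytes. That mismatch is the gap a reader of the IPC stream would run into.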

{code:scala}
// ArrayWriter.scala - current behavior
override def setValue(...): Unit = {
  while (i < array.numElements()) {  // never runs when empty
    elementWriter.write(array, i)    // nested writer never called
    i += 1
  }
}
{code}
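
The effect on the nested writers can be illustrated with a toy Python model. This is a simplified sketch, not Spark's implementation: the classes below are hypothetical stand-ins, the intermediate struct levels from the schema are left out, and {{count}} is modeled as "number of values this writer has been asked to write".

```python
class LeafWriter:
    """Stand-in for a writer of primitive values (e.g. strings)."""

    def __init__(self):
        self.count = 0

    def write(self, value):
        self.count += 1


class ArrayWriter:
    """Toy model: writes one array value by delegating each element."""

    def __init__(self, element_writer):
        self.element_writer = element_writer
        self.count = 0

    def write(self, array):
        for element in array:  # never runs when the array is empty
            self.element_writer.write(element)
        self.count += 1


# Three levels of array nesting, mirroring the depth in the repro.
inner = ArrayWriter(LeafWriter())
middle = ArrayWriter(inner)
outer = ArrayWriter(middle)

outer.write([])  # one row whose outer array is empty
```

After the single {{outer.write([])}} call, {{outer.count}} is 1 while {{middle.count}} and {{inner.count}} are still 0, which is the state in which the nested offset buffers are sized as empty.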



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
