Yicong Huang created SPARK-55350:
------------------------------------
Summary: Convert from pandas to arrow loses row count when schema
has 0 columns
Key: SPARK-55350
URL: https://issues.apache.org/jira/browse/SPARK-55350
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
When creating an Arrow RecordBatch with 0 columns, the row count is lost due to
a PyArrow limitation.
{code:python}
import pyarrow as pa
# Creating batch with 0 columns loses row count
batch = pa.RecordBatch.from_arrays([], [])
print(batch.num_rows) # Always 0, regardless of input data
{code}
This affects pandas UDF serializers when the return type is an empty struct.
The row count information is lost during serialization.
In `ArrowStreamPandasSerializer.load_stream`, there is code to handle 0-column
batches:
{code:python}
if batch.num_columns == 0:
    yield [pd.Series([pyspark._NoValue] * batch.num_rows)]
{code}
However, this does not help: `batch.num_rows` is already 0 by the time the batch
has been created, so the fallback yields an empty Series instead of one entry per
input row.