Yicong Huang created SPARK-55350:
------------------------------------

             Summary: Convert from pandas to arrow loses row count when schema has 0 columns
                 Key: SPARK-55350
                 URL: https://issues.apache.org/jira/browse/SPARK-55350
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Yicong Huang


When an Arrow RecordBatch is created with 0 columns, PyArrow has no array to 
infer the length from, so the batch reports 0 rows and the original row count 
is lost.

{code:python}
import pyarrow as pa

# Creating a batch from zero arrays loses the row count:
# there is no array whose length could be used.
batch = pa.RecordBatch.from_arrays([], names=[])
print(batch.num_rows)  # 0, regardless of how many rows the input data had
{code}
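
The mechanism can be illustrated without PyArrow. The following is a minimal toy model, not PyArrow's implementation: `ToyBatch` and `batch_from_columns` are hypothetical names introduced here, and the explicit `num_rows` argument is only an assumption about what a fix might need to carry.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ToyBatch:
    """Hypothetical stand-in for a record batch; not a PyArrow class."""
    columns: List[Sequence]
    num_rows: int


def batch_from_columns(columns: List[Sequence],
                       num_rows: Optional[int] = None) -> ToyBatch:
    """Build a batch, inferring the row count from the first column if not given."""
    if num_rows is None:
        # With no columns there is nothing to measure, so the count falls back to 0.
        num_rows = len(columns[0]) if columns else 0
    for col in columns:
        if len(col) != num_rows:
            raise ValueError("all columns must have num_rows elements")
    return ToyBatch(list(columns), num_rows)


inferred = batch_from_columns([])              # row count is lost
explicit = batch_from_columns([], num_rows=3)  # row count is carried separately
print(inferred.num_rows, explicit.num_rows)    # prints: 0 3
```

In this sketch the count survives only when it is passed alongside the (empty) column list, which is the information the 0-column case is missing.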

This affects pandas UDF serializers when the return type is an empty struct. 
The row count information is lost during serialization.

In `ArrowStreamPandasSerializer.load_stream`, there is code to handle 0-column 
batches:

{code:python}
if batch.num_columns == 0:
    yield [pd.Series([pyspark._NoValue] * batch.num_rows)]
{code}

However, this does not help: `batch.num_rows` is already 0 by the time the 
batch reaches this branch, because the row count was lost when the batch was 
created during serialization.
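
The end-to-end effect can be sketched as a toy round trip. This is an illustrative model only, not the PySpark serializer: `write_batch`, `read_batch`, and `NO_VALUE` are hypothetical names, with `NO_VALUE` standing in for `pyspark._NoValue`.

```python
from typing import List, Sequence

NO_VALUE = object()  # hypothetical sentinel, standing in for pyspark._NoValue


def write_batch(columns: List[Sequence]) -> dict:
    """Toy writer: records only what can be inferred from the columns."""
    return {
        "num_columns": len(columns),
        "num_rows": len(columns[0]) if columns else 0,
    }


def read_batch(batch: dict) -> list:
    """Toy reader mirroring the 0-column branch: one placeholder per reported row."""
    if batch["num_columns"] == 0:
        return [NO_VALUE] * batch["num_rows"]
    raise NotImplementedError("only the 0-column case is modelled here")


udf_result_rows = 3                     # the UDF returned three empty structs
restored = read_batch(write_batch([]))  # zero columns are written
print(len(restored), "of", udf_result_rows, "rows restored")
```

The reader-side branch is correct for whatever row count it is given; in this model the loss happens entirely in the writer, so the branch receives 0 and produces no placeholders.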





