Max Gekk created SPARK-57184:
--------------------------------
Summary: Null struct corrupts nested CalendarInterval column values
Key: SPARK-57184
URL: https://issues.apache.org/jira/browse/SPARK-57184
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.3.0
Reporter: Max Gekk
Assignee: Max Gekk

SPARK-56981 / the nanosecond-timestamp column-vector work surfaced a latent
bug in WritableColumnVector.appendStruct(boolean isNull).

When a struct column is appended as NULL via appendStruct(true), the method
recurses into child columns that are themselves struct-shaped (StructType,
VariantType) so that their grandchild cursors stay aligned. A CalendarInterval
child column is also struct-shaped: it is backed by three grandchild primitive
columns (months as int, days as int, microseconds as long). However, the
recursion guard did not include CalendarIntervalType, so an interval child took
the else branch (c.appendNull()), which advances only the interval column's own
cursor and leaves its three grandchild columns un-advanced.

As a result, for a struct column with a CalendarInterval field, appending a
NULL parent row leaves the interval's grandchild cursors behind by one. A
subsequent non-null row then writes its months/days/microseconds into the wrong
(earlier) grandchild slots, and reading that row back returns a skewed/garbage
interval value. This is silent data corruption for the nested
struct-of-interval case.

Fix: include CalendarIntervalType in the recursion guard in appendStruct so
that a null parent struct cascades appendStruct(true) into the interval child,
advancing all three grandchild cursors.
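
The cursor misalignment and the effect of widening the guard can be
illustrated with a small toy model. This is only a sketch under stated
assumptions: ToyColumn, guardIncludesInterval, and readRowAfterNullParent are
hypothetical names invented for illustration, null bitmaps are omitted, and
none of this is the actual WritableColumnVector code.

```java
import java.util.Arrays;

// Toy model only: every name here is hypothetical, not Spark's implementation.
class AppendStructSketch {

    static final class ToyColumn {
        final boolean isInterval;          // true for the interval-shaped column
        final ToyColumn[] children;        // empty for primitive leaf columns
        final long[] slots = new long[4];  // leaf values, indexed by row
        int cursor = 0;                    // next row this column appends to

        ToyColumn(boolean isInterval, ToyColumn... children) {
            this.isInterval = isInterval;
            this.children = children;
        }

        // Advances only this column's cursor; children are untouched.
        void appendNull() { cursor++; }

        void appendValue(long v) { slots[cursor++] = v; }

        // A null row cascades into struct-shaped children; any other child
        // gets a plain appendNull(). The flag toggles whether the guard
        // treats the interval column as struct-shaped.
        void appendStruct(boolean isNull, boolean guardIncludesInterval) {
            cursor++;
            if (!isNull) return;  // caller appends the child values itself
            for (ToyColumn c : children) {
                boolean structShaped = c.children.length > 0;
                boolean recurse =
                    structShaped && (!c.isInterval || guardIncludesInterval);
                if (recurse) c.appendStruct(true, guardIncludesInterval);
                else c.appendNull();
            }
        }
    }

    // Appends a NULL parent row, then a parent row holding interval (1, 2, 3),
    // and returns what row 1 of the three grandchild columns contains.
    static long[] readRowAfterNullParent(boolean guardIncludesInterval) {
        ToyColumn months = new ToyColumn(false);
        ToyColumn days = new ToyColumn(false);
        ToyColumn micros = new ToyColumn(false);
        ToyColumn interval = new ToyColumn(true, months, days, micros);
        ToyColumn parent = new ToyColumn(false, interval);

        parent.appendStruct(true, guardIncludesInterval);    // row 0: NULL parent

        parent.appendStruct(false, guardIncludesInterval);   // row 1: non-null
        interval.appendStruct(false, guardIncludesInterval);
        months.appendValue(1);
        days.appendValue(2);
        micros.appendValue(3);

        return new long[] { months.slots[1], days.slots[1], micros.slots[1] };
    }

    public static void main(String[] args) {
        // Guard without the interval type: values landed in slot 0, row 1 is empty.
        System.out.println(Arrays.toString(readRowAfterNullParent(false))); // [0, 0, 0]
        // Guard including the interval type: row 1 reads back what was written.
        System.out.println(Arrays.toString(readRowAfterNullParent(true)));  // [1, 2, 3]
    }
}
```

In the toy model the only difference between the two runs is whether the
guard recurses into the interval child, which mirrors the one-line nature of
the fix described above.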

This was split out of the nanosecond-timestamp ColumnVector PR (SPARK-57100)
per review, since it is an independent fix.