Max Gekk created SPARK-57033:
--------------------------------
Summary: Add java.time LocalDateTime/Instant conversion and
Dataset roundtrip for nanosecond timestamps
Key: SPARK-57033
URL: https://issues.apache.org/jira/browse/SPARK-57033
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.2.0
Reporter: Max Gekk
h2. Summary
Add conversion between {{java.time.LocalDateTime}} / {{java.time.Instant}} and
the SPIP composite internal representation {{(epochMicros, nanosWithinMicro)}},
and wire Spark's encoder/converter stack so *Dataset* create/collect roundtrips
preserve *nanosecond* precision end-to-end.
h2. Background
Microsecond timestamps already roundtrip through:
* {{CatalystTypeConverters}} -- {{Instant}} ↔ micro {{Long}}
({{TimestampType}}), {{LocalDateTime}} ↔ micro {{Long}} ({{TimestampNTZType}})
* {{ExpressionEncoder}} / {{RowEncoderSuite}} -- encode-decode tests for
{{Instant}} and {{LocalDateTime}}
* {{Encoders.INSTANT()}}, {{Encoders.LOCALDATETIME()}},
{{JavaDatasetSuite.testLocalDateTimeEncoder}} -- Dataset create/collect
Nanosecond-capable types store values as {{TimestampNTZNanos}} /
{{TimestampLTZNanos}} (composite layout from
[SPARK-56981|https://github.com/apache/spark/pull/56059]). There is *no*
{{CatalystTypeConverters}} path yet for {{TimestampNTZNanosType(p)}} /
{{TimestampLTZNanosType(p)}}, and no end-to-end Dataset test proving sub-micro
precision survives create → internal row → collect.
Sub-task *1a* (string parsing) is a separate ticket; this ticket uses
{{java.time}} objects as the primary input/output surface for conversion and
Dataset tests.
h2. Scope
h3. 1. Composite conversion helpers
Add package-private helpers (extend {{SparkDateTimeUtils}} / {{DateTimeUtils}}
or a small dedicated module), e.g.:
* {{localDateTimeToTimestampNTZNanos(ldt: LocalDateTime): TimestampNTZNanos}}
* {{timestampNTZNanosToLocalDateTime(v: TimestampNTZNanos): LocalDateTime}}
* {{instantToTimestampLTZNanos(instant: Instant): TimestampLTZNanos}} (absolute
instant → SPIP pair; LTZ semantics via zone rules where applicable)
* {{timestampLTZNanosToInstant(v: TimestampLTZNanos): Instant}}
*Normalization invariant:* {{nanosWithinMicro}} always in \[0, 999\]; use
{{Math.addExact}} for carry from sub-micro remainder into {{epochMicros}}.
*Precision:* helpers produce full nanosecond resolution; truncation to declared
type precision *p* ∈ \[7, 9\] is applied at cast/schema boundaries (future cast
ticket), not silently in the base conversion unless documented otherwise.
h3. 2. CatalystTypeConverters
Register converters for nanos logical types (mirror {{TimestampNTZConverter}} /
{{InstantConverter}}):
* {{TimestampNTZNanosType(_)}} + {{LocalDateTime}} (NTZ wall-clock)
* {{TimestampLTZNanosType(_)}} + {{Instant}} when
{{spark.sql.datetime.java8API.enabled=true}} (LTZ instant timeline)
Wire into {{createToCatalystConverter}}, {{createToScalaConverter}}, and
{{convertToCatalyst}} special cases for {{LocalDateTime}} / {{Instant}} when
the target schema column is a nanos timestamp type.
h3. 3. Encoder / deserializer plumbing
Ensure {{ExpressionEncoder}} over a {{StructType}} with
{{TimestampNTZNanosType(p)}} / {{TimestampLTZNanosType(p)}} columns can
serialize and deserialize {{Row}} values holding {{LocalDateTime}} /
{{Instant}}:
* {{SerializerBuildHelper}} / {{DeserializerBuildHelper}} as needed (follow
{{LocalDateTimeEncoder}} / {{InstantEncoder}} patterns for micro types)
* {{Row}} / {{GenericInternalRow}} paths use {{getTimestampNTZNanos}} /
{{setTimestampNTZNanos}} (physical accessors from SPARK-56981)
h3. 4. End-to-end Dataset tests (required)
Add integration tests that *create Datasets* from {{java.time}} values with
*nanosecond* fractional parts and {{collectAsList()}} / {{collect()}} back with
*exact* nanosecond equality:
* *Scala* ({{sql/core}}): {{Dataset[Row]}} or case-class rows with schema
{{TimestampNTZNanosType(9)}} / {{TimestampLTZNanosType(9)}}; values like
{{LocalDateTime.of(2019, 2, 26, 16, 56, 0, 123456789)}} and
{{Instant.parse("2019-02-26T16:56:00.123456789Z")}}
* *Java* ({{JavaDatasetSuite}} or new suite): same roundtrip via
{{spark.createDataset(..., encoder)}}
* Assert sub-micro digits preserved (not truncated to micros like today's
{{TimestampNTZType}} path)
* Include null column, multiple rows, and at least one edge instant from the
datetime corpus (epoch, pre-1900, max range)
*Unit tests* (catalyst, faster feedback):
* {{CatalystTypeConvertersSuite}} -- {{LocalDateTime}} / {{Instant}} ↔
{{TimestampNTZNanos}} / {{TimestampLTZNanos}} roundtrip
* {{RowEncoderSuite}} -- encode/decode rows with nanos timestamp columns
(mirror existing SPARK-35664 micro tests)
h2. Implementation notes
* Reuse {{LocalDateTime.getNano()}} / {{Instant.getNano()}} decomposition:
whole seconds → {{epochMicros}} grid, nano-of-second → {{nanosWithinMicro}} +
micro carry.
* Do *not* route through micro {{Long}} internally (that would lose sub-micro
digits); convert directly between {{java.time}} and {{TimestampNTZNanos}} /
{{TimestampLTZNanos}}.
* Existing {{TimestampNTZType}} / {{TimestampType}} converter behavior must
remain unchanged.
* Follow {{TimeType}} precedent: nanosecond precision preserved in catalyst for
{{LocalTime}}; same expectation for nanos timestamps.
h2. Acceptance criteria
* Conversion helpers roundtrip {{java.time}} ↔ composite values without
sub-micro loss for valid instants in supported range.
* {{CatalystTypeConvertersSuite}} and {{RowEncoderSuite}} pass with new
nanosecond cases.
* End-to-end Dataset tests (Scala + Java) create from {{LocalDateTime}} /
{{Instant}} with nanosecond fractions and collect back with {{equals}} match on
full nanosecond field.
* No regression in existing micro timestamp encoder/converter tests.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]