Max Gekk created SPARK-57032:
--------------------------------
Summary: Extend timestamp string parsing for nanosecond fractional
precision (p in 7–9)
Key: SPARK-57032
URL: https://issues.apache.org/jira/browse/SPARK-57032
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.2.0
Reporter: Max Gekk
h2. Summary
Extend Spark's existing timestamp string parsing to preserve fractional seconds
beyond microsecond precision and produce normalized nanosecond-capable internal
values {{(epochMicros, nanosWithinMicro)}} for {{TimestampNTZNanosType(p)}} and
{{TimestampLTZNanosType(p)}} where *p* is in \[7, 9\].
This is the first sub-task of the nanosecond datetime conversion utilities work
under [SPARK-56822|https://issues.apache.org/jira/browse/SPARK-56822] (SPIP:
Timestamps with nanosecond precision). It must *extend* current parsers, not
duplicate them.
h2. Background
Today Spark parses timestamp strings via
{{SparkDateTimeUtils.parseTimestampString}} and downstream helpers
({{stringToTimestamp}}, {{stringToTimestampWithoutTimeZone}},
{{TimestampFormatter}}). The fractional-second segment is stored in
{{segments(6)}} but:
* Input with *more than 6* fractional digits is *truncated* to microseconds
(comment at line 623: "loss of precision").
* Results are returned as a single *microsecond* {{Long}}.
Logical types {{TimestampNTZNanosType}} / {{TimestampLTZNanosType}} and
physical values {{TimestampNTZNanos}} / {{TimestampLTZNanos}} already exist
([SPARK-56876|https://issues.apache.org/jira/browse/SPARK-56876], [PR #56059 /
SPARK-56981|https://github.com/apache/spark/pull/56059]). SQL type syntax
{{TIMESTAMP_NTZ(p)}} / {{TIMESTAMP_LTZ(p)}} is parsed
([SPARK-56965|https://github.com/apache/spark/commit/4bbf75e9e672ccbcf762f5c7258be501b0ea7f5a]).
Without this change, string inputs with 7-9 fractional digits cannot be
converted to the SPIP composite representation.
Per SPIP, internal layout is:
* *epochMicros* -- signed epoch microseconds (same grid as existing timestamps)
* *nanosWithinMicro* -- {{short}} in *\[0, 999\]* (sub-micro remainder,
normalized)
h2. Scope
# *Extend fractional-second handling in {{parseTimestampString}} (or a shared
helper it calls)*
#* Accept *1-9* fractional digits (maintain backward-compatible behavior for
<=6 digits used by micro types).
#* For nanos parsing APIs: derive {{epochMicros}} + {{nanosWithinMicro}}
without loss for digits 7-9.
#* Apply *precision p* rules: digits beyond *p* truncate or round per SPIP
(document choice; align with future cast behavior).
#* Preserve existing accepted formats (ISO-8601 variants, space/{{T}}
separator, optional zone suffix) documented on {{parseTimestampString}}.
# *Add new package-private parse entry points*, e.g.:
#* {{stringToTimestampNTZNanos(s: UTF8String, precision: Int):
Option[TimestampNTZNanos]}}
#* {{stringToTimestampLTZNanos(s: UTF8String, precision: Int, timeZoneId:
ZoneId): Option[TimestampLTZNanos]}}
#* ANSI variants that throw on invalid input (mirror {{stringToTimestampAnsi}}).
# *Normalization invariant:* output always satisfies {{nanosWithinMicro}} in
\[0, 999\]; carry into {{epochMicros}} when needed.
# *Tests* (new suite, e.g. {{TimestampNanosParseSuite}} in {{sql/catalyst}}):
#* 7-, 8-, and 9-digit fractions at each precision *p*
#* Edge cases: {{.0}}, {{.999999999}}, trailing zeros, exactly 6 digits
(micro-compatible path)
#* NTZ vs LTZ (session / explicit zone)
#* Invalid: 10+ fractional digits, fractional part out of range after
normalization
#* Regression: existing {{TimestampFormatterSuite}} / micro parse tests
unchanged
h2. Implementation notes
* Prefer extending {{SparkDateTimeUtils.parseTimestampString}} rather than a
second parser; micro paths should keep current behavior (6-digit cap + micro
{{Long}}).
* Reuse {{TimestampNTZNanos}} validation ({{nanosWithinMicro}} in \[0, 999\]).
* Consider using {{java.time}} ({{LocalDateTime.of(..., nanoOfSecond)}})
internally for date/time + fraction assembly, then decompose into
{{(epochMicros, nanosWithinMicro)}} -- consistent with existing
{{stringToTimestamp}} structure.
* Do *not* change behavior of existing {{TimestampType}} / {{TimestampNTZType}}
string parsing.
h2. Acceptance criteria
* New parse APIs return normalized {{TimestampNTZNanos}} /
{{TimestampLTZNanos}} for valid strings with up to 9 fractional digits.
* Precision *p* in {7, 8, 9} enforced on excess digits per documented
truncate/round rule.
* Existing microsecond parse tests pass without modification.
* New unit tests cover NTZ/LTZ, time-zone suffixes, and edge-case corpus
(epoch, 1582 cutover, 9999 end range) with sub-micro fractions.
h2. Dependencies
* *Requires:* [SPARK-56981|https://github.com/apache/spark/pull/56059] merged
(physical value types).
* *Blocks:* string-to-nanos cast matrix, typed SQL literals with sub-micro
values, Parquet/string ingest tests, {{RandomDataGenerator}} {{specialTs}}
corpus.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]