Max Gekk created SPARK-57032:
--------------------------------

             Summary: Extend timestamp string parsing for nanosecond fractional 
precision (p in 7–9)
                 Key: SPARK-57032
                 URL: https://issues.apache.org/jira/browse/SPARK-57032
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.2.0
            Reporter: Max Gekk


h2. Summary

Extend Spark's existing timestamp string parsing to preserve fractional seconds 
beyond microsecond precision and produce normalized nanosecond-capable internal 
values {{(epochMicros, nanosWithinMicro)}} for {{TimestampNTZNanosType(p)}} and 
{{TimestampLTZNanosType(p)}} where *p* is in \[7, 9\].

This is the first sub-task of the nanosecond datetime conversion utilities work 
under [SPARK-56822|https://issues.apache.org/jira/browse/SPARK-56822] (SPIP: 
Timestamps with nanosecond precision). It must *extend* current parsers, not 
duplicate them.

h2. Background

Today Spark parses timestamp strings via 
{{SparkDateTimeUtils.parseTimestampString}} and downstream helpers 
({{stringToTimestamp}}, {{stringToTimestampWithoutTimeZone}}, 
{{TimestampFormatter}}). The fractional-second segment is stored in 
{{segments(6)}} but:

* Input with *more than 6* fractional digits is *truncated* to microseconds 
(comment at line 623: "loss of precision").
* Results are returned as a single *microsecond* {{Long}}.

Logical types {{TimestampNTZNanosType}} / {{TimestampLTZNanosType}} and 
physical values {{TimestampNTZNanos}} / {{TimestampLTZNanos}} already exist 
([SPARK-56876|https://issues.apache.org/jira/browse/SPARK-56876], [PR #56059 / 
SPARK-56981|https://github.com/apache/spark/pull/56059]). SQL type syntax 
{{TIMESTAMP_NTZ(p)}} / {{TIMESTAMP_LTZ(p)}} is parsed 
([SPARK-56965|https://github.com/apache/spark/commit/4bbf75e9e672ccbcf762f5c7258be501b0ea7f5a]).
Without this change, string inputs with 7-9 fractional digits cannot be 
converted to the SPIP composite representation.

Per SPIP, internal layout is:

* *epochMicros* -- signed epoch microseconds (same grid as existing timestamps)
* *nanosWithinMicro* -- {{short}} in *\[0, 999\]* (sub-micro remainder, 
normalized)
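
The layout above can be illustrated with a minimal sketch, written in Python for brevity (the real utilities would live in Spark's Scala code). The function name {{split_epoch_nanos}} is hypothetical and not part of any Spark API. It assumes the input is a signed count of nanoseconds since the epoch, and it uses floor division so that pre-epoch values still yield a remainder in \[0, 999\].

```python
from typing import Tuple


def split_epoch_nanos(epoch_nanos: int) -> Tuple[int, int]:
    """Split signed epoch nanoseconds into (epochMicros, nanosWithinMicro).

    Hypothetical helper for illustration only.
    """
    # divmod floors toward negative infinity, so the remainder is always
    # in [0, 999], even when epoch_nanos is negative.
    epoch_micros, nanos_within_micro = divmod(epoch_nanos, 1000)
    return epoch_micros, nanos_within_micro


print(split_epoch_nanos(1_234_567))  # (1234, 567)
print(split_epoch_nanos(-1))         # (-1, 999)
```

On the JVM, {{Math.floorDiv}} and {{Math.floorMod}} give the same flooring behavior; plain {{/}} and {{%}} truncate toward zero and would produce a negative remainder for pre-epoch values.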

h2. Scope

# *Extend fractional-second handling in {{parseTimestampString}} (or a shared 
helper it calls)*
#* Accept *1-9* fractional digits (maintain backward-compatible behavior for 
<=6 digits used by micro types).
#* For nanos parsing APIs: derive {{epochMicros}} + {{nanosWithinMicro}} 
without loss for digits 7-9.
#* Apply *precision p* rules: digits beyond *p* are truncated or rounded per the 
SPIP (document the chosen rule and align it with future cast behavior).
#* Preserve existing accepted formats (ISO-8601 variants, space/{{T}} 
separator, optional zone suffix) documented on {{parseTimestampString}}.

# *Add new package-private parse entry points*, e.g.:
#* {{stringToTimestampNTZNanos(s: UTF8String, precision: Int): 
Option[TimestampNTZNanos]}}
#* {{stringToTimestampLTZNanos(s: UTF8String, precision: Int, timeZoneId: 
ZoneId): Option[TimestampLTZNanos]}}
#* ANSI variants that throw on invalid input (mirror {{stringToTimestampAnsi}}).

# *Normalization invariant:* output always satisfies {{nanosWithinMicro}} in 
\[0, 999\]; carry into {{epochMicros}} when needed.

# *Tests* (new suite, e.g. {{TimestampNanosParseSuite}} in {{sql/catalyst}}):
#* 7-, 8-, and 9-digit fractions at each precision *p*
#* Edge cases: {{.0}}, {{.999999999}}, trailing zeros, exactly 6 digits 
(micro-compatible path)
#* NTZ vs LTZ (session / explicit zone)
#* Invalid: 10+ fractional digits, fractional part out of range after 
normalization
#* Regression: existing {{TimestampFormatterSuite}} / micro parse tests 
unchanged
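
The fractional-digit handling, precision rule, and normalization carry described in this list could look roughly like the sketch below. It is written in Python for brevity and is not the proposed implementation. The name {{parse_fraction}}, its return shape, and the {{round_half_up}} flag are all assumptions made for illustration. The issue leaves the truncate-versus-round choice open, so both are shown, with truncation as the default.

```python
from typing import Optional, Tuple

NANOS_PER_SECOND = 1_000_000_000


def parse_fraction(
    digits: str, precision: int, round_half_up: bool = False
) -> Optional[Tuple[int, int, int]]:
    """Map fractional-second digits to (carrySeconds, microsOfSecond, nanosWithinMicro).

    Hypothetical helper. Returns None for input this sketch treats as invalid.
    """
    if not 7 <= precision <= 9:
        return None
    # Accept 1-9 ASCII digits; 10 or more digits are rejected.
    if not 1 <= len(digits) <= 9 or any(c not in "0123456789" for c in digits):
        return None

    # Right-pad to 9 digits so the value is expressed in nanoseconds.
    nanos = int(digits.ljust(9, "0"))

    # Drop everything finer than the requested precision.
    unit = 10 ** (9 - precision)
    dropped = nanos % unit
    kept = nanos - dropped
    if round_half_up and dropped * 2 >= unit:
        kept += unit  # may reach a full second, handled by the carry below

    # Normalize: carry whole seconds out, then split micros from sub-micro nanos.
    carry_seconds, nanos_of_second = divmod(kept, NANOS_PER_SECOND)
    micros_of_second, nanos_within_micro = divmod(nanos_of_second, 1000)
    return carry_seconds, micros_of_second, nanos_within_micro


print(parse_fraction("123456789", 9))        # (0, 123456, 789)
print(parse_fraction("123456789", 7))        # (0, 123456, 700)
print(parse_fraction("999999995", 8, True))  # (1, 0, 0)
print(parse_fraction("1234567890", 9))       # None
```

Under truncation the carry is always zero. It only becomes non-zero if rounding is chosen, which is the case where the normalization invariant has to push a whole second into {{epochMicros}}.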

h2. Implementation notes

* Prefer extending {{SparkDateTimeUtils.parseTimestampString}} rather than a 
second parser; micro paths should keep current behavior (6-digit cap + micro 
{{Long}}).
* Reuse {{TimestampNTZNanos}} validation ({{nanosWithinMicro}} in \[0, 999\]).
* Consider using {{java.time}} ({{LocalDateTime.of(..., nanoOfSecond)}}) 
internally for date/time + fraction assembly, then decompose into 
{{(epochMicros, nanosWithinMicro)}} -- consistent with existing 
{{stringToTimestamp}} structure.
* Do *not* change behavior of existing {{TimestampType}} / {{TimestampNTZType}} 
string parsing.
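
The assemble-then-decompose idea from the {{java.time}} note can be sketched as follows for the NTZ case. This is a Python illustration under stated assumptions, not Spark code: {{ntz_to_parts}} is a hypothetical name, the local date-time is interpreted as if it were UTC, and the nanosecond field is carried separately because Python's {{datetime}} only stores microseconds. Like {{java.time}}, Python's {{datetime}} uses the proleptic Gregorian calendar.

```python
from datetime import datetime, timedelta
from typing import Tuple

EPOCH = datetime(1970, 1, 1)


def ntz_to_parts(
    year: int, month: int, day: int,
    hour: int, minute: int, second: int,
    nano_of_second: int,
) -> Tuple[int, int]:
    """Assemble a local date-time and decompose it into (epochMicros, nanosWithinMicro).

    Hypothetical helper for illustration only.
    """
    if not 0 <= nano_of_second <= 999_999_999:
        raise ValueError("nano_of_second must be in [0, 999999999]")

    # Whole seconds since the epoch; floor division keeps pre-epoch values correct.
    local = datetime(year, month, day, hour, minute, second)
    epoch_seconds = (local - EPOCH) // timedelta(seconds=1)

    # nano_of_second is non-negative, so the remainder is already in [0, 999].
    micros_of_second, nanos_within_micro = divmod(nano_of_second, 1000)
    return epoch_seconds * 1_000_000 + micros_of_second, nanos_within_micro


print(ntz_to_parts(1970, 1, 1, 0, 0, 0, 123_456_789))        # (123456, 789)
print(ntz_to_parts(1969, 12, 31, 23, 59, 59, 999_999_999))   # (-1, 999)
```

Because the fraction is always added on top of floored whole seconds, no separate sign handling is needed for timestamps before the epoch.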

h2. Acceptance criteria

* New parse APIs return normalized {{TimestampNTZNanos}} / 
{{TimestampLTZNanos}} for valid strings with up to 9 fractional digits.
* Precision *p* in \[7, 9\] is enforced on excess digits per the documented 
truncate/round rule.
* Existing microsecond parse tests pass without modification.
* New unit tests cover NTZ/LTZ, time-zone suffixes, and an edge-case corpus 
(epoch, the 1582 calendar cutover, end of range in year 9999) with sub-micro 
fractions.

h2. Dependencies

* *Requires:* [SPARK-56981|https://github.com/apache/spark/pull/56059] merged 
(physical value types).
* *Blocks:* string-to-nanos cast matrix, typed SQL literals with sub-micro 
values, Parquet/string ingest tests, {{RandomDataGenerator}} {{specialTs}} 
corpus.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
