[ 
https://issues.apache.org/jira/browse/SPARK-57032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-57032:
--------------------------------

    Assignee: Max Gekk

> Extend timestamp string parsing for nanosecond fractional precision (p in 7–9)
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-57032
>                 URL: https://issues.apache.org/jira/browse/SPARK-57032
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>
> h2. Summary
> Extend Spark's existing timestamp string parsing to preserve fractional 
> seconds beyond microsecond precision and produce normalized 
> nanosecond-capable internal values {{(epochMicros, nanosWithinMicro)}} for 
> {{TimestampNTZNanosType(p)}} and {{TimestampLTZNanosType(p)}} where *p* is in 
> \[7, 9\].
> This is the first sub-task of the nanosecond datetime conversion utilities 
> work under [SPARK-56822|https://issues.apache.org/jira/browse/SPARK-56822] 
> (SPIP: Timestamps with nanosecond precision). It must *extend* current 
> parsers, not duplicate them.
> h2. Background
> Today Spark parses timestamp strings via 
> {{SparkDateTimeUtils.parseTimestampString}} and downstream helpers 
> ({{stringToTimestamp}}, {{stringToTimestampWithoutTimeZone}}, 
> {{TimestampFormatter}}). The fractional-second segment is stored in 
> {{segments(6)}} but:
> * Input with *more than 6* fractional digits is *truncated* to microseconds 
> (comment at line 623: "loss of precision").
> * Results are returned as a single *microsecond* {{Long}}.
> Logical types {{TimestampNTZNanosType}} / {{TimestampLTZNanosType}} and 
> physical values {{TimestampNTZNanos}} / {{TimestampLTZNanos}} already exist 
> ([SPARK-56876|https://issues.apache.org/jira/browse/SPARK-56876], [PR #56059 
> / SPARK-56981|https://github.com/apache/spark/pull/56059]). SQL type syntax 
> {{TIMESTAMP_NTZ(p)}} / {{TIMESTAMP_LTZ(p)}} is parsed 
> ([SPARK-56965|https://github.com/apache/spark/commit/4bbf75e9e672ccbcf762f5c7258be501b0ea7f5a]).
>  Without this change, string inputs with 7-9 fractional digits cannot be 
> converted to the SPIP composite representation.
> Per the SPIP, the internal layout is:
> * *epochMicros* -- signed epoch microseconds (same grid as existing 
> timestamps)
> * *nanosWithinMicro* -- {{short}} in *\[0, 999\]* (sub-micro remainder, 
> normalized)
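
The normalized layout described above can be illustrated with a minimal sketch. The names `NanosTimestamp` and `normalize` are hypothetical stand-ins for illustration only; they are not the actual `TimestampNTZNanos` / `TimestampLTZNanos` classes, whose constructors are not shown in this issue.

```scala
// Hypothetical stand-in for the SPIP composite value; not Spark's actual class.
final case class NanosTimestamp(epochMicros: Long, nanosWithinMicro: Short)

object NanosTimestamp {
  private val NanosPerMicro = 1000L

  // Folds an arbitrary signed nanosecond offset into the pair so that the
  // remainder always lands in [0, 999]. floorDiv/floorMod round toward
  // negative infinity, which keeps the remainder non-negative.
  def normalize(epochMicros: Long, nanos: Long): NanosTimestamp = {
    val carry = Math.floorDiv(nanos, NanosPerMicro)
    val rem = Math.floorMod(nanos, NanosPerMicro)
    // addExact throws ArithmeticException on Long overflow instead of wrapping.
    NanosTimestamp(Math.addExact(epochMicros, carry), rem.toShort)
  }
}
```

Under these assumptions, `NanosTimestamp.normalize(0L, 1000L)` yields `NanosTimestamp(1, 0)` and `NanosTimestamp.normalize(0L, -1L)` yields `NanosTimestamp(-1, 999)`, so one nanosecond before the epoch still carries a non-negative remainder.
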
> h2. Scope
> # *Extend fractional-second handling in {{parseTimestampString}} (or a shared 
> helper it calls)*
> #* Accept *1-9* fractional digits (maintain backward-compatible behavior for 
> <=6 digits used by micro types).
> #* For nanos parsing APIs: derive {{epochMicros}} + {{nanosWithinMicro}} 
> without loss for digits 7-9.
> #* Apply *precision p* rules: digits beyond *p* truncate or round per SPIP 
> (document choice; align with future cast behavior).
> #* Preserve existing accepted formats (ISO-8601 variants, space/{{T}} 
> separator, optional zone suffix) documented on {{parseTimestampString}}.
> # *Add new package-private parse entry points*, e.g.:
> #* {{stringToTimestampNTZNanos(s: UTF8String, precision: Int): 
> Option[TimestampNTZNanos]}}
> #* {{stringToTimestampLTZNanos(s: UTF8String, precision: Int, timeZoneId: 
> ZoneId): Option[TimestampLTZNanos]}}
> #* ANSI variants that throw on invalid input (mirror 
> {{stringToTimestampAnsi}}).
> # *Normalization invariant:* output always satisfies {{nanosWithinMicro}} in 
> \[0, 999\]; carry into {{epochMicros}} when needed.
> # *Tests* (new suite, e.g. {{TimestampNanosParseSuite}} in {{sql/catalyst}}):
> #* 7-, 8-, and 9-digit fractions at each precision *p*
> #* Edge cases: {{.0}}, {{.999999999}}, trailing zeros, exactly 6 digits 
> (micro-compatible path)
> #* NTZ vs LTZ (session / explicit zone)
> #* Invalid: 10+ fractional digits, fractional part out of range after 
> normalization
> #* Regression: existing {{TimestampFormatterSuite}} / micro parse tests 
> unchanged
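
A rough sketch of the fractional-digit handling in the scope above, assuming digits beyond *p* are truncated (the issue leaves truncate vs. round open, so this is one possible choice, not the decided rule). `FractionSketch.parseFraction` is a hypothetical helper, not part of `SparkDateTimeUtils`.

```scala
object FractionSketch {
  // Parses the digits that follow the decimal point of the seconds field.
  // Returns Some((microsOfSecond, nanosWithinMicro)), or None when the input
  // is empty, longer than 9 digits, or contains a non-digit.
  // Assumption: digits beyond precision p are truncated, not rounded.
  def parseFraction(digits: String, p: Int): Option[(Int, Short)] = {
    require(p >= 7 && p <= 9, s"precision must be in [7, 9], got $p")
    val valid = digits.nonEmpty && digits.length <= 9 &&
      digits.forall(c => c >= '0' && c <= '9')
    if (!valid) {
      None
    } else {
      // Right-pad to 9 digits so the value is always expressed in nanoseconds.
      val nanos = (digits + "000000000").take(9).toInt
      // Size of the smallest unit that precision p can represent, in nanos.
      val unit = p match {
        case 7 => 100
        case 8 => 10
        case _ => 1
      }
      val truncated = nanos - nanos % unit
      Some((truncated / 1000, (truncated % 1000).toShort))
    }
  }
}
```

With this sketch, `parseFraction("123456789", 9)` gives `Some((123456, 789))`, `parseFraction("123456789", 7)` gives `Some((123456, 700))`, and a 10-digit fraction gives `None`. A rounding variant would also need to carry into the seconds field when the fraction rounds up to a full second.
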
> h2. Implementation notes
> * Prefer extending {{SparkDateTimeUtils.parseTimestampString}} rather than a 
> second parser; micro paths should keep current behavior (6-digit cap + micro 
> {{Long}}).
> * Reuse {{TimestampNTZNanos}} validation ({{nanosWithinMicro}} in \[0, 999\]).
> * Consider using {{java.time}} ({{LocalDateTime.of(..., nanoOfSecond)}}) 
> internally for date/time + fraction assembly, then decompose into 
> {{(epochMicros, nanosWithinMicro)}} -- consistent with existing 
> {{stringToTimestamp}} structure.
> * Do *not* change behavior of existing {{TimestampType}} / 
> {{TimestampNTZType}} string parsing.
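
The `java.time` route mentioned in the notes above could look roughly like the following. `JavaTimeSketch`, `ntz`, and `ltz` are hypothetical names, and the result is a plain tuple rather than the real `TimestampNTZNanos` / `TimestampLTZNanos` values. The sketch assumes NTZ values are computed by reading the local fields at a UTC offset.

```scala
import java.time.{LocalDateTime, ZoneId, ZoneOffset}

object JavaTimeSketch {
  // Splits (epochSecond, nanoOfSecond) into (epochMicros, nanosWithinMicro).
  // java.time keeps nanoOfSecond in [0, 999999999] even before the epoch,
  // so the sub-micro remainder is already in [0, 999].
  private def decompose(epochSecond: Long, nanoOfSecond: Int): (Long, Short) = {
    val micros = Math.addExact(
      Math.multiplyExact(epochSecond, 1000000L), nanoOfSecond / 1000L)
    (micros, (nanoOfSecond % 1000).toShort)
  }

  // NTZ (assumption): the local fields are read as if they were in UTC.
  def ntz(ldt: LocalDateTime): (Long, Short) =
    decompose(ldt.toEpochSecond(ZoneOffset.UTC), ldt.getNano)

  // LTZ: the local fields are resolved in the given zone first.
  def ltz(ldt: LocalDateTime, zone: ZoneId): (Long, Short) = {
    val instant = ldt.atZone(zone).toInstant
    decompose(instant.getEpochSecond, instant.getNano)
  }
}
```

For example, `JavaTimeSketch.ntz(LocalDateTime.of(1969, 12, 31, 23, 59, 59, 999999999))` returns `(-1, 999)`: one nanosecond before the epoch maps to epoch microsecond -1 with a remainder of 999.
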
> h2. Acceptance criteria
> * New parse APIs return normalized {{TimestampNTZNanos}} / 
> {{TimestampLTZNanos}} for valid strings with up to 9 fractional digits.
> * Precision *p* in {7, 8, 9} enforced on excess digits per documented 
> truncate/round rule.
> * Existing microsecond parse tests pass without modification.
> * New unit tests cover NTZ/LTZ, time-zone suffixes, and edge-case corpus 
> (epoch, 1582 cutover, 9999 end range) with sub-micro fractions.
> h2. Dependencies
> * *Requires:* [SPARK-56981|https://github.com/apache/spark/pull/56059] merged 
> (physical value types).
> * *Blocks:* string-to-nanos cast matrix, typed SQL literals with sub-micro 
> values, Parquet/string ingest tests, {{RandomDataGenerator}} {{specialTs}} 
> corpus.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
