Max Gekk created SPARK-57250:
--------------------------------
Summary: Construct sub-microsecond timestamp typed literals with
precision derived from fractional digits
Key: SPARK-57250
URL: https://issues.apache.org/jira/browse/SPARK-57250
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.3.0
Reporter: Max Gekk
h2. What
Make the SQL typed-literal constructor in {{AstBuilder.visitTypeConstructor}}
produce
nanosecond-capable timestamp literals when the literal string carries 7-9
fractional-second
digits, instead of truncating to a microsecond {{Long}}. Following the ANSI SQL
rule, the
fractional-second precision {{p}} of the literal is *derived from the number of
digits in the
fractional part of the string* - there is no precision argument in the literal
syntax.
Concretely, for a typed literal {{[TYPE] '<value>'}}:
* 0-6 fractional digits -> unchanged: a microsecond literal of
{{TimestampType}} /
{{TimestampNTZType}} (current behavior; aligned with SPARK-57163).
* 7-9 fractional digits -> a nanos literal:
{{Literal(TimestampNTZNanos(epochMicros, nanosWithinMicro),
TimestampNTZNanosType(p))}}
or the LTZ counterpart, where {{p}} is the fractional digit count.
* more than 9 fractional digits -> a user-facing error (precision exceeds the
maximum of 9).
Parent: SPARK-56822. Builds on SPARK-56876 (logical types), SPARK-56965 (parser
names the
types), SPARK-56981 (physical row storage), SPARK-57032 (string parsing for 7-9
digits), and
SPARK-57033 (java.time / composite conversion). Complements SPARK-57163 (p<=6
-> micro mapping)
and the test-side SPARK-57165 (LiteralGenerator).
h2. Why
The parser can already *name* {{TIMESTAMP_NTZ(p)}} / {{TIMESTAMP_LTZ(p)}} types
(SPARK-56965), but a typed *literal value* with sub-microsecond precision is
still silently
narrowed to micros. Today {{visitTypeConstructor}} routes {{TIMESTAMP}} /
{{TIMESTAMP_NTZ}} /
{{TIMESTAMP_LTZ}} through {{stringToTimestamp}} /
{{stringToTimestampWithoutTimeZone}}, which
return epoch-microsecond {{Long}}s, so {{TIMESTAMP_NTZ '2020-01-01
00:00:00.123456789'}} loses
digits 7-9. Typed literals are the most direct way for users to introduce
constant nanos
values in SQL (filters, INSERT ... VALUES, constant folding), so this unblocks
the minimum
user-visible SQL surface for the preview feature.
h2. ANSI SQL behavior to follow
Per ISO/IEC 9075-2 (SQL/Foundation):
* Grammar: {{<timestamp literal> ::= TIMESTAMP <timestamp string>}}, where the
fractional part
is {{<seconds fraction> ::= <unsigned integer>}} (no fixed digit limit, and
*no precision
token in the literal*).
* Subclause 5.3, Syntax Rule 27: the declared type of a {{<timestamp literal>}}
is
{{TIMESTAMP(P)}}, "where P is the number of digits in <seconds fraction>".
WITH vs WITHOUT
TIME ZONE is determined by whether the string contains a {{<time zone
interval>}} (offset).
* Subclause 6.1, Syntax Rules 34 and 36: default {{<timestamp precision>}} is
6; the maximum is
implementation-defined and "not less than 6". Spark caps it at 9
(nanoseconds).
This matches Spark's existing literal grammar, which has no precision argument:
{{literalType singleStringLitWithoutMarker}} with {{literalType}} one of
{{TIMESTAMP | TIMESTAMP_NTZ | TIMESTAMP_LTZ}}. So no grammar change is
required; only the
value-construction side of {{visitTypeConstructor}} changes.
h2. Scope
{{sql/catalyst/.../parser/AstBuilder.scala}} ({{visitTypeConstructor}}):
* Determine the fractional digit count of the literal string and pick micro vs
nanos accordingly.
* For 7-9 digits, build {{TimestampNTZNanos}} / {{TimestampLTZNanos}} values
via the
SPARK-57032 nanos parse helpers and wrap them in {{Literal(...,
TimestampNTZNanosType(p))}} /
{{TimestampLTZNanosType(p)}}.
* Keep the existing NTZ/LTZ resolution for the bare {{TIMESTAMP}} keyword: when
the string
contains a time-zone offset, resolve to the LTZ (WITH TIME ZONE) variant;
otherwise NTZ. The
explicit {{TIMESTAMP_NTZ}} / {{TIMESTAMP_LTZ}} keywords pin the variant.
* Validate precision: more than 9 fractional digits raises the existing
user-facing
invalid-precision / cannot-parse error.
* Gate behind {{spark.sql.timestampNanosTypes.enabled}}. When the flag is off,
retain current
microsecond behavior (parse and narrow at 6 digits) exactly as today.
* Reuse {{convertSpecialTimestamp}} / {{convertSpecialTimestampNTZ}} for
special values (these
remain micro).
h2. Out of scope
* Custom-format / pattern-based literal parsing (covered by the nanos
TimestampFormatter,
SPARK-57162).
* Casts and type coercion for nanos types (separate cast-matrix and coercion
subtasks).
* {{LiteralGenerator}} test support (SPARK-57165).
* Any new literal syntax with an explicit precision token (e.g.
{{TIMESTAMP_NTZ(9) '...'}}) -
not ANSI and not in scope.
h2. Design notes
* Precision is *implied*, never declared, in the literal - count the digits
after the decimal
point in the seconds field.
* {{TimestampNTZNanosType}} / {{TimestampLTZNanosType}} require precision in
[7, 9]; literals
with 0-6 fractional digits therefore cannot be nanos types and stay micro by
construction.
* Reuse the {{TimestampNanosVal}} normalization invariant ({{nanosWithinMicro}}
in [0, 999]).
* No rounding of sub-precision digits is needed here because {{p}} equals the
exact digit count;
truncation/rounding rules belong to cast/format paths.
h2. How was this patch tested
* New cases in the parser/expression literal suites: parse round-trip for
{{TIMESTAMP_NTZ}} /
{{TIMESTAMP_LTZ}} / {{TIMESTAMP}} literals with 7, 8, and 9 fractional
digits; assert the
resulting {{Literal}} dataType is {{TimestampNTZNanosType(p)}} /
{{TimestampLTZNanosType(p)}}
and the composite value matches a java.time oracle (SPARK-57033 helpers).
* Boundary cases: {{nanosWithinMicro}} 0 and 999, pre-epoch instants (e.g. 1582
cutover),
9999-12-31, and exactly 6 digits (must stay micro).
* Negative: more than 9 fractional digits raises the expected error; behavior
with the preview
flag disabled is unchanged.
* TIMESTAMP-keyword NTZ-vs-LTZ resolution based on presence of a time-zone
offset.
h2. Does this PR introduce any user-facing change
Yes, but gated. When {{spark.sql.timestampNanosTypes.enabled}} is true, typed
timestamp literals
with 7-9 fractional digits resolve to nanosecond-capable types instead of being
narrowed to
microseconds, e.g.:
{code:sql}
SELECT TIMESTAMP_NTZ '2020-01-01 00:00:00.123456789'; --
TimestampNTZNanosType(9)
{code}
When the flag is false (production default), behavior is unchanged.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]