fitzee commented on PR #40127:
URL: https://github.com/apache/superset/pull/40127#issuecomment-4455426870

   ## Griffen Review — `fix: coerce out-of-bounds nanosecond timestamps to NaT 
instead of raising`
   
   *Reviewed by Griffen (AI Distinguished Engineer), with Matt Fitzgerald. 
Reach out to Matt directly if anything needs clarifying.*
   
   ---
   
   ## Context
   
   **Sources consulted:** PR description, diff, `result_set.py` surrounding 
context (master branch)
   **Missing context:** No linked Shortcut story or GitHub issue — reviewing 
based on PR description alone.
   **Intent:** Stop `OutOfBoundsDatetime` exceptions on tz-aware datetime 
columns with dates beyond ~2262-04-11 from surfacing as ERROR log noise in 
observability tooling on every chart auto-refresh. *(PR description)*
   
   ---
   
   ## Door Classification
   
   **Two-way door**
   
   | Dimension | Assessment |
   |---|---|
   | Rollback complexity | **Simple** — remove `errors="coerce"` from two call 
sites, remove the warning block |
   | Blast radius | **Narrow** — only affects tz-aware datetime columns with 
values beyond 2262-04-11; no schema changes, no migrations |
   | Deployment reversible | **Yes** |
   | Reversibility mechanisms | Not needed; inherently reversible |
   
   ---
   
   ## Findings
   
   **[MEDIUM] [Two-way] — `result_set.py` coercion is silent; users see null 
with no explanation**
   
   `utils.py` correctly logs a warning with a count of coerced values. 
`result_set.py` does not. A user querying a table with year 3118 dates will see 
`NaT`/null cells in their chart with no indication of why. The PR description 
says NaT is preferable to clamping because "clamping silently lies" — but 
silent NaT in query results is also a lie, just a different one.
   
   *Confidence: Fact*
   *Mitigation:* Add a `logger.warning()` in `result_set.py` analogous to the 
one in `utils.py` — per-column, ideally suppressed after the first occurrence 
to avoid noise on auto-refresh. At minimum, document the silent behavior 
explicitly.
   
   ---
   
   **[LOW] [Two-way] — Import path (`utils.py`) has no test coverage**
   
   The warning-with-count behavior in `utils.py` is untested. If 
`errors="coerce"` is accidentally removed or the warning logic is broken, 
nothing catches it.
   
   *Confidence: Fact*
   *Mitigation:* Add a unit test for `load_data` that exercises the coercion 
path and asserts the warning is logged with the correct count.
   
   ---
   
   **[LOW] — Fix is scoped to tz-aware datetime path only**
   
   The fix in `result_set.py` is inside `if sample.tzinfo:` — only 
timezone-aware datetimes. For tz-naive datetime columns with out-of-bounds 
values, a different code path applies (direct PyArrow conversion, not pandas). 
That path presumably still raises and logs at ERROR. May be intentionally out 
of scope, but worth a follow-up ticket or comment.
   
   *Confidence: Fact (observable in diff and surrounding context)*
   
   ---
   
   **[NIT] — `n_coerced` arithmetic deserves a comment**
   
   ```python
   n_coerced = int(converted.isna().sum() - df[column_name].isna().sum())
   ```
   
   Correct — counts net-new NaTs introduced by coercion — but non-obvious on 
first read. A one-liner would help the next reader.
   
   ---
   
   ## What's Sound
   
   The fix is correct. `errors="coerce"` → NaT is the right resolution for this 
pandas limitation. The PR description correctly rules out clamping (silently 
misrepresents the value) and string conversion (breaks the PyArrow column 
schema). The outer `try/except Exception` in `result_set.py` is preserved as a 
safety net; this change merely prevents the known out-of-bounds case from 
reaching it.
   
   The test (`test_out_of_bounds_datetime_coerced_to_nat`) verifies both what 
should happen (NaT in the output) and what should *not* happen 
(`logger.exception` not called) — the right dual assertion for a fix whose 
entire point is eliminating ERROR log noise.
   
   The import path warning is good practice: it converts a silent data 
transformation into an observable event for operators, without blocking the 
import.
   
   ---
   
   ## Architectural Read
   
   Pandas' int64 nanosecond timestamp ceiling (~2262) is a well-known platform 
constraint. The correct mitigation at each boundary is coerce-with-logging, not 
raise. This PR correctly applies that to two call sites but leaves tz-naive 
out-of-bounds datetimes hitting a different raise path elsewhere — a known 
scope limitation.
   
   The silent coercion in `result_set.py` is the one design judgment worth 
revisiting. The choice isn't *ERROR log noise vs. silence* — it's *ERROR log 
noise vs. WARNING + explanation*. The import path already demonstrates the 
better pattern; it just needs to be applied consistently to the query path too.
   
   ---
   
   **Verdict: LGTM with nits** — the silent coercion in `result_set.py` is 
worth addressing, but not a blocker. Core fix is correct, test verifies the 
right invariants.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to