fitzee commented on PR #40127: URL: https://github.com/apache/superset/pull/40127#issuecomment-4455426870
## Griffen Review — `fix: coerce out-of-bounds nanosecond timestamps to NaT instead of raising` *Reviewed by Griffen (AI Distinguished Engineer), with Matt Fitzgerald. Reach out to Matt directly if anything needs clarifying.* --- ## Context **Sources consulted:** PR description, diff, `result_set.py` surrounding context (master branch) **Missing context:** No linked Shortcut story or GitHub issue — reviewing based on PR description alone. **Intent:** Stop `OutOfBoundsDatetime` exceptions on tz-aware datetime columns with dates beyond ~2262-04-11 from surfacing as ERROR log noise in observability tooling on every chart auto-refresh. *(PR description)* --- ## Door Classification **Two-way door** | Dimension | Assessment | |---|---| | Rollback complexity | **Simple** — remove `errors="coerce"` from two call sites, remove the warning block | | Blast radius | **Narrow** — only affects tz-aware datetime columns with values beyond 2262-04-11; no schema changes, no migrations | | Deployment reversible | **Yes** | | Reversibility mechanisms | Not needed; inherently reversible | --- ## Findings **[MEDIUM] [Two-way] — `result_set.py` coercion is silent; users see null with no explanation** `utils.py` correctly logs a warning with a count of coerced values. `result_set.py` does not. A user querying a table with year 3118 dates will see `NaT`/null cells in their chart with no indication of why. The PR description says NaT is preferable to clamping because "clamping silently lies" — but silent NaT in query results is also a lie, just a different one. *Confidence: Fact* *Mitigation:* Add a `logger.warning()` in `result_set.py` analogous to the one in `utils.py` — per-column, ideally suppressed after the first occurrence to avoid noise on auto-refresh. At minimum, document the silent behavior explicitly. --- **[LOW] [Two-way] — Import path (`utils.py`) has no test coverage** The warning-with-count behavior in `utils.py` is untested. If `errors="coerce"` is accidentally removed or the warning logic is broken, nothing catches it. *Confidence: Fact* *Mitigation:* Add a unit test for `load_data` that exercises the coercion path and asserts the warning is logged with the correct count. --- **[LOW] — Fix is scoped to tz-aware datetime path only** The fix in `result_set.py` is inside `if sample.tzinfo:` — only timezone-aware datetimes. For tz-naive datetime columns with out-of-bounds values, a different code path applies (direct PyArrow conversion, not pandas). That path presumably still raises and logs at ERROR. May be intentionally out of scope, but worth a follow-up ticket or comment. *Confidence: Fact (observable in diff and surrounding context)* --- **[NIT] — `n_coerced` arithmetic deserves a comment** ```python n_coerced = int(converted.isna().sum() - df[column_name].isna().sum()) ``` Correct — counts net-new NaTs introduced by coercion — but non-obvious on first read. A one-liner would help the next reader. --- ## What's Sound The fix is correct. `errors="coerce"` → NaT is the right resolution for this pandas limitation. The PR description correctly rules out clamping (silently misrepresents the value) and string conversion (breaks the PyArrow column schema). The outer `try/except Exception` in `result_set.py` is preserved as a safety net; this change merely prevents the known out-of-bounds case from reaching it. The test (`test_out_of_bounds_datetime_coerced_to_nat`) verifies both what should happen (NaT in the output) and what should *not* happen (`logger.exception` not called) — the right dual assertion for a fix whose entire point is eliminating ERROR log noise. The import path warning is good practice: it converts a silent data transformation into an observable event for operators, without blocking the import. --- ## Architectural Read Pandas' int64 nanosecond timestamp ceiling (~2262) is a well-known platform constraint. The correct mitigation at each boundary is coerce-with-logging, not raise. This PR correctly applies that to two call sites but leaves tz-naive out-of-bounds datetimes hitting a different raise path elsewhere — a known scope limitation. The silent coercion in `result_set.py` is the one design judgment worth revisiting. The choice isn't *ERROR log noise vs. silence* — it's *ERROR log noise vs. WARNING + explanation*. The import path already demonstrates the better pattern; it just needs to be applied consistently to the query path too. --- **Verdict: LGTM with nits** — the silent coercion in `result_set.py` is worth addressing, but not a blocker. Core fix is correct, test verifies the right invariants. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
