fitzee commented on PR #40127:
URL: https://github.com/apache/superset/pull/40127#issuecomment-4456621215

   ## Griffen Re-Review — Round 2 🦅
   
   *Reviewed by Griffen (AI Distinguished Engineer), with Matt Fitzgerald. 
Reach out to Matt directly if anything needs clarifying.*
   
   ---
   
   Ok so — the first round of changes were genuinely good. 
`_convert_temporal_columns()` with try-strict-first, catch only 
`OutOfBoundsDatetime`, then fall back — that's the right pattern. The four new 
tests cover the important cases. The inline comment on `n_coerced` is clear. 
`result_set.py` is solid.
   
   **ALMOST... but not quite.** 🎯
   
   ---
   
   ## The Remaining Gap
   
   There's one edge case the current implementation doesn't handle: a column 
containing **both** out-of-bounds timestamps **and** malformed values.
   
   ```python
   # What happens with this input?
   df = pd.DataFrame({"ts": ["3118-01-01", "not-a-date"]})
   ```
   
   Walk through the code:
   
   1. `pd.to_datetime(df["ts"])` — pandas hits `"3118-01-01"` first → raises 
`OutOfBoundsDatetime` ✓ caught
   2. Fallback: `pd.to_datetime(df["ts"], errors="coerce")` — now **both** 
values become `NaT`
   3. `"not-a-date"` silently disappears into `NaT`. No error. Import succeeds. 
🙈
   
   The behavior is **order-dependent**: if `"not-a-date"` had appeared first, 
pandas would raise a `ValueError` (not `OutOfBoundsDatetime`), the except 
clause wouldn't catch it, and the import would correctly fail.
   
   So whether a malformed value corrupts your import silently or fails loudly 
depends on... where it happens to sit in the CSV. That's not a contract you 
want to ship.
   
   ---
   
   ## The Fix
   
   Row-level fallback in the except block — parse each value individually so 
you can distinguish OOB from malformed:
   
   ```python
   except OutOfBoundsDatetime:
       result = []
       n_coerced = 0
       for val in df[column_name]:
           if pd.isna(val):
               result.append(pd.NaT)
           else:
               try:
                   result.append(pd.to_datetime(val))
               except OutOfBoundsDatetime:
                   result.append(pd.NaT)
                   n_coerced += 1
               # any other parse error (ValueError, ParserError, etc.) re-raises
               # preserving the original import-fail behavior for bad data
       df[column_name] = result
       if n_coerced > 0:
           logger.warning(
               "Coerced %d out-of-bounds datetime value(s) in column '%s' to 
NaT",
               n_coerced,
               column_name,
           )
   ```
   
   Yes, this is O(n) per-value instead of vectorized — but this is a CSV import 
path, not a hot query path. Correctness wins.
   
   And the test to add:
   
   ```python
   def test_mixed_oob_and_malformed_raises(self) -> None:
       """OOB + malformed in same column: malformed should still fail the 
import."""
       df = pd.DataFrame({"ts": ["3118-01-01", "not-a-date"]})
       with pytest.raises(Exception):
           _convert_temporal_columns(df, {"ts": DateTime()})
   ```
   
   ---
   
   ## What's Already Good (and stays good)
   
   - `result_set.py` — no changes needed, ship it
   - The 4 existing tests — all valid, keep them
   - The overall structure of `_convert_temporal_columns()` — right idea, just 
needs the row-level fallback
   
   ---
   
   So close! One more pass on the fallback and this is done. 🦅


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to