fitzee commented on PR #40127:
URL: https://github.com/apache/superset/pull/40127#issuecomment-4483522272

   ## Proposed fix for the MEDIUM blocker (mixed OOB + malformed column)
   
   Hi @eschutho — to save another review round, here's the exact fix for the 
invariant issue flagged in prior reviews. The commit is at 
[`fitzee/superset:griffen/40127-fix`](https://github.com/fitzee/superset/tree/griffen/40127-fix)
 if you want to cherry-pick.
   
   ---
   
   ### The issue
   
   `_convert_temporal_columns()` catches `OutOfBoundsDatetime` on the whole 
column, then falls back to `errors="coerce"` on the whole column. This means a 
column like `["3118-01-01", "not-a-date"]` silently coerces "not-a-date" → NaT 
instead of raising, because the OOB exception is caught first and the fallback 
coerces everything.
   
   ### The fix — `superset/commands/dataset/importers/v1/utils.py`
   
   Replace the `except OutOfBoundsDatetime:` block:
   
   ```python
               except OutOfBoundsDatetime:
                   # Row-level fallback: coerce only OOB values; re-raise for 
malformed
                   # strings. Whole-column errors="coerce" would silently 
swallow
                   # malformed values that happen to share a column with an OOB 
date.
                   original = df[column_name].copy()
                   result = []
                   for val in original:
                       if pd.isna(val):
                           result.append(pd.NaT)
                           continue
                       try:
                           result.append(pd.to_datetime(val))
                       except OutOfBoundsDatetime:
                           result.append(pd.NaT)
                       # Other exceptions (e.g. malformed strings) propagate
                   converted = pd.Series(result, index=original.index)
                   n_coerced = int(converted.isna().sum() - 
original.isna().sum())
                   if n_coerced > 0:
                       logger.warning(
                           "Coerced %d out-of-bounds datetime value(s) "
                           "in column '%s' to NaT",
                           n_coerced,
                           column_name,
                       )
                   df[column_name] = converted
   ```
   
   ### Test changes — `tests/unit_tests/commands/importers/v1/utils_test.py`
   
   1. **Add parametrized mixed-column test** (the missing proof):
   
   ```python
   @pytest.mark.parametrize(
       "values",
       [
           ["3118-01-01", "not-a-date"],
           ["not-a-date", "3118-01-01"],
       ],
   )
   def test_mixed_out_of_bounds_and_malformed_still_raises(
       self, values: list[str]
   ) -> None:
       """
       A column mixing out-of-bounds and malformed dates must raise, not 
silently
       coerce the malformed value to NaT. Both orderings are tested to ensure 
the
       invariant holds regardless of which error pandas encounters first.
       """
       from sqlalchemy import DateTime
   
       from superset.commands.dataset.importers.v1.utils import (
           _convert_temporal_columns,
       )
   
       df = pd.DataFrame({"ts": values})
       with pytest.raises((ValueError, pd.errors.ParserError)):
           _convert_temporal_columns(df, {"ts": DateTime()})
   ```
   
   2. **Fix exception type** in `test_malformed_dates_still_raise`:
      ```python
      # Before
      with pytest.raises(pd.errors.ParserError):
      # After
      with pytest.raises((ValueError, pd.errors.ParserError)):
      ```
   
   3. **Remove unused helpers** `_make_dataset_col` and 
`_apply_temporal_conversion` (and the `MagicMock` import).
   
   ---
   
   With the row-level fallback in place, both mixed-column parametrize cases 
raise correctly. The existing tests (`test_out_of_bounds_coerced_to_nat`, 
`test_warning_count_excludes_preexisting_nulls`) continue to pass unchanged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to