eugenegujing opened a new issue, #5935:
URL: https://github.com/apache/texera/issues/5935

   ### What happened?
   
   When a CSV column holds integer-looking values but also contains missing 
values, the workflow crashes inside any pandas-based Python operator (e.g. 
Sort).
   
   Root cause chain:
   1. CSV File Scan auto-infers such a column as `integer` 
(`inferSchemaFromRows` in `AttributeTypeUtils.scala`), and there is no 
per-column type override in the UI.
   2. Python operators run on pandas. With its default NumPy-backed dtypes, 
pandas up-casts an integer column that contains any missing value to float64, 
because the missing value is stored as NaN, which is a float that an int64 
column cannot hold. So `121` becomes `121.0`.
   3. On output, the Python worker validates each tuple against the declared 
schema, which still says INTEGER. The actual value is a float, so it raises:
   
      ```
      TypeError: Unmatched type for field 'weight', expected AttributeType.INT, got 119.0 (<class 'float'>) instead.
   
      File ".../core/models/tuple.py", line 361, in validate_schema (called from finalize -> on_finish)
      ```
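
Step 2 of the chain can be checked outside Texera with plain pandas. The 
snippet below is only an illustration: the inline CSV and its values are made 
up for the example and are not taken from diabetes.csv. It also shows that the 
nullable `Int64` extension dtype keeps such a column integral; whether the 
Python worker could adopt it is a separate question.

```python
import io

import pandas as pd

# Made-up sample: 'weight' is integer-valued except for one blank cell.
csv_text = "age,weight\n46,121\n29,\n58,119\n"

# Default inference: the blank becomes NaN, so the whole column is float64.
df = pd.read_csv(io.StringIO(csv_text))
weight_dtype = str(df["weight"].dtype)
first_weight = df["weight"].iloc[0]
print(weight_dtype, repr(first_weight))

# Opting in to the nullable extension dtype keeps integers and stores <NA>.
nullable = pd.read_csv(io.StringIO(csv_text), dtype={"weight": "Int64"})
nullable_dtype = str(nullable["weight"].dtype)
print(nullable_dtype, nullable["weight"].tolist())
```

The `age` column, which has no blanks, stays `int64` in both frames, so only 
columns with missing cells change type.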
   
   This affects every integer column that also has missing values: the user 
must hit the error, identify the column, and cast it manually. In our dataset 
(diabetes.csv) 11 columns are affected (weight, chol, hdl, height, bp.1s, 
bp.1d, bp.2s, bp.2d, waist, hip, time.ppn).
   
   Expected:
   Integer columns containing nulls should be handled gracefully instead of 
crashing. Either:
   - (a) the Python worker's schema validation should coerce an integral float 
(e.g. `119.0`) back to INTEGER and NaN to null, or
   - (b) CSV File Scan should infer a null-containing integer column as DOUBLE, 
or
   - (c) the UI should expose a per-column type override on CSV File Scan.
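
Option (a) could look roughly like the sketch below. This is not the code in 
`tuple.py`: `coerce_to_int_field` is a hypothetical helper name, and the 
policy (integral floats become `int`, NaN becomes `None`, everything else is 
rejected) is an assumption about what "handled gracefully" should mean.

```python
import math
import numbers
from typing import Any, Optional


def coerce_to_int_field(value: Any) -> Optional[int]:
    """Hypothetical helper: normalize a value destined for an INTEGER field."""
    if value is None:
        return None
    # bool is a subclass of int; reject it so True is not stored as 1.
    if isinstance(value, bool):
        raise TypeError(f"cannot coerce {value!r} to INTEGER")
    if isinstance(value, numbers.Integral):
        return int(value)
    if isinstance(value, float):
        # NaN is how pandas marks a missing cell in a float64 column.
        if math.isnan(value):
            return None
        # Accept only floats with no fractional part, e.g. 119.0.
        if value.is_integer():
            return int(value)
    raise TypeError(f"cannot coerce {value!r} ({type(value)}) to INTEGER")


print(coerce_to_int_field(119.0))
print(coerce_to_int_field(float("nan")))
```

A value such as `119.5` still raises `TypeError`, so genuinely non-integer 
data is not silently truncated.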
   
   Current workaround: insert a Type Casting operator and manually cast every 
affected integer column to `double`. This works but is manual and error-prone 
(casting to `integer` instead of `double` silently reproduces the bug).
   
   ### How to reproduce?
   
   1. Prepare a CSV with an integer-valued column that contains at least one 
empty cell, e.g. diabetes.csv where `weight` is all integers except one blank.
   2. Build workflow: CSV File Scan -> Sort. In Sort, sort by any column (e.g. 
`age`).
   3. Run the workflow.
   4. The Sort operator fails on finish with:
      ```
      TypeError: Unmatched type for field 'weight', expected AttributeType.INT, got 119.0 (<class 'float'>) instead.
      ```
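
For anyone without diabetes.csv at hand, a minimal input for step 1 can be 
generated with the standard library. The file name `int_with_blank.csv` and 
the three data rows are made up for illustration; any integer-valued column 
with one empty cell should serve the same purpose.

```python
import csv
import os
import tempfile

# Made-up rows: 'weight' is integer-valued except for one empty cell.
rows = [
    ("age", "weight"),
    (46, 121),
    (29, ""),
    (58, 119),
]

# Write into a fresh temporary directory to avoid clobbering existing files.
path = os.path.join(tempfile.mkdtemp(), "int_with_blank.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

with open(path, newline="") as f:
    content = f.read()
print(path)
print(content)
```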
   
   Workaround that fixes it:
   CSV File Scan -> Type Casting (cast weight/waist/hip/time.ppn -> double) -> 
Sort, then re-run.
   
   ### Version/Branch
   
   1.3.0-incubating-SNAPSHOT (main)
   
   ### Commit Hash (Optional)
   
   _No response_
   
   ### What browsers are you seeing the problem on?
   
   _No response_
   
   ### Relevant log output
   
   ```shell
   
   ```

