Re: [PR] feat: Support reading CSV files with inconsistent column counts [datafusion]

via GitHub Sat, 18 Oct 2025 12:43:38 -0700


EeshanBembi commented on PR #17553:
URL: https://github.com/apache/datafusion/pull/17553#issuecomment-3348650594


   > I tried the reproducer from #17516 and it still fails on this PR:
   > 
   > Maybe I don't understand how to use it 🤔
   > 
   > ```sql
   > > create external table foo stored as csv location '/Users/andrewlamb/Downloads/services' options ('truncated_rows' true);
   > 0 row(s) fetched.
   > Elapsed 0.021 seconds.
   > 
   > > select * from foo limit 10;
   > Arrow error: Csv error: incorrect number of fields for line 1, expected 17 got 20
   > ```
   > 
   > It also errors if I just try to read the directory directly:
   > 
   > ```sql
   > (venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo run --bin datafusion-cli
   >     Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.20s
   >      Running `target/debug/datafusion-cli`
   > DataFusion CLI v50.0.0
   > > select * from '/Users/andrewlamb/Downloads/services' limit 10;
   > Arrow error: Csv error: incorrect number of fields for line 1, expected 17 got 20
   > ```
   > 
   > This PR seems like a step in the right direction to me, it just doesn't seem to fix the problem entirely
   > 
   > It sounds like (as follow on issues / PRs) we probably would need to:
   > 
   > 1. Enable schema merging for CSV by default (
   > 2. Implement schema merge using column names (not positions) which is how parquet works, and I think what users would expect.
   
   Hi Andrew! I reproduced your scenario and found the cause. The PR works
   correctly for external tables, but there is a subtle distinction:
   
     What Works ✅
   
     CREATE EXTERNAL TABLE foo STORED AS CSV
     LOCATION '/Users/andrewlamb/Downloads/services'
     OPTIONS ('truncated_rows' 'true');  -- Note: 'true' in quotes
     SELECT * FROM foo LIMIT 10;
   
     What Still Fails ❌
   
     SELECT * FROM '/Users/andrewlamb/Downloads/services' LIMIT 10;
   
     The Issue
   
     You were getting the error because:
     1. Direct file path queries (SELECT * FROM '/path') don't support
        CSV-specific options such as truncated_rows. This is a separate
        limitation that this PR does not address.
     2. Option syntax: use 'truncated_rows' 'true' (with quotes around true),
        not 'truncated_rows' true.
   
     Testing
   
     I created files with exactly 17 vs 20 columns and confirmed:
     - ✅ External table with OPTIONS ('truncated_rows' 'true') works: it
       merges the schemas and fills missing columns with NULL
     - ❌ Direct path queries still fail with the same error you saw
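
     To make "merges schemas and fills missing columns with NULL" concrete,
     here is a minimal Python sketch of the idea. It is not DataFusion's
     implementation: the function name read_padded and the sample data are
     hypothetical, and it assumes columns are matched by position with the
     widest header taken as the merged schema.

```python
import csv
import io


def read_padded(sources):
    """Merge CSV texts with differing column counts (hypothetical helper).

    Assumption: columns are matched by position, and the widest header
    is used as the merged schema. None stands in for SQL NULL.
    """
    parsed = [list(csv.reader(io.StringIO(text))) for text in sources]
    # Pick the header with the most columns as the merged schema.
    header = max((rows[0] for rows in parsed if rows), key=len)
    merged = []
    for rows in parsed:
        for row in rows[1:]:  # skip each file's own header line
            # Pad rows that are shorter than the merged schema.
            merged.append(row + [None] * (len(header) - len(row)))
    return header, merged


narrow = "a,b\n1,2\n"      # file with 2 columns
wide = "a,b,c\n3,4,5\n"    # file with 3 columns
header, rows = read_padded([narrow, wide])
print(header)
print(rows)
```

     Rows from the narrower file come back padded with None in the trailing
     positions. Matching by column name instead of position, as suggested in
     the quoted comment, would need a different merge step.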
   
     Summary
   
     This PR does fix the core issue: CSV schema merging across files with
     different column counts works via external tables. The remaining
     limitation is that direct file path queries don't yet support
     format-specific options.
   
     If you think direct path query support is important for this PR, I'm
     happy to investigate adding it here. It would involve changing how
     DataFusion resolves tables for file paths so that CSV-specific options
     are passed through. Otherwise, try the external table approach with the
     option syntax above and it should work!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

