EeshanBembi commented on PR #17553:
URL: https://github.com/apache/datafusion/pull/17553#issuecomment-3348650594
> I tried the reproducer from #17516 and it still fails on this PR:
>
> Maybe I don't understand how to use it 🤔
>
> ```sql
> > create external table foo stored as csv location '/Users/andrewlamb/Downloads/services' options ('truncated_rows' true);
> 0 row(s) fetched.
> Elapsed 0.021 seconds.
>
> > select * from foo limit 10;
> Arrow error: Csv error: incorrect number of fields for line 1, expected 17 got 20
> ```
>
> It also errors if I just try to read the directory directly:
>
> ```sql
> (venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo run --bin datafusion-cli
> Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.20s
> Running `target/debug/datafusion-cli`
> DataFusion CLI v50.0.0
> > select * from '/Users/andrewlamb/Downloads/services' limit 10;
> Arrow error: Csv error: incorrect number of fields for line 1, expected 17 got 20
> ```
>
> This PR seems like a step in the right direction to me, it just doesn't seem to fix the problem entirely
>
> It sounds like (as follow on issues / PRs) we probably would need to:
>
> 1. Enable schema merging for CSV by default (
> 2. Implement schema merge using column names (not positions) which is how parquet works, and I think what users would expect.

Hi Andrew! I've reproduced the exact scenario and found the issue. The PR is
working correctly for external tables, but there's a subtle distinction:

What Works ✅

    CREATE EXTERNAL TABLE foo STORED AS CSV
    LOCATION '/Users/andrewlamb/Downloads/services'
    OPTIONS ('truncated_rows' 'true'); -- Note: 'true' in quotes
    SELECT * FROM foo LIMIT 10;

What Still Fails ❌

    SELECT * FROM '/Users/andrewlamb/Downloads/services' LIMIT 10;

The Issue

You were getting the error for two reasons:

1. Direct file path queries (SELECT * FROM '/path') don't support
   CSV-specific options like truncated_rows. This is a separate limitation
   that this PR does not address.
2. Option syntax: make sure to use 'truncated_rows' 'true' (with quotes
   around true), not 'truncated_rows' true.

Testing

I created files with exactly 17 vs 20 columns and confirmed:

- ✅ External table with OPTIONS ('truncated_rows' 'true') works perfectly:
  it merges schemas and fills missing columns with NULL
- ❌ Direct path queries still fail with the same error you saw
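
To make the "merges schemas and fills missing columns with NULL" behavior concrete, here is a minimal Python sketch of the idea. It is not DataFusion code: the helpers `merge_schemas` and `project_rows` are hypothetical, and matching columns by name is an assumption (it follows the suggestion in the quoted comment; when the narrower file's columns are a prefix of the wider file's, name-based and position-based matching give the same result).

```python
from typing import List, Optional

Row = List[Optional[str]]


def merge_schemas(headers: List[List[str]]) -> List[str]:
    """Union the column names of several files, keeping first-seen order."""
    merged: List[str] = []
    for header in headers:
        for name in header:
            if name not in merged:
                merged.append(name)
    return merged


def project_rows(header: List[str], rows: List[Row], merged: List[str]) -> List[Row]:
    """Re-project rows onto the merged schema, using None (NULL) for columns
    that the originating file does not have."""
    index = {name: i for i, name in enumerate(header)}
    return [
        [row[index[name]] if name in index else None for name in merged]
        for row in rows
    ]


# Hypothetical stand-ins for the 17- and 20-column files, shrunk to 2 and 3 columns.
narrow_header = ["id", "name"]
wide_header = ["id", "name", "region"]

merged = merge_schemas([narrow_header, wide_header])
narrow_rows = project_rows(narrow_header, [["1", "a"]], merged)
wide_rows = project_rows(wide_header, [["2", "b", "eu"]], merged)
```

Here the row from the narrower file gains a trailing None for `region`, while the row from the wider file is unchanged.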

Summary

This PR does fix the core issue: CSV schema merging with different column
counts works via external tables. The remaining limitation is that direct
file path queries don't yet support format-specific options.

If you think direct path query support is important for this PR, I'm happy
to investigate adding that functionality here. It would involve enhancing
how DataFusion handles table resolution for file paths to pass through
CSV-specific options. Otherwise, try the external table approach with
proper option syntax and it should work!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]