alamb opened a new issue, #17516: URL: https://github.com/apache/datafusion/issues/17516
### Describe the bug

I was trying to see how fast the DataFusion CSV parser was by using the example from https://duckdb.org/2025/09/08/duckdb-on-the-framework-laptop-13, but DataFusion refused to load it for a few reasons.

### To Reproduce

```shell
wget https://blobs.duckdb.org/nl-railway/railway-services-80-months.zip
unzip railway-services-80-months.zip
```

Then run

```shell
datafusion-cli
```

```sql
andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ datafusion-cli
DataFusion CLI v49.0.2
> select * from 'services';
Arrow error: Csv error: incorrect number of fields for line 1, expected 17 got 20
```

### Expected behavior

I expect the directory to be treated as a single table, correctly.

Note that selecting from individual files works fine:

```sql
> select * from 'services/services-2025-01.csv' limit 10;
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+-------------------+---------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
| Service:RDT-ID | Service:Date | Service:Type | Service:Company | Service:Train number | Service:Completely cancelled | Service:Partly cancelled | Service:Maximum delay | Stop:RDT-ID | Stop:Station code | Stop:Station name | Stop:Arrival time   | Stop:Arrival delay | Stop:Arrival cancelled | Stop:Departure time       | Stop:Departure delay | Stop:Departure cancelled |
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+-------------------+---------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
| 15122556       | 2025-01-07   | Intercity    | NS              | 3022                 | false                        | false                    | 0                     | 136517277   | AMRN              | Alkmaar Noord     | 2025-01-07T08:52:00 | 1                  | false                  | 2025-01-07T09:52:00+01:00 | 1                    | false                    |
...
| 15122557       | 2025-01-07   | Sprinter     | NS              | 6929                 | false                        | false                    | 0                     | 136517286   | ZTM               | Zoetermeer        | 2025-01-07T06:55:00 | 0                  | false                  | 2025-01-07T07:55:00+01:00 | 0                    | false                    |
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+-------------------+---------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
10 row(s) fetched.
Elapsed 0.019 seconds.
```

Also, if you remove the few most recent files:

```shell
rm services/services-2025-04.csv
rm services/services-2025-05.csv
rm services/services-2025-06.csv
rm services/services-2025-07.csv
rm services/services-2025-08.csv
```

it works fine:

```sql
> select * from 'services' limit 10;
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+--------------------------+---------------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
| Service:RDT-ID | Service:Date | Service:Type | Service:Company | Service:Train number | Service:Completely cancelled | Service:Partly cancelled | Service:Maximum delay | Stop:RDT-ID | Stop:Station code | Stop:Station name        | Stop:Arrival time         | Stop:Arrival delay | Stop:Arrival cancelled | Stop:Departure time       | Stop:Departure delay | Stop:Departure cancelled |
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+--------------------------+---------------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
| 15566803       | 2025-03-15   | Intercity    | NS              | 3970                 | false                        | false                    | 0                     | 140690195   | UT                | Utrecht Centraal         | 2025-03-15T20:17:00+01:00 | 0                  | false                  | 2025-03-15T20:19:00+01:00 | 0                    | false                    |
...
| 15566803       | 2025-03-15   | Intercity    | NS              | 3970                 | false                        | false                    | 0                     | 140690204   | BKF               | Bovenkarspel Flora       | 2025-03-15T21:49:00+01:00 | 0                  | false                  | 2025-03-15T21:49:00+01:00 | 0                    | false                    |
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+--------------------------+---------------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
10 row(s) fetched.
Elapsed 0.015 seconds.
```

```sql
> copy 'services' to 'services.parquet';
+-----------+
| count     |
+-----------+
| 135465619 |
+-----------+
1 row(s) fetched.
Elapsed 9.661 seconds.
```

### Additional context

I suspect that the issue is that 'services/services-2025-07.csv' and others have 20 columns while the other files have 17 columns, and for some reason the CSV format is not adapting the schema the same way the Parquet format can.

```sql
> describe 'services/services-2024.csv';
+------------------------------+-----------+-------------+
| column_name                  | data_type | is_nullable |
+------------------------------+-----------+-------------+
| Service:RDT-ID               | Int64     | YES         |
| Service:Date                 | Date32    | YES         |
| Service:Type                 | Utf8      | YES         |
| Service:Company              | Utf8      | YES         |
| Service:Train number         | Int64     | YES         |
| Service:Completely cancelled | Boolean   | YES         |
| Service:Partly cancelled     | Boolean   | YES         |
| Service:Maximum delay        | Int64     | YES         |
| Stop:RDT-ID                  | Int64     | YES         |
| Stop:Station code            | Utf8      | YES         |
| Stop:Station name            | Utf8      | YES         |
| Stop:Arrival time            | Utf8      | YES         |
| Stop:Arrival delay           | Utf8      | YES         |
| Stop:Arrival cancelled       | Utf8      | YES         |
| Stop:Departure time          | Utf8      | YES         |
| Stop:Departure delay         | Utf8      | YES         |
| Stop:Departure cancelled     | Utf8      | YES         |
+------------------------------+-----------+-------------+
17 row(s) fetched.
Elapsed 0.008 seconds.

> describe 'services/services-2025-07.csv';
+------------------------------+-----------+-------------+
| column_name                  | data_type | is_nullable |
+------------------------------+-----------+-------------+
| Service:RDT-ID               | Int64     | YES         |
| Service:Date                 | Date32    | YES         |
| Service:Type                 | Utf8      | YES         |
| Service:Company              | Utf8      | YES         |
| Service:Train number         | Int64     | YES         |
| Service:Completely cancelled | Boolean   | YES         |
| Service:Partly cancelled     | Boolean   | YES         |
| Service:Maximum delay        | Int64     | YES         |
| Stop:RDT-ID                  | Int64     | YES         |
| Stop:Station code            | Utf8      | YES         |
| Stop:Station name            | Utf8      | YES         |
| Stop:Arrival time            | Utf8      | YES         |
| Stop:Arrival delay           | Utf8      | YES         |
| Stop:Arrival cancelled       | Utf8      | YES         |
| Stop:Departure time          | Utf8      | YES         |
| Stop:Departure delay         | Utf8      | YES         |
| Stop:Departure cancelled     | Utf8      | YES         |
| Stop:Platform change         | Boolean   | YES         | <-- New columns!
| Stop:Planned platform        | Utf8      | YES         |
| Stop:Actual platform         | Utf8      | YES         |
+------------------------------+-----------+-------------+
20 row(s) fetched.
Elapsed 0.005 seconds.
```
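
To check outside of DataFusion which files in the directory carry the wider header, one option is to group the CSV files by their header row. The following is only a minimal sketch and not part of DataFusion: the function name `group_files_by_header` is hypothetical, and it assumes each file is comma-delimited, UTF-8 encoded, and has its column names on the first line.

```python
import csv
from collections import defaultdict
from pathlib import Path


def group_files_by_header(directory):
    """Map each distinct header (tuple of column names) to the CSV files using it.

    Hypothetical helper for illustration; assumes the first line of every
    file is a comma-delimited header row.
    """
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob("*.csv")):
        # utf-8-sig also accepts plain UTF-8 and drops a leading BOM if present
        with path.open(newline="", encoding="utf-8-sig") as f:
            header = next(csv.reader(f), [])  # empty file -> empty header
        groups[tuple(header)].append(path.name)
    return dict(groups)


if __name__ == "__main__":
    # "services" is the directory created by unzipping the archive
    for header, files in group_files_by_header("services").items():
        print(f"{len(header)} columns: {len(files)} file(s), first: {files[0]}")
```

If the suspicion is right, this should report one group of files with 17 columns and another with 20, which would match the `expected 17 got 20` error.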
