alamb opened a new issue, #17516:
URL: https://github.com/apache/datafusion/issues/17516

   ### Describe the bug
   
   I was trying to see how fast the DataFusion CSV parser is, using the example
   from https://duckdb.org/2025/09/08/duckdb-on-the-framework-laptop-13
   but DataFusion refused to load the data for a few reasons.
   
   ### To Reproduce
   
   ```shell
   wget https://blobs.duckdb.org/nl-railway/railway-services-80-months.zip
   unzip railway-services-80-months.zip
   ```
   
   Then run 
   ```shell
   datafusion-cli
   ```
   
   ```sql
   andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ datafusion-cli
   DataFusion CLI v49.0.2
   > select * from 'services';
   Arrow error: Csv error: incorrect number of fields for line 1, expected 17 got 20
   ```
   
   ### Expected behavior
   
   I expect the directory to be read correctly as a single table.
   
   Note that selecting from individual files works fine:
   
   ```sql
   > select * from 'services/services-2025-01.csv' limit 10;
   
   +----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+-------------------+---------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
   | Service:RDT-ID | Service:Date | Service:Type | Service:Company | Service:Train number | Service:Completely cancelled | Service:Partly cancelled | Service:Maximum delay | Stop:RDT-ID | Stop:Station code | Stop:Station name | Stop:Arrival time   | Stop:Arrival delay | Stop:Arrival cancelled | Stop:Departure time       | Stop:Departure delay | Stop:Departure cancelled |
   +----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+-------------------+---------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
   | 15122556       | 2025-01-07   | Intercity    | NS              | 3022                 | false                        | false                    | 0                     | 136517277   | AMRN              | Alkmaar Noord     | 2025-01-07T08:52:00 | 1                  | false                  | 2025-01-07T09:52:00+01:00 | 1                    | false                    |
   ...
   | 15122557       | 2025-01-07   | Sprinter     | NS              | 6929                 | false                        | false                    | 0                     | 136517286   | ZTM               | Zoetermeer        | 2025-01-07T06:55:00 | 0                  | false                  | 2025-01-07T07:55:00+01:00 | 0                    | false                    |
   +----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+-------------------+---------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
   10 row(s) fetched.
   Elapsed 0.019 seconds.
   ```
   
   Also, if you remove the five most recent files:
   ```shell
   rm services/services-2025-04.csv
   rm services/services-2025-05.csv
   rm services/services-2025-06.csv
   rm services/services-2025-07.csv
   rm services/services-2025-08.csv
   ```
   
   It works fine:
   ```sql
   > select * from 'services' limit 10;
   
   +----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+--------------------------+---------------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
   | Service:RDT-ID | Service:Date | Service:Type | Service:Company | Service:Train number | Service:Completely cancelled | Service:Partly cancelled | Service:Maximum delay | Stop:RDT-ID | Stop:Station code | Stop:Station name        | Stop:Arrival time         | Stop:Arrival delay | Stop:Arrival cancelled | Stop:Departure time       | Stop:Departure delay | Stop:Departure cancelled |
   +----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+--------------------------+---------------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
   | 15566803       | 2025-03-15   | Intercity    | NS              | 3970                 | false                        | false                    | 0                     | 140690195   | UT                | Utrecht Centraal         | 2025-03-15T20:17:00+01:00 | 0                  | false                  | 2025-03-15T20:19:00+01:00 | 0                    | false                    |
   ...
   | 15566803       | 2025-03-15   | Intercity    | NS              | 3970                 | false                        | false                    | 0                     | 140690204   | BKF               | Bovenkarspel Flora       | 2025-03-15T21:49:00+01:00 | 0                  | false                  | 2025-03-15T21:49:00+01:00 | 0                    | false                    |
   +----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+--------------------------+---------------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
   10 row(s) fetched.
   Elapsed 0.015 seconds.
   ```
   
   ```sql
   > copy 'services' to 'services.parquet';
   +-----------+
   | count     |
   +-----------+
   | 135465619 |
   +-----------+
   1 row(s) fetched.
   Elapsed 9.661 seconds.
   ```
   
   ### Additional context
   
   I suspect the issue is that 'services/services-2025-07.csv' and some other
   files have 20 columns while the rest have 17, and for some reason the CSV
   format does not adapt the schema the way the Parquet format can.
   
   ```sql
   > describe 'services/services-2024.csv';
   +------------------------------+-----------+-------------+
   | column_name                  | data_type | is_nullable |
   +------------------------------+-----------+-------------+
   | Service:RDT-ID               | Int64     | YES         |
   | Service:Date                 | Date32    | YES         |
   | Service:Type                 | Utf8      | YES         |
   | Service:Company              | Utf8      | YES         |
   | Service:Train number         | Int64     | YES         |
   | Service:Completely cancelled | Boolean   | YES         |
   | Service:Partly cancelled     | Boolean   | YES         |
   | Service:Maximum delay        | Int64     | YES         |
   | Stop:RDT-ID                  | Int64     | YES         |
   | Stop:Station code            | Utf8      | YES         |
   | Stop:Station name            | Utf8      | YES         |
   | Stop:Arrival time            | Utf8      | YES         |
   | Stop:Arrival delay           | Utf8      | YES         |
   | Stop:Arrival cancelled       | Utf8      | YES         |
   | Stop:Departure time          | Utf8      | YES         |
   | Stop:Departure delay         | Utf8      | YES         |
   | Stop:Departure cancelled     | Utf8      | YES         |
   +------------------------------+-----------+-------------+
   17 row(s) fetched.
   Elapsed 0.008 seconds.
   
   > describe 'services/services-2025-07.csv';
   +------------------------------+-----------+-------------+
   | column_name                  | data_type | is_nullable |
   +------------------------------+-----------+-------------+
   | Service:RDT-ID               | Int64     | YES         |
   | Service:Date                 | Date32    | YES         |
   | Service:Type                 | Utf8      | YES         |
   | Service:Company              | Utf8      | YES         |
   | Service:Train number         | Int64     | YES         |
   | Service:Completely cancelled | Boolean   | YES         |
   | Service:Partly cancelled     | Boolean   | YES         |
   | Service:Maximum delay        | Int64     | YES         |
   | Stop:RDT-ID                  | Int64     | YES         |
   | Stop:Station code            | Utf8      | YES         |
   | Stop:Station name            | Utf8      | YES         |
   | Stop:Arrival time            | Utf8      | YES         |
   | Stop:Arrival delay           | Utf8      | YES         |
   | Stop:Arrival cancelled       | Utf8      | YES         |
   | Stop:Departure time          | Utf8      | YES         |
   | Stop:Departure delay         | Utf8      | YES         |
   | Stop:Departure cancelled     | Utf8      | YES         |
   | Stop:Platform change         | Boolean   | YES         | <-- New columns!
   | Stop:Planned platform        | Utf8      | YES         |
   | Stop:Actual platform         | Utf8      | YES         |
   +------------------------------+-----------+-------------+
   20 row(s) fetched.
   Elapsed 0.005 seconds.
   ```
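
A quick way to see which files carry the extra columns, without going through DataFusion, is to count the fields in each header line. The sketch below is only an illustration: the helper names `header_field_count` and `summarize_header_counts` are hypothetical, and it assumes the header line contains no quoted commas (which holds for the column names shown above).

```shell
# Print the number of comma-separated fields in the first line of a CSV file.
# Assumption: the header has no quoted fields containing commas.
header_field_count() {
    head -n 1 "$1" | awk -F',' '{ print NF }'
}

# For every *.csv file in a directory, print "<field count><TAB><path>",
# sorted by field count so files with a different schema group together.
summarize_header_counts() {
    for f in "$1"/*.csv; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        printf '%s\t%s\n' "$(header_field_count "$f")" "$f"
    done | sort -n
}

# Directory name taken from the reproduction steps above.
summarize_header_counts services
```

If the suspicion above is right, the output should split into one group of files with 17 fields and another with 20.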

