andrei-ionescu commented on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-985775205


   @houqp After more debugging and fixing several things, I found that the 
physical plan lacks support for nested fields. 
   
   I ran into this error: 
   ```
   Error: ArrowError(SchemaError("Unexpected batch schema from file, expected 36 cols but got 6"))
   ```
   
   The error is raised in these lines of code: 
[physical_plan/file_format/mod.rs#L223-L229](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/file_format/mod.rs#L223-L229).
 The chunk of data that was read has only 6 columns, while the expected 
number of columns is 36.
   
   The root cause seems to be a mismatch between how parquet files are read 
and how the result is projected. The reader reads one top-level nested column 
at a time, and then tries to project that chunk of data over the full schema. 
For example, in the case of `nested_struct.rust.parquet`, it reads the first 
column with its 6 leaves and then tries to project that over all 36 top-level 
columns of that parquet file. This is the root cause of the error above.
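
To make the mismatch concrete, here is a minimal, self-contained sketch. It does
not use DataFusion or Arrow types: `Field`, `leaf_count`, `check_batch`, and
`build_schema` are hypothetical names made up for illustration, and the schema
is simplified to one struct with 6 leaves plus 35 flat columns (the real file's
layout may differ). It only models the counting problem: the batch carries the
6 leaves of the first top-level column, while the check expects one column per
top-level field.

```rust
// Hypothetical, simplified model of a nested schema (not Arrow's types).
enum Field {
    Leaf,
    Struct(Vec<Field>),
}

// Number of leaf (primitive) columns under a field.
fn leaf_count(field: &Field) -> usize {
    match field {
        Field::Leaf => 1,
        Field::Struct(children) => children.iter().map(leaf_count).sum(),
    }
}

// Assumed shape of the failing check: the batch must have exactly as many
// columns as the schema it is being projected onto.
fn check_batch(expected_cols: usize, batch_cols: usize) -> Result<(), String> {
    if expected_cols != batch_cols {
        return Err(format!(
            "Unexpected batch schema from file, expected {} cols but got {}",
            expected_cols, batch_cols
        ));
    }
    Ok(())
}

// Illustrative schema: 36 top-level fields, the first a struct with 6 leaves.
fn build_schema() -> Vec<Field> {
    let first = Field::Struct((0..6).map(|_| Field::Leaf).collect());
    let mut schema = vec![first];
    schema.extend((1..36).map(|_| Field::Leaf));
    schema
}

fn main() {
    let schema = build_schema();
    // The reader yields the leaves of the first top-level column only...
    let batch_cols = leaf_count(&schema[0]);
    // ...but the check compares against the number of top-level fields.
    match check_batch(schema.len(), batch_cols) {
        Ok(()) => println!("batch matches schema"),
        Err(e) => println!("Error: {}", e),
    }
}
```

In this model the check can only pass when leaf columns and top-level fields
are counted the same way, which is not the case once a top-level field is a
struct.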
   
   It seems that DataFusion lacks support for nested fields, at least when 
using the parquet data source.
   
   



