andrei-ionescu opened a new issue #1383:
URL: https://github.com/apache/arrow-datafusion/issues/1383


   **Describe the bug**
   
   Reading wide and nested parquet files results in an `index out of bounds` error, as seen below:
   
   ```
   thread 'main' panicked at 'index out of bounds: the len is 17 but the index is 17', /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:272:13
   ```
   
   **To Reproduce**
   
   1. Download the attached zipped parquet file and unzip it: [wide_schema_1row.parquet.zip](https://github.com/apache/arrow-datafusion/files/7621520/wide_schema_1row.parquet.zip)
   2. Place it in a `./data` folder
   3. Execute the following code:
   
   ```rust
   use datafusion::prelude::*;

   // Runs inside an async fn that returns datafusion::error::Result<()>.
   let mut ctx = ExecutionContext::new();
   let df = ctx.read_parquet("./data/wide_schema_1row.parquet").await?;
   df.show().await
   ```
   
   4. The result is an `index out of bounds` panic:
   
   ```
   thread 'main' panicked at 'index out of bounds: the len is 17 but the index is 17', /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:272:13
   stack backtrace:
      0: rust_begin_unwind
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panicking.rs:498:5
      1: core::panicking::panic_fmt
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:107:14
      2: core::panicking::panic_bounds_check
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:75:5
      3: <usize as core::slice::index::SliceIndex<[T]>>::index_mut
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/slice/index.rs:190:14
      4: core::slice::index::<impl core::ops::index::IndexMut<I> for [T]>::index_mut
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/slice/index.rs:26:9
      5: <alloc::vec::Vec<T,A> as core::ops::index::IndexMut<I>>::index_mut
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/mod.rs:2540:9
      6: datafusion::datasource::file_format::parquet::fetch_metadata
                at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:272:13
      7: <datafusion::datasource::file_format::parquet::ParquetFormat as datafusion::datasource::file_format::FileFormat>::infer_schema::{{closure}}
                at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:96:27
      8: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
      9: <core::pin::Pin<P> as core::future::future::Future>::poll
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/future.rs:119:9
     10: datafusion::datasource::listing::table::ListingOptions::infer_schema::{{closure}}
                at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/listing/table.rs:99:27
     11: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
     12: datafusion::logical_plan::builder::LogicalPlanBuilder::scan_parquet_with_name::{{closure}}
                at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/logical_plan/builder.rs:287:31
     13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
     14: datafusion::logical_plan::builder::LogicalPlanBuilder::scan_parquet::{{closure}}
                at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/logical_plan/builder.rs:255:9
     15: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
     16: datafusion::execution::context::ExecutionContext::read_parquet::{{closure}}
                at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/execution/context.rs:403:13
     17: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
     18: read_parquet::main::{{closure}}
                at ./src/main.rs:79:14
     19: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
     20: tokio::park::thread::CachedParkThread::block_on::{{closure}}
                at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/park/thread.rs:263:54
     21: tokio::coop::with_budget::{{closure}}
                at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/coop.rs:106:9
     22: std::thread::local::LocalKey<T>::try_with
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/thread/local.rs:399:16
     23: std::thread::local::LocalKey<T>::with
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/thread/local.rs:375:9
     24: tokio::coop::with_budget
                at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/coop.rs:99:5
     25: tokio::coop::budget
                at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/coop.rs:76:5
     26: tokio::park::thread::CachedParkThread::block_on
                at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/park/thread.rs:263:31
     27: tokio::runtime::enter::Enter::block_on
                at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/enter.rs:151:13
     28: tokio::runtime::thread_pool::ThreadPool::block_on
                at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/thread_pool/mod.rs:77:9
     29: tokio::runtime::Runtime::block_on
                at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/mod.rs:463:43
     30: read_parquet::main
                at ./src/main.rs:80:5
     31: core::ops::function::FnOnce::call_once
                at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/ops/function.rs:227:5
   ```
   
   **Expected behavior**
   
   The parquet file should be read properly, without panicking.
   
   **Additional context**
   
   After some debugging, the error appears to happen in the `fetch_statistics` function on master (the 6.0.0 backtrace above shows the panic in `fetch_metadata`). More precisely, `schema.fields().len()` ([datasource/file_format/parquet.rs#L261](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/datasource/file_format/parquet.rs#L261)) returns only the top-level fields, while `row_group_meta.columns()` ([datasource/file_format/parquet.rs#L276-L277](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/datasource/file_format/parquet.rs#L276-L277)) returns all leaf columns.
   
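   The given parquet file has 17 top-level fields and about 262 leaves, so indexing a 17-element vector by leaf column position fails as soon as the index reaches 17, which matches the panic message.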
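
The mismatch between top-level fields and leaf columns can be illustrated with a minimal, standard-library-only sketch. It is not DataFusion code: the `Field` enum and the functions `leaf_count`, `total_leaves`, and `first_out_of_bounds` are hypothetical names introduced here, and the buffer sized by `schema.len()` only models the assumption that per-column statistics are allocated per top-level field but indexed per leaf column.

```rust
/// Hypothetical stand-in for a schema field: either a primitive leaf
/// or a struct with child fields. Not an Arrow or DataFusion type.
enum Field {
    Leaf,
    Struct(Vec<Field>),
}

/// Counts the leaf columns under one field. A Parquet row group stores
/// one column chunk per leaf, not one per top-level field.
fn leaf_count(field: &Field) -> usize {
    match field {
        Field::Leaf => 1,
        Field::Struct(children) => children.iter().map(leaf_count).sum(),
    }
}

/// Total number of leaf columns across the whole schema.
fn total_leaves(schema: &[Field]) -> usize {
    schema.iter().map(leaf_count).sum()
}

/// Returns the first leaf index that does not fit in a buffer sized by
/// the number of top-level fields, or `None` if every leaf index fits.
/// `get` is used instead of `[]` so the sketch reports the bad index
/// rather than panicking.
fn first_out_of_bounds(schema: &[Field]) -> Option<usize> {
    let per_field_stats = vec![0usize; schema.len()];
    (0..total_leaves(schema)).find(|&i| per_field_stats.get(i).is_none())
}
```

For any schema with 17 top-level fields and more than 17 leaves, `first_out_of_bounds` returns `Some(17)` under these assumptions. For a flat schema, where every top-level field is itself a leaf, it returns `None`, which would explain why only nested files trigger the panic.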
   
   DataFusion is `6.0`
   Rust is `1.58.0-nightly (65c55bf93 2021-11-23)`
   Cargo is `1.58.0-nightly (e1fb17631 2021-11-22)`
   
   

