[ 
https://issues.apache.org/jira/browse/ARROW-16642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim updated ARROW-16642:
--------------------------------
    Summary: [C++] An Error Occurred While Reading Parquet File Using C++ - 
GetRecordBatchReader - Corrupt snappy compressed data.   (was: An Error Occurred 
While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy 
compressed data. )

> [C++] An Error Occurred While Reading Parquet File Using C++ - 
> GetRecordBatchReader - Corrupt snappy compressed data. 
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16642
>                 URL: https://issues.apache.org/jira/browse/ARROW-16642
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 8.0.0
>         Environment: C++, arrow 7.0.0, snappy 1.1.8, arrow 8.0.0,
> pyarrow 7.0.0, ubuntu 9.4.0, python 3.8
>            Reporter: yurikoomiga
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: test_std_02.py
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi All,
> When I read a Parquet file with Arrow as follows:
> ```
> auto st = parquet::arrow::FileReader::Make(
>     arrow::default_memory_pool(),
>     parquet::ParquetFileReader::Open(_parquet, _properties),
>     &_reader);
> arrow::Status status = _reader->GetRecordBatchReader(
>     {_current_group}, _parquet_column_ids, &_rb_batch);
> _reader->set_batch_size(65536);
> _reader->set_use_threads(true);
> status = _rb_batch->ReadNext(&_batch);
> ```
> status is not ok, and the error is:
> `IOError: Corrupt snappy compressed data.`
> When I comment out the statement `_reader->set_use_threads(true);`, the
> program runs normally and I can read the parquet file without problems.
> The error only occurs when I read multiple columns with
> `_reader->set_use_threads(true);`; reading a single column never fails.
> The test parquet file is created by pyarrow. It has only 1 row group, and
> the group has 3000000 records.
> The parquet file has 20 columns, including int and string types.
> You can create a test parquet file with the attached python script.
> In my case, I read the columns at indices 0,1,2,3,4,5,6.
> Reading the file uses C++, arrow 7.0.0, snappy 1.1.8.
> Writing the file uses python 3.8, pyarrow 7.0.0.
> Looking forward to your reply.
> Thank you!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
