[ 
https://issues.apache.org/jira/browse/ARROW-16642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542245#comment-17542245
 ] 

Weston Pace commented on ARROW-16642:
-------------------------------------

You might need to provide a few more details on how you are reading the Parquet 
file.  I used the Python script you provided to create a file 
{{/home/pace/test.parquet}}, which I then tested with this script:

{noformat}
#include <iostream>
#include <memory>
#include <vector>

#include "arrow/filesystem/api.h"
#include "arrow/record_batch.h"

#include "parquet/api/reader.h"
#include "parquet/arrow/reader.h"

int main() {
  auto fs = std::make_unique<arrow::fs::LocalFileSystem>();
  auto input_file = fs->OpenInputFile("/home/pace/test.parquet").ValueOrDie();

  std::unique_ptr<parquet::arrow::FileReader> file_reader;
  arrow::Status st = parquet::arrow::FileReader::Make(
      arrow::default_memory_pool(),
      parquet::ParquetFileReader::Open(input_file), &file_reader);
  if (!st.ok()) {
    std::cerr << "Error making file reader: " << st << std::endl;
    return -1;
  }
  std::vector<int> parquet_column_ids = {0, 1, 2, 3, 4, 5, 6};
  std::cout << "The file has " << file_reader->num_row_groups() << " row groups"
            << std::endl;
  for (int row_group_idx = 0; row_group_idx < file_reader->num_row_groups();
       row_group_idx++) {
    std::cout << "Reading row group: " << row_group_idx << std::endl;
    std::shared_ptr<arrow::RecordBatchReader> record_batch_reader;
    // Apply reader settings before creating the batch reader so that they
    // take effect (these mirror the reporter's settings).
    file_reader->set_batch_size(65536);
    file_reader->set_use_threads(true);
    st = file_reader->GetRecordBatchReader({row_group_idx}, parquet_column_ids,
                                           &record_batch_reader);
    if (!st.ok()) {
      std::cerr << "Error creating record batch reader: " << st << std::endl;
      return -3;
    }
    std::shared_ptr<arrow::RecordBatch> batch;
    while (true) {
      st = record_batch_reader->ReadNext(&batch);
      if (st.ok()) {
        if (!batch) {
          // Reached the end of the row group
          break;
        }
        std::cout << "  Read in record batch with " << batch->num_rows()
                  << " rows" << std::endl;
      } else {
        std::cerr << "Error encountered reading record batch: " << st
                  << std::endl;
        return -2;
      }
    }
  }
}
{noformat}

I did not get any errors and got the expected output:

{noformat}
The file has 1 row groups
Reading row group: 0
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 65536 rows
  Read in record batch with 50880 rows
{noformat}

Does my test program work in your environment?

> [C++] An Error Occured While Reading Parquet File Using C++ - 
> GetRecordBatchReader -Corrupt snappy compressed data. 
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16642
>                 URL: https://issues.apache.org/jira/browse/ARROW-16642
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 8.0.0
>         Environment: C++,arrow 7.0.0 ,snappy 1.1.8, arrow 8.0.0
> pyarrow 7.0.0 ubuntu 9.4.0  python3.8,
>            Reporter: yurikoomiga
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: test_std_02.py
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi all,
> When I read a Parquet file with Arrow as follows:
> {code:java}
> auto st = parquet::arrow::FileReader::Make(
>     arrow::default_memory_pool(),
>     parquet::ParquetFileReader::Open(_parquet, _properties), &_reader);
> arrow::Status status = _reader->GetRecordBatchReader(
>     {_current_group}, _parquet_column_ids, &_rb_batch);
> _reader->set_batch_size(65536);
> _reader->set_use_threads(true);
> status = _rb_batch->ReadNext(&_batch);
> {code}
> The status is not OK, and an error occurred:
> {code:java}
> IOError: Corrupt snappy compressed data. {code}
> When I comment out this statement
> {code:java}
> _reader->set_use_threads(true);
> {code}
> the program runs normally and I can read the Parquet file without error.
> The error occurs only when I read multiple columns with 
> _reader->set_use_threads(true); reading a single column does not trigger it.
> The test Parquet file is created by pyarrow; it has a single row group with 
> 3,000,000 records.
> The file has 20 columns, including int and string types.
> You can create a test Parquet file with the attached Python script.
> In my case, I read the columns at indexes 0, 1, 2, 3, 4, 5, and 6.
> I read the file using C++, arrow 7.0.0, snappy 1.1.8.
> I write the file using Python 3.8, pyarrow 7.0.0.
> Looking forward to your reply
> Thank you!
> [~apitrou] 
> [~westonpace] 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
