[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643440#comment-16643440 ]
Wes McKinney commented on PARQUET-1438:
---------------------------------------

Are you using the _released_ version of 1.5.0 or some other version? There should be little discrepancy between the code in parquet-cpp 1.5.0 and what's in master now.

> [C++] corrupted files produced on 32-bit architecture (i686)
> ------------------------------------------------------------
>
>                 Key: PARQUET-1438
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1438
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Dmitry Kalinkin
>            Priority: Major
>         Attachments: 32.parquet, 64.parquet
>
>
> I'm using the C++ API to convert some data to parquet files. I've noticed a
> regression when upgrading from arrow-cpp 0.10.0 + parquet-cpp 1.5.0 to
> arrow-cpp 0.11.0. The issue is that I can write parquet files without an
> error, but when I try to read them using pyarrow I get a segfault:
> {noformat}
> #0  0x00007fffd17c7f0f in int arrow::util::RleDecoder::GetBatchWithDictSpaced<float>(float const*, float*, int, int, unsigned char const*, long) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #1  0x00007fffd17c8025 in parquet::DictionaryDecoder<parquet::DataType<(parquet::Type::type)4> >::DecodeSpaced(float*, int, int, unsigned char const*, long) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #2  0x00007fffd17bcf0f in parquet::internal::TypedRecordReader<parquet::DataType<(parquet::Type::type)4> >::ReadRecordData(long) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #3  0x00007fffd17bfbea in parquet::internal::TypedRecordReader<parquet::DataType<(parquet::Type::type)4> >::ReadRecords(long) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #4  0x00007fffd179d2f7 in parquet::arrow::PrimitiveImpl::NextBatch(long, std::shared_ptr<arrow::Array>*) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #5  0x00007fffd1797162 in parquet::arrow::ColumnReader::NextBatch(long, std::shared_ptr<arrow::Array>*) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #6  0x00007fffd179a6e5 in parquet::arrow::FileReader::Impl::ReadSchemaField(int, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Array>*) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #7  0x00007fffd179aaad in parquet::arrow::FileReader::Impl::ReadTable(std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*)::{lambda(int)#1}::operator()(int) const ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> {noformat}
> I have not been able to dig to the bottom of the issue, but it seems like the
> problem reproduces only when I run 32-bit binaries. After I learned that, I
> found that 32-bit and 64-bit builds produce very different parquet files for
> the same data. The sizes of the structures are clearly different if I look at
> their hexdumps. I'm attaching those example files. Reading "32.parquet"
> (produced using i686 binaries) will cause a segfault on macOS and Linux;
> "64.parquet" will read just fine.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)