Vasily Fomin created ARROW-13438:
------------------------------------

Summary: [C++] Can't use StreamWriter with ToParquetSchema schema
Key: ARROW-13438
URL: https://issues.apache.org/jira/browse/ARROW-13438
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 4.0.1
Reporter: Vasily Fomin

Hi there,

First of all, I'm not sure if I'm doing this correctly, as it took a bit of reverse engineering to figure this out.

I'm using Arrow 4.0.1 on Ubuntu with C++. I followed the streaming example and created:

{code:cpp}
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstring>
#include <ctime>
#include <iomanip>
#include <iostream>
#include <memory>
#include <utility>

#include "arrow/io/file.h"
#include "parquet/exception.h"
#include "parquet/stream_reader.h"
#include "parquet/stream_writer.h"

// Builds a one-column schema: an optional INT64 field with no converted type.
std::shared_ptr<parquet::schema::GroupNode> GetSchema() {
  parquet::schema::NodeVector fields;

  // ConvertedType::NONE is the part that triggers the exception shown below.
  fields.push_back(parquet::schema::PrimitiveNode::Make(
      "int64_field", parquet::Repetition::OPTIONAL, parquet::Type::INT64,
      parquet::ConvertedType::NONE));

  return std::static_pointer_cast<parquet::schema::GroupNode>(
      parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));
}

int main() {
  std::shared_ptr<arrow::io::FileOutputStream> outfile;

  PARQUET_ASSIGN_OR_THROW(
      outfile, arrow::io::FileOutputStream::Open("parquet-stream-api-example.parquet"));

  parquet::WriterProperties::Builder builder;

  parquet::StreamWriter os{parquet::ParquetFileWriter::Open(outfile, GetSchema(), builder.build())};

  // Throws parquet::ParquetException: the column's converted type is NONE, not INT_64.
  os << int64_t(10);

  return 0;
}
{code}

The code terminates with:

{code:java}
terminate called after throwing an instance of 'parquet::ParquetException'
  what(): Column converted type mismatch. Column 'int64_field' has converted type[NONE] not 'INT_64'
{code}

What I'm not sure about is the {{parquet::ConvertedType::NONE}} part. The example provides a converted type even for primitives, while my understanding was that it shouldn't be necessary. If I do provide it ({{parquet::ConvertedType::INT_64}} in this case, going by the error message), the code works.

Now, to the reverse engineering part. I'm trying to write to Parquet using {{StreamWriter}}.
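As an aside on the exception above, here is how I understand the check that fails, written as a minimal self-contained sketch. Every name in it ({{PhysicalType}}, {{ConvertedType}}, {{ColumnDecl}}, {{CheckColumn}}, {{WriteInt64}}) is a hypothetical stand-in that I made up for illustration; none of it is the actual Arrow/Parquet code, and the real check may well differ in detail. It only models the behavior suggested by the error message: writing an {{int64_t}} seems to require a column declared with converted type {{INT_64}}, not {{NONE}}.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for parquet::Type and parquet::ConvertedType.
// These are NOT the real Arrow declarations.
enum class PhysicalType { INT32, INT64 };
enum class ConvertedType { NONE, INT_32, INT_64 };

// Hypothetical description of one declared column in the schema.
struct ColumnDecl {
  std::string name;
  PhysicalType physical;
  ConvertedType converted;
};

inline const char* ToString(ConvertedType t) {
  switch (t) {
    case ConvertedType::NONE: return "NONE";
    case ConvertedType::INT_32: return "INT_32";
    case ConvertedType::INT_64: return "INT_64";
  }
  return "UNKNOWN";
}

// Assumed shape of the check: both the physical type and the converted type
// of the declared column must match what the written C++ value implies.
inline void CheckColumn(const ColumnDecl& col, PhysicalType physical,
                        ConvertedType converted) {
  if (col.physical != physical) {
    throw std::runtime_error("Column physical type mismatch. Column '" +
                             col.name + "'");
  }
  if (col.converted != converted) {
    throw std::runtime_error("Column converted type mismatch. Column '" +
                             col.name + "' has converted type[" +
                             ToString(col.converted) + "] not '" +
                             ToString(converted) + "'");
  }
}

// Models `os << int64_t(...)`: an int64_t value is assumed to demand
// INT64 / INT_64, so a column declared with NONE is rejected.
inline void WriteInt64(const ColumnDecl& col) {
  CheckColumn(col, PhysicalType::INT64, ConvertedType::INT_64);
}
```

In this model, calling {{WriteInt64}} on a column declared as {{INT64}}/{{INT_64}} passes, while the same call on a column declared as {{INT64}}/{{NONE}} throws with the message I quoted above. If the real check works like this, any schema whose primitive columns carry no converted type would be rejected.
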
{{StreamWriter}} requires a {{parquet::schema::GroupNode}} as the schema, but I begin with an {{arrow::Schema}}. I [found|https://github.com/apache/arrow/blob/e990d177b1f1dec962315487682f613d46be573c/cpp/src/parquet/arrow/writer.cc#L442] that it can be converted to a {{parquet::SchemaDescriptor}} using the {{parquet::arrow::ToParquetSchema}} utility.

Looking at the utility [implementation|https://github.com/apache/arrow/blob/85f192a45755b3f15653fdc0a8fbd788086e125f/cpp/src/parquet/arrow/schema.cc#L322], I can see that {{logical_type}} is set to {{None}}, which corresponds to {{parquet::ConvertedType::NONE}}, and hence the converted schema can't be used because of the issue I described above.

# Do we need to provide {{ConvertedType}} even for primitives?
# Is it a bug in the schema conversion utility or in the [ColumnCheck|https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/stream_writer.cc#L200] assert?
# Or is this expected behavior? If so, what's the suggested approach? Build a Parquet schema instead of an Arrow schema?

Thank you,
Vasily.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)