Vasily Fomin created ARROW-13438:
------------------------------------

             Summary: [C++] Can't use StreamWriter with ToParquetSchema schema
                 Key: ARROW-13438
                 URL: https://issues.apache.org/jira/browse/ARROW-13438
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 4.0.1
            Reporter: Vasily Fomin


Hi there,

First of all, I'm not sure if I'm doing this correctly, as it took a bit of 
reverse engineering to figure this out. 

I'm using Arrow 4.0.1 on Ubuntu with C++.

I followed the streaming example and created:
{code:cpp}
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstring>
#include <ctime>
#include <iomanip>
#include <iostream>
#include <utility>

#include "arrow/io/file.h"
#include "parquet/exception.h"
#include "parquet/stream_reader.h"
#include "parquet/stream_writer.h"

std::shared_ptr<parquet::schema::GroupNode> GetSchema() {
  parquet::schema::NodeVector fields;
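  // Note: the converted type here is NONE; this is the part in question
  // below, after the error output.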
  fields.push_back(parquet::schema::PrimitiveNode::Make(
      "int64_field", parquet::Repetition::OPTIONAL, parquet::Type::INT64,
      parquet::ConvertedType::NONE));

  return std::static_pointer_cast<parquet::schema::GroupNode>(
      parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED,
                                       fields));
}

int main() {
  std::shared_ptr<arrow::io::FileOutputStream> outfile;

  PARQUET_ASSIGN_OR_THROW(
      outfile,
      arrow::io::FileOutputStream::Open("parquet-stream-api-example.parquet"));

  parquet::WriterProperties::Builder builder;
  parquet::StreamWriter os{parquet::ParquetFileWriter::Open(
      outfile, GetSchema(), builder.build())};

  os << int64_t(10);

  return 0;
}
{code}
The code terminates with:
{code}
terminate called after throwing an instance of 'parquet::ParquetException'
  what():  Column converted type mismatch.  Column 'int64_field' has converted type[NONE] not 'INT_64'
{code}
What I'm not sure about is the {{parquet::ConvertedType::NONE}} part. The streaming example passes a converted type even for primitive types, and I wasn't sure whether one is actually necessary. If I do provide it ({{parquet::ConvertedType::INT_64}} instead of {{NONE}}), the code works.
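For reference, a minimal sketch of the schema that does work for me; the only change is the explicit converted type (the function name is just for illustration):
{code:cpp}
// Same schema as above, but with the converted type spelled out for the
// int64 primitive instead of NONE. With this, os << int64_t(10) no longer
// throws the converted type mismatch.
std::shared_ptr<parquet::schema::GroupNode> GetSchemaWithConvertedType() {
  parquet::schema::NodeVector fields;
  fields.push_back(parquet::schema::PrimitiveNode::Make(
      "int64_field", parquet::Repetition::OPTIONAL, parquet::Type::INT64,
      parquet::ConvertedType::INT_64));

  return std::static_pointer_cast<parquet::schema::GroupNode>(
      parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED,
                                       fields));
}
{code}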

Now, to the reverse engineering part. I'm trying to write to Parquet using {{StreamWriter}}. {{StreamWriter}} requires a {{parquet::schema::GroupNode}} as the schema, but I begin with an {{arrow::Schema}}. I [found|https://github.com/apache/arrow/blob/e990d177b1f1dec962315487682f613d46be573c/cpp/src/parquet/arrow/writer.cc#L442] that it can be converted to a {{parquet::SchemaDescriptor}} using the {{parquet::arrow::ToParquetSchema}} utility. Looking at the utility's [implementation|https://github.com/apache/arrow/blob/85f192a45755b3f15653fdc0a8fbd788086e125f/cpp/src/parquet/arrow/schema.cc#L322] I can see that {{logical_type}} is set to {{None}}, which corresponds to {{parquet::ConvertedType::NONE}}, so the converted schema can't be used with {{StreamWriter}} because of the mismatch described above.
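For context, this is roughly the conversion path I'm attempting (just a sketch against the Arrow 4.0.1 headers; the helper name {{FromArrowSchema}} and the property choices are illustrative):
{code:cpp}
#include "arrow/type.h"
#include "parquet/arrow/schema.h"
#include "parquet/exception.h"

// Sketch: turn an arrow::Schema into the GroupNode expected by
// ParquetFileWriter::Open / StreamWriter via parquet::arrow::ToParquetSchema.
std::shared_ptr<parquet::schema::GroupNode> FromArrowSchema(
    const std::shared_ptr<arrow::Schema>& arrow_schema) {
  std::shared_ptr<parquet::SchemaDescriptor> descriptor;
  std::shared_ptr<parquet::WriterProperties> properties =
      parquet::WriterProperties::Builder().build();

  PARQUET_THROW_NOT_OK(parquet::arrow::ToParquetSchema(
      arrow_schema.get(), *properties, &descriptor));

  // The root of the descriptor is a GroupNode, but its int64 columns carry
  // ConvertedType::NONE, which trips the converted-type check in
  // stream_writer.cc once the StreamWriter is used.
  return std::static_pointer_cast<parquet::schema::GroupNode>(
      descriptor->schema_root());
}
{code}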
My questions:
 # Do we need to provide a {{ConvertedType}} even for primitives?
 # Is this a bug in the schema conversion utility or in the [ColumnCheck|https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/stream_writer.cc#L200] assert?
 # Or is it expected behavior? If so, what's the suggested approach: build a Parquet schema directly instead of an Arrow {{Schema}}?

Thank you,

Vasily.



