Hi,

I've been trying for a while to read data from a Parquet file into a stream
using the parquet::StreamReader class. The first column of my data consists
of int64s, so I have been streaming the data as follows:

    std::shared_ptr<arrow::io::ReadableFile> infile;
    PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(datapath));
    parquet::StreamReader stream{ parquet::ParquetFileReader::Open(infile) };

    int64_t c1;

    while (!stream.eof()) {
        stream >> c1;              // read the first column (int64)
        stream.SkipColumns(100);   // skip up to 100 further columns in this row
        stream >> parquet::EndRow;

        std::cout << c1 << std::endl;
    }

My code throws a ParquetException in the CheckColumn() function when
comparing length and node->type_length() [stream_reader.cc, Line 543]:

  if (length != node->type_length()) {
    throw ParquetException("Column length mismatch.  Column '" + node->name() +
                           "' has length " + std::to_string(node->type_length()) +
                           "] not " + std::to_string(length));
  }

I figured out that this happens because there are empty data fields in my
Parquet file, so length is 0 while node->type_length() is 64. I've searched
extensively for a way to properly handle empty values in Parquet files
using Arrow, but have had no luck. Is there a way to check whether a data
field is empty with a parquet::StreamReader object, or some other way to
manage empty fields?

Any help would be appreciated.
