Hi,
I've been trying for a while to read data from a Parquet file into a stream
using the parquet::StreamReader class. The first column of my data
consists of int64 values, so I have been streaming the data as follows:
    std::shared_ptr<arrow::io::ReadableFile> infile;
    PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(datapath));
    parquet::StreamReader stream{ parquet::ParquetFileReader::Open(infile) };
    int64_t c1;
    while (!stream.eof()) {
        stream >> c1;             // read the first column
        stream.SkipColumns(100);  // skip up to 100 further columns
        stream >> parquet::EndRow;
        std::cout << c1 << std::endl;
    }
My code throws a ParquetException in the CheckColumn() function when
comparing length and node->type_length() [stream_reader.cc, Line 543]:
    if (length != node->type_length()) {
        throw ParquetException("Column length mismatch. Column '" + node->name() +
                               "' has length " +
                               std::to_string(node->type_length()) +
                               "] not " + std::to_string(length));
    }
I figured out that this was because there are empty data fields in my
Parquet file, meaning length is 0 but node->type_length() is 64. I've looked
all over the internet for a way to properly handle empty values in
Parquet files using Arrow, but have had no luck. Is there a way to check
whether a data field is empty when reading with a parquet::StreamReader
object, or some other way to manage empty fields?
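To make the failure concrete, here is a minimal standalone sketch of the comparison that throws, as I understand it. It is not Arrow's code: check_length and try_check are hypothetical names, std::runtime_error stands in for ParquetException so it compiles without Arrow, and the column name "c1" is just a placeholder.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the length comparison in CheckColumn().
// std::runtime_error replaces ParquetException (assumption, so that
// this compiles without the Arrow/Parquet headers).
void check_length(const std::string& name, int length, int type_length) {
  if (length != type_length) {
    throw std::runtime_error("Column length mismatch. Column '" + name +
                             "' has length " + std::to_string(type_length) +
                             "] not " + std::to_string(length));
  }
}

// Returns the error message, or an empty string when the lengths agree.
std::string try_check(const std::string& name, int length, int type_length) {
  try {
    check_length(name, length, type_length);
    return "";
  } catch (const std::runtime_error& e) {
    return e.what();
  }
}
```

With the values I observed, try_check("c1", 0, 64) returns "Column length mismatch. Column 'c1' has length 64] not 0", while try_check("c1", 0, 0) returns an empty string. So the read only gets past this check when the two lengths are exactly equal.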
Any help would be appreciated.