Whenever Drill encounters a corrupted Parquet file, it stops processing the
query.

To work around this, I'm trying to write a simple tool that detects
corrupted Parquet files so we can remove them from the pool of files Drill
queries.

I'm basically doing what the HEAD command in the parquet-tools project does:

https://github.com/Parquet/parquet-mr/blob/master/parquet-tools/src/main/java/parquet/tools/command/HeadCommand.java

PrintWriter writer = new PrintWriter(Main.out, true);
ParquetReader<SimpleRecord> reader =
    new ParquetReader<SimpleRecord>(new Path(input), new SimpleReadSupport());
// read records until EOF or until `num` records have been printed
for (SimpleRecord value = reader.read(); value != null && num-- > 0; value = reader.read()) {
  value.prettyPrint(writer);
  writer.println();
}
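
For reference, input above is just a path string; the HDFS case is addressed
with a URI along these lines (host, port, and path are only examples):

String input = "hdfs://namenode:8020/data/sample.parquet";  // example HDFS location
// String input = "file:///tmp/sample.parquet";             // local file, which works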

However, when I run this on a valid Parquet file in HDFS it fails. It works
fine if the file is on local disk.

I'm getting this error:

  can not read class org.apache.parquet.format.PageHeader: Required field
  'uncompressed_page_size' was not found in the serialized data!

I've narrowed the issue down to DFSInputStream.read(ByteBuffer). This method
gets called to read the entire file into the ByteBuffer. It works fine when
the file is local but not when it is in HDFS; when the file is local,
FSInputStream.read(ByteBuffer) is used instead.

Instead of reading the entire file, it reads only 64k; the rest of the
ByteBuffer is all zeros. I've read that 64k is the default chunk size used
by the DFSClient, which seems related. Any suggestions or ideas as to why
the method does not read all the bytes requested?
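
My working theory is that read(ByteBuffer) simply isn't guaranteed to fill
the buffer in a single call, so the caller has to keep calling it until the
buffer is full. Here is the small standalone check I've been sketching to
confirm that (it assumes the whole file fits in one in-memory buffer and
that the underlying stream supports reading into a ByteBuffer, which
DFSInputStream does):

import java.nio.ByteBuffer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFully {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);            // hdfs:// or file:// URI
    Configuration conf = new Configuration();
    FileSystem fs = path.getFileSystem(conf);
    long len = fs.getFileStatus(path).getLen();

    // assumes the file fits in an int-sized in-memory buffer
    ByteBuffer buf = ByteBuffer.allocate((int) len);

    try (FSDataInputStream in = fs.open(path)) {
      // a single read(ByteBuffer) may return only part of the data
      // (I see ~64k from HDFS), so loop until the buffer is full or EOF
      while (buf.hasRemaining()) {
        int n = in.read(buf);
        if (n < 0) {
          break;
        }
      }
    }
    System.out.println("read " + buf.position() + " of " + len + " bytes");
  }
}

If a single call returns only 64k but the loop eventually fills the buffer,
that would point at the partial-read contract rather than at a corrupted
file.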

Thanks
Jean-Claude
