Whenever Drill encounters a corrupted Parquet file it stops processing the query.
To work around this issue I'm trying to write a simple tool that detects corrupted Parquet files so we can remove them from the pool of files Drill will query. I'm basically doing a HEAD command the way the parquet-tools project does:

https://github.com/Parquet/parquet-mr/blob/master/parquet-tools/src/main/java/parquet/tools/command/HeadCommand.java

    PrintWriter writer = new PrintWriter(Main.out, true);
    reader = new ParquetReader<SimpleRecord>(new Path(input), new SimpleReadSupport());
    for (SimpleRecord value = reader.read(); value != null && num-- > 0; value = reader.read()) {
      value.prettyPrint(writer);
      writer.println();
    }

However, when I run this on a valid Parquet file in HDFS it fails; it works fine if the file is on local disk. The error I get is:

    can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in the serialized data!

I've narrowed the issue down to DFSInputStream.read(ByteBuffer). This method gets called to read the entire file into the ByteBuffer. It works fine when the file is local, in which case FSInputStream.read(ByteBuffer) is used instead, but not when the file is in HDFS: instead of reading the entire file it reads only 64k, and the rest of the ByteBuffer is all 0. I've read that 64k is the default chunk size used by the DFSClient, which seems related.

Any suggestions or ideas why the method does not read all the bytes requested?

Thanks,
Jean-Claude
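P.S. For reference, this is roughly the check I'm experimenting with. It's only a sketch: it uses the GroupReadSupport/Group example classes from parquet-hadoop instead of the SimpleReadSupport class from parquet-tools, and the class and method names (ParquetHealthCheck, looksHealthy, recordsToCheck) are just placeholders. It tries to read the first few records and treats any exception as a sign of corruption.

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;

    public class ParquetHealthCheck {
      // Try to read the first few records of the file; any failure to open or
      // decode it is treated as corruption.
      public static boolean looksHealthy(String file, int recordsToCheck) {
        try (ParquetReader<Group> reader =
                 ParquetReader.builder(new GroupReadSupport(), new Path(file)).build()) {
          for (int i = 0; i < recordsToCheck; i++) {
            if (reader.read() == null) {
              break;  // fewer records than the limit -- that's fine
            }
          }
          return true;
        } catch (IOException | RuntimeException e) {
          return false;  // flag the file so it can be removed from the pool
        }
      }
    }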
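And here is the kind of read loop I'm experimenting with as a workaround, on the assumption that a single read(ByteBuffer) call is allowed to return after fewer bytes than requested and the caller has to keep calling it until the buffer is full or the stream ends. The class and method names (ReadFullyUtil, readFully) are again just placeholders.

    import java.io.EOFException;
    import java.io.IOException;
    import java.nio.ByteBuffer;

    import org.apache.hadoop.fs.FSDataInputStream;

    public class ReadFullyUtil {
      // Keep calling read(ByteBuffer) until the buffer is full or the stream
      // ends; a single call may return after only one chunk of data.
      public static void readFully(FSDataInputStream in, ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) {
          int n = in.read(buf);  // bytes read this call, possibly less than buf.remaining()
          if (n < 0) {
            throw new EOFException(buf.remaining() + " bytes still expected but the stream ended");
          }
        }
      }
    }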