Hi Steve,

> 1. What data in a parquet file is covered by CRC checks, and are there
> any blocks of data (footers, summaries etc) which aren't checksummed?

I think
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L642
has the best summary. My understanding is that the Thrift metadata is not
itself checksummed.

> 2. I see that verification as set
> by "parquet.page.verify-checksum.enabled" is false by default. Why isn't it
> on? is there a significant performance hit.

Sorry, I don't know the answer to this.

On Mon, Nov 14, 2022 at 3:39 AM Steve Loughran <[email protected]> wrote:

> hi
>
> I am busy dealing with a bug where the Azure abfs connector can get the
> prefetch data blocks of one thread/task overwritten by those of another
> task whose input stream was closed while a prefetch was in progress.
> https://issues.apache.org/jira/browse/HADOOP-18521
>
> I have not been able to trigger any failures reading parquet data,
> presumably because its seek-heavy read patterns don't benefit from
> prefetching much.
>
> Parquet also stores CRC checksums of pages of data written, which I need a
> bit of help understanding.
>
>
>    1. What data in a parquet file is covered by CRC checks, and are there
>    any blocks of data (footers, summaries etc) which aren't checksummed?
>    2. I see that verification as set
>    by "parquet.page.verify-checksum.enabled" is false by default. Why
>    isn't it on? is there a significant performance hit.
>
>
> Thanks
>
> steve
>
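
For illustration, the page-level check discussed above amounts to roughly the following. This is only a minimal Python sketch, not parquet-mr's code. It assumes the checksum is the standard CRC-32 computed over the page bytes as written to disk (excluding the page header), as I read the comment in parquet.thrift, and that the value is stored in the signed i32 `crc` field of the page header. The function names and sample bytes are hypothetical.

```python
import zlib


def page_crc(page_bytes: bytes) -> int:
    """Standard CRC-32 of the page bytes, as an unsigned 32-bit value."""
    return zlib.crc32(page_bytes) & 0xFFFFFFFF


def verify_page(page_bytes: bytes, stored_crc: int) -> bool:
    """Compare a freshly computed CRC with the stored one.

    Assumption: the stored value comes from a signed i32 field, so it may
    be negative; masking converts it to the unsigned form before comparing.
    """
    return page_crc(page_bytes) == (stored_crc & 0xFFFFFFFF)


# Hypothetical page payload; a real reader would use the bytes that follow
# the page header in the column chunk.
page = b"example page payload"
stored = page_crc(page)

# A single flipped byte (e.g. from an overwritten prefetch buffer) is detected.
corrupted = b"exbmple page payload"

print(verify_page(page, stored))       # intact page
print(verify_page(corrupted, stored))  # corrupted page
```

Note that a check of this kind only covers the page payload. Anything outside the pages, such as the Thrift footer, would need a separate integrity mechanism.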
