Thanks for that, especially the link. As the docs say:

  "If enabled, this allows for disabling checksumming in HDFS if only a few
  pages need to be read."
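
For anyone following along, here is a minimal sketch of what per-page verification amounts to. It assumes, going by my reading of the comment in parquet.thrift, that the checksum is the standard CRC-32 (the same one GZip uses) over the page bytes as written, with the page header excluded. The function names and the sample payload are made up for illustration; this is not the parquet-mr code.

```python
import zlib


def page_crc(page_bytes: bytes) -> int:
    """Return the CRC-32 of the page bytes as an unsigned 32-bit value."""
    return zlib.crc32(page_bytes) & 0xFFFFFFFF


def verify_page(page_bytes: bytes, stored_crc: int) -> bool:
    """Compare the recomputed CRC with the one stored alongside the page.

    The stored value is masked because thrift's i32 is signed, so it may
    arrive as a negative number.
    """
    return page_crc(page_bytes) == (stored_crc & 0xFFFFFFFF)


# Hypothetical stand-in for the bytes of one page.
page = b"example page payload"
stored = page_crc(page)

# Flip a single bit to simulate corruption in flight.
corrupted = bytearray(page)
corrupted[3] ^= 0x01

print(verify_page(page, stored))              # intact page
print(verify_page(bytes(corrupted), stored))  # corrupted page
```

The point is that the check runs in the reader process itself, so it covers whatever happens to the bytes after they leave the storage layer.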
I think there's a good case for turning it on, as (a) there are lots of
other filesystems out there, including NTFS on Windows laptops, *and*
(b) there's the risk of corruption of data in flight between the HDFS
datanode processes, where the CRC checks take place, and the actual
reader code.

On Thu, 1 Dec 2022 at 05:53, Micah Kornfield <[email protected]> wrote:

> Hi Steve,
>
> > 1. What data in a parquet file is covered by CRC checks, and are there
> > any blocks of data (footers, summaries etc) which aren't checksummed?
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L642
> I think has the best summary. My understanding is that thrift metadata
> data is not itself checksummed
>
> > 2. I see that verification as set
> > by "parquet.page.verify-checksum.enabled" is false by default. Why
> > isn't it on? is there a significant performance hit.
>
> Sorry I don't know the answer to this.
>
> On Mon, Nov 14, 2022 at 3:39 AM Steve Loughran <[email protected]>
> wrote:
>
> > hi
> >
> > I am busy dealing with a bug where the Azure abfs connector can get the
> > prefetch data blocks of one thread/task overwritten by those of another
> > task whose input stream was closed while a prefetch was in progress.
> > https://issues.apache.org/jira/browse/HADOOP-18521
> >
> > I have not been able to trigger any failures reading parquet data,
> > presumably because its seek-heavy read patterns don't benefit from
> > prefetching much.
> >
> > Parquet also stores CRC checksums of pages of data written -which I
> > need a bit of help understanding.
> >
> > 1. What data in a parquet file is covered by CRC checks, and are there
> > any blocks of data (footers, summaries etc) which aren't checksummed?
> > 2. I see that verification as set
> > by "parquet.page.verify-checksum.enabled" is false by default. Why
> > isn't it on? is there a significant performance hit.
> >
> > Thanks
> >
> > steve
