thanks for that, especially the link

as the docs say
   * If enabled, this allows for disabling checksumming in HDFS if only a few
   * pages need to be read.

i think there's a good case for turning it on as (a) there are lots of
other filesystems out there, including NTFS on windows laptops, *and*
(b) there's the risk of corruption of data in flight between the hdfs data
node processes where the CRC checks take place and the actual reader code.
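
to make the in-flight case concrete, here is a minimal sketch (python, not
parquet-mr's actual code) of what a reader-side page check amounts to. it
assumes the page CRC is the standard CRC-32 (the gzip/zlib polynomial)
computed over the serialized page bytes; page_crc and verify_page are
hypothetical names used only for illustration.

```python
import zlib


def page_crc(page_bytes: bytes) -> int:
    """CRC-32 of the page bytes, as an unsigned 32-bit value."""
    return zlib.crc32(page_bytes) & 0xFFFFFFFF


def verify_page(page_bytes: bytes, expected_crc: int) -> bool:
    """True if the bytes the reader received still match the stored CRC."""
    return page_crc(page_bytes) == expected_crc


# Writer side: the checksum is computed once and stored with the page.
page = b"example page payload" * 4
stored_crc = page_crc(page)

# Simulate a single bit flipped somewhere between storage and the reader.
damaged = bytearray(page)
damaged[10] ^= 0x01
damaged = bytes(damaged)

print(verify_page(page, stored_crc))     # intact page passes
print(verify_page(damaged, stored_crc))  # damaged page is rejected
```

a check like this runs in the reader process itself, so it covers the hop
that a checksum verified only inside the datanode or the filesystem cannot.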


On Thu, 1 Dec 2022 at 05:53, Micah Kornfield <[email protected]> wrote:

> Hi Steve,
>
> >    1. What data in a parquet file is covered by CRC checks, and are there
> >    any blocks of data (footers, summaries etc) which aren't checksummed?
>
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L642
> I think has the best summary.  My understanding is that the thrift metadata
> is not itself checksummed
>
> >    2. I see that verification as set
> >    by "parquet.page.verify-checksum.enabled" is false by default. Why
> >    isn't it on? is there a significant performance hit.
>
> Sorry I don't know the answer to this.
>
> On Mon, Nov 14, 2022 at 3:39 AM Steve Loughran <[email protected]>
> wrote:
>
> > hi
> >
> > I am busy dealing with a bug where the Azure abfs connector can get the
> > prefetch data blocks of one thread/task overwritten by those of another
> > task whose input stream was closed while a prefetch was in progress.
> > https://issues.apache.org/jira/browse/HADOOP-18521
> >
> > I have not been able to trigger any failures reading parquet data,
> > presumably because its seek-heavy read patterns don't benefit from
> > prefetching much.
> >
> > Parquet also stores CRC checksums of pages of data written -which I need
> > a bit of help understanding.
> >
> >
> >    1. What data in a parquet file is covered by CRC checks, and are there
> >    any blocks of data (footers, summaries etc) which aren't checksummed?
> >    2. I see that verification as set
> >    by "parquet.page.verify-checksum.enabled" is false by default. Why
> >    isn't it on? is there a significant performance hit.
> >
> >
> > Thanks
> >
> > steve
> >
>
