>
> i think there's a good case for turning it on as (a) there are lots of
> other filesystems out there, including NTFS on windows laptops, *and*
> there's the risk of corruption of data in flight between the hdfs data node
> processes where the CRC checks take place and the actual reader code.


Yep, agreed, it seems reasonable. Like I said, I'm not sure about the actual
performance implications.
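
For anyone who wants a rough feel for the cost, here is a minimal sketch, not
parquet-mr's actual code path. It assumes the page CRC is the standard CRC32
(the same one GZip uses) over the serialized page bytes, which is how I read
the comment in parquet.thrift. The helper names (page_crc, verify_page,
time_verification) and the synthetic 1 MiB "page" are made up for illustration.

```python
import time
import zlib


def page_crc(page_bytes: bytes) -> int:
    """Standard CRC32 of the page bytes, as an unsigned 32-bit value."""
    return zlib.crc32(page_bytes) & 0xFFFFFFFF


def verify_page(page_bytes: bytes, expected_crc: int) -> bool:
    """Return True if the recomputed CRC matches the stored one."""
    return page_crc(page_bytes) == expected_crc


def time_verification(page_bytes: bytes, expected_crc: int, rounds: int = 100) -> float:
    """Average seconds per verification over `rounds` runs (hypothetical micro-benchmark)."""
    start = time.perf_counter()
    for _ in range(rounds):
        verify_page(page_bytes, expected_crc)
    return (time.perf_counter() - start) / rounds


# Synthetic stand-in for a serialized page: 1 MiB of repeating bytes.
page = bytes(range(256)) * 4096
stored_crc = page_crc(page)

# Simulate in-flight corruption by flipping a single bit.
corrupted = bytearray(page)
corrupted[1000] ^= 0x01

if __name__ == "__main__":
    print("intact page verifies:", verify_page(page, stored_crc))
    print("corrupted page verifies:", verify_page(bytes(corrupted), stored_crc))
    print("avg seconds per 1 MiB check:", time_verification(page, stored_crc))
```

The timing will vary by machine and says nothing about the JVM implementation,
so treat it only as a rough order-of-magnitude check.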

On Fri, Dec 2, 2022 at 8:22 AM Steve Loughran <[email protected]>
wrote:

> thanks for that, especially the link
>
> as the docs say
>    * If enabled, this allows for disabling checksumming in HDFS if only a few
>    * pages need to be read.
>
> i think there's a good case for turning it on as (a) there are lots of
> other filesystems out there, including NTFS on windows laptops, *and*
> there's the risk of corruption of data in flight between the hdfs data node
> processes where the CRC checks take place and the actual reader code.
>
>
> On Thu, 1 Dec 2022 at 05:53, Micah Kornfield <[email protected]>
> wrote:
>
> > Hi Steve,
> >
> > > 1. What data in a parquet file is covered by CRC checks, and are there
> > >    any blocks of data (footers, summaries etc) which aren't checksummed?
> >
> >
> >
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L642
> > has the best summary, I think. My understanding is that the thrift metadata
> > itself is not checksummed.
> >
> > > 2. I see that verification, as set
> > >    by "parquet.page.verify-checksum.enabled", is false by default. Why isn't it
> > >    on? Is there a significant performance hit?
> >
> > Sorry I don't know the answer to this.
> >
> > On Mon, Nov 14, 2022 at 3:39 AM Steve Loughran <[email protected]>
> > wrote:
> >
> > > hi
> > >
> > > I am busy dealing with a bug where the Azure abfs connector can get the
> > > prefetch data blocks of one thread/task overwritten by those of another
> > > task whose input stream was closed while a prefetch was in progress.
> > > https://issues.apache.org/jira/browse/HADOOP-18521
> > >
> > > I have not been able to trigger any failures reading parquet data,
> > > presumably because its seek-heavy read patterns don't benefit from
> > > prefetching much.
> > >
> > > Parquet also stores CRC checksums of pages of data written, which I need a
> > > bit of help understanding.
> > >
> > >
> > >    1. What data in a parquet file is covered by CRC checks, and are there
> > >    any blocks of data (footers, summaries etc) which aren't checksummed?
> > >    2. I see that verification, as set
> > >    by "parquet.page.verify-checksum.enabled", is false by default. Why isn't it
> > >    on? Is there a significant performance hit?
> > >
> > >
> > > Thanks
> > >
> > > steve
> > >
> >
>
