Hi Steve,

> 1. What data in a parquet file is covered by CRC checks, and are there
>    any blocks of data (footers, summaries etc) which aren't checksummed?

I think https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L642
has the best summary. My understanding is that the Thrift metadata
itself is not checksummed.
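
To make the page-level check concrete, here is a rough sketch of how I
read the comment on the crc field: a standard CRC-32 (the same polynomial
gzip uses) computed over the page bytes as written to disk, excluding the
page header. The function names are hypothetical and this is not the
parquet-mr code, just an illustration of the idea under that assumption.

```python
import zlib


def page_crc(page_bytes: bytes) -> int:
    """CRC-32 (gzip polynomial) of the page bytes, as an unsigned 32-bit value.

    Assumption: page_bytes is the page body exactly as stored on disk,
    without the Thrift page header.
    """
    return zlib.crc32(page_bytes) & 0xFFFFFFFF


def verify_page(page_bytes: bytes, header_crc: int) -> bool:
    """Compare the computed CRC with the value taken from the page header.

    The header field is a signed 32-bit integer in Thrift, so it is masked
    to its unsigned form before comparing.
    """
    return page_crc(page_bytes) == (header_crc & 0xFFFFFFFF)
```

A check like this only covers the page body; it says nothing about the
page header or the footer metadata.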

> 2. I see that verification as set by
>    "parquet.page.verify-checksum.enabled" is false by default. Why
>    isn't it on? Is there a significant performance hit?

Sorry, I don't know the answer to this.

On Mon, Nov 14, 2022 at 3:39 AM Steve Loughran <[email protected]>
wrote:

> hi
>
> I am busy dealing with a bug where the Azure abfs connector can get the
> prefetch data blocks of one thread/task overwritten by those of another
> task whose input stream was closed while a prefetch was in progress.
> https://issues.apache.org/jira/browse/HADOOP-18521
>
> I have not been able to trigger any failures reading parquet data,
> presumably because its seek-heavy read patterns don't benefit from
> prefetching much.
>
> Parquet also stores CRC checksums of pages of data written, which I need a
> bit of help understanding.
>
>
>    1. What data in a parquet file is covered by CRC checks, and are there
>    any blocks of data (footers, summaries etc) which aren't checksummed?
>    2. I see that verification as set by
>    "parquet.page.verify-checksum.enabled" is false by default. Why
>    isn't it on? Is there a significant performance hit?
>
>
> Thanks
>
> steve
>
