Hi all,

From the comments on the Parquet metadata
<https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0>
document,
it appears there's a general consensus on most aspects, with the exception
of the relative 32-bit offsets for column chunks.

I'm starting this thread to discuss this topic further and work towards a
resolution. Adam Reeve suggested raising the limit to 2^32, and he
confirmed that Java has no issues handling unsigned 32-bit values. I am
open to this change, as it increases the limit without introducing any
drawbacks.
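To illustrate the Java point: although Java has no unsigned 32-bit integer type, an offset in the range [2^31, 2^32) can be read into a plain `int` and widened losslessly with `Integer.toUnsignedLong`. The helper name and buffer layout below are hypothetical, just a minimal sketch of how a reader could handle such offsets:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class U32Offset {
    // Hypothetical helper: interpret 4 bytes as an unsigned 32-bit
    // offset, widening to long so values >= 2^31 stay positive.
    static long readU32Offset(ByteBuffer buf) {
        return Integer.toUnsignedLong(buf.getInt());
    }

    public static void main(String[] args) {
        // Encode an offset just below 2^32 (4294967280).
        ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt((int) 0xFFFF_FFF0L); // stored as the signed int -16
        buf.flip();

        long offset = readU32Offset(buf);
        System.out.println(offset); // prints 4294967280
    }
}
```

So raising the offset limit from 2^31 to 2^32 costs Java readers only a widening conversion, not a new integer representation.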

However, some still feel that a 2^32-byte limit for a row group is too
restrictive. I'd like to understand these specific use cases better. From
my perspective, for most engines, the row group is the primary unit of
skipping, making very large row groups less desirable. In our fleet's
workloads, it's rare to see row groups larger than 100MB, as anything
larger tends to make statistics-based skipping ineffective.

Cheers,
