Good questions! The idea is to be able to skip row groups based on the
index. But if we have to do a full snapshot load, then our wrapper should
actually be doing batched GETs on S3. Why incur 5x more calls?
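Something along these lines is what I have in mind (untested sketch; the
RowGroupRange/RangeCoalescer classes below are made up for illustration,
not actual Hudi or S3 client APIs): coalesce the byte ranges of contiguous
row groups so that a full scan issues one ranged GET per run of row groups
instead of one GET per row group.

    import java.util.ArrayList;
    import java.util.List;

    class RowGroupRange {
      final long offset;
      final long length;
      RowGroupRange(long offset, long length) {
        this.offset = offset;
        this.length = length;
      }
    }

    class RangeCoalescer {
      // Merge row-group byte ranges (sorted by offset) that are adjacent
      // or nearly adjacent in the file, so a full snapshot scan issues one
      // GET per contiguous run of row groups rather than one per row group.
      static List<RowGroupRange> coalesce(List<RowGroupRange> sorted,
                                          long maxGapBytes) {
        List<RowGroupRange> merged = new ArrayList<>();
        for (RowGroupRange r : sorted) {
          if (!merged.isEmpty()) {
            RowGroupRange last = merged.get(merged.size() - 1);
            long lastEnd = last.offset + last.length;
            if (r.offset - lastEnd <= maxGapBytes) {
              // Extend the previous range to cover this row group too.
              merged.set(merged.size() - 1,
                  new RowGroupRange(last.offset,
                      (r.offset + r.length) - last.offset));
              continue;
            }
          }
          merged.add(r);
        }
        return merged;
      }
    }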
As for the update, I think this is in the context of COW, so the footer
will be recomputed anyway; handling updates should not be that tricky.
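To spell out what I mean by "recomputed anyway" (again a rough,
hypothetical sketch; the RowGroupCopier interface is made up, not
parquet-mr or Hudi code): untouched row groups are copied byte-for-byte
together with their existing footer metadata, while the row group that
received updates goes through the normal write path, which regenerates its
stats, dictionary pages and bloom filters, and a fresh footer is written
for the whole new file.

    import java.io.IOException;
    import java.util.Set;

    // Hypothetical interface, just to illustrate the flow.
    interface RowGroupCopier {
      // Binary copy of the row group, reusing its existing footer metadata.
      void copyRawBytes(int rowGroupIndex) throws IOException;
      // Re-encode pages with the updates applied; stats, dictionaries and
      // bloom filters for this row group are recomputed by the writer.
      void rewriteWithUpdates(int rowGroupIndex) throws IOException;
      // Write a fresh footer for the new file: copied metadata for the
      // untouched row groups plus recomputed metadata for the rewritten ones.
      void writeFooter() throws IOException;
    }

    class FastCowRewrite {
      static void rewrite(RowGroupCopier copier, int numRowGroups,
                          Set<Integer> touchedRowGroups) throws IOException {
        for (int rg = 0; rg < numRowGroups; rg++) {
          if (touchedRowGroups.contains(rg)) {
            copier.rewriteWithUpdates(rg); // only these are (de)serialized
          } else {
            copier.copyRawBytes(rg);       // no decode/encode for the rest
          }
        }
        copier.writeFooter();              // footer recomputed for the new file
      }
    }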

Regards,
Sagar

On Thu, Jul 20, 2023 at 3:26 PM nicolas paris <nicolas.pa...@riseup.net>
wrote:

> Hi,
>
> Multiple independent initiatives for fast copy-on-write have emerged
> (correct me if I am wrong):
> 1.
>
> https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md
> 2.
> https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/
>
>
> The idea is to rely on the record-level index (RLI) to target only some
> row groups in a given Parquet file, and to serde only those row groups
> when copying the file.
>
> Currently Hudi generates one row group per Parquet file (and having
> large row groups is what Parquet and others advocate).
>
> The FCOW feature then needs to use several row groups per Parquet file
> to provide any benefit, let's say 30MB each as mentioned in the RFC-68
> discussion.
>
> I have concerns about using small row groups for read performance, such
> as:
> - more S3 throttling: if we have 5x more row groups in a Parquet file,
> then it leads to 5x more GET calls
> - worse read performance: larger row groups generally lead to better
> performance overall
>
>
> As a side question, I wonder how the writer can keep the statistics in
> the Parquet footer correct. If updates occur somewhere, then the
> following structures present in the footer have to be updated
> accordingly:
> - Parquet row group/page statistics
> - Parquet dictionaries
> - Parquet bloom filters
>
> Thanks for your feedback on these points.
>
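PS: on the row group sizing point, my understanding is that it would
roughly translate into writer config along these lines (a sketch only; the
values are placeholders, not a recommendation, and the config keys are the
Hudi ones as I remember them, please double-check):

    import java.util.Properties;

    class FcowSizingSketch {
      // Rough sizing sketch: ~30MB row groups inside ~120MB base files,
      // i.e. roughly 4 row groups per file instead of a single one.
      static Properties rowGroupSizing() {
        Properties props = new Properties();
        props.setProperty("hoodie.parquet.max.file.size",
            String.valueOf(120L * 1024 * 1024)); // target base file size
        props.setProperty("hoodie.parquet.block.size",
            String.valueOf(30L * 1024 * 1024));  // Parquet row group size
        return props;
      }
    }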
