Hi Naveen

Yes, it sounds like it would help to disable metrics for those columns.
IIRC, by default manifest entries keep metrics at the 'truncate(16)' level
for the first 100 columns, which, as you are seeing, can be quite memory
intensive.  A potential later improvement would be the ability to drop the
counts via config as well, though I need to confirm whether that is feasible.
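
For example, something along these lines should lower the metrics level on
an existing table (the catalog handle, table name, and column name below
are just placeholders):

    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;

    // Assumes an existing Catalog instance named "catalog".
    Table table = catalog.loadTable(TableIdentifier.of("db", "wide_table"));
    table.updateProperties()
        // default metrics mode for every column: "none", "counts",
        // "truncate(16)", or "full"
        .set("write.metadata.metrics.default", "counts")
        // keep bounds only for columns that are still useful for pruning
        .set("write.metadata.metrics.column.event_ts", "truncate(16)")
        .commit();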

Unfortunately, today a new metrics config only applies to newly written data
files (you would have to rewrite them all, or otherwise phase the old data
files out).  I had a patch a while back to add support for rewriting just
the manifests with a new metrics config, but it has not been merged yet; if
any reviewer has time to review it, I can pick it up again.
https://github.com/apache/iceberg/pull/2608

Thanks
Szehon

On Tue, May 21, 2024 at 1:43 AM Naveen Kumar <nk1...@gmail.com> wrote:

> Hi Everyone,
>
> I am looking into RewriteFiles
> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/RewriteFiles.java>
> APIs and its implementation BaseRewriteFiles
> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java>.
> Currently, this works as follows (see the sketch after the list):
>
>    1. It accumulates all the files for additions and deletions.
>    2. At commit time, it creates a new snapshot after adding all the
>    entries to the corresponding manifest files.
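>
> A rough sketch of the calling pattern (variable names are illustrative):
>
>    RewriteFiles rewrite = table.newRewrite();
>    // every DataFile/DeleteFile object stays referenced in memory until
>    // the single commit below
>    rewrite.rewriteFiles(
>        removedDataFiles, removedDeleteFiles, addedDataFiles, addedDeleteFiles);
>    rewrite.commit();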
>
> It has been observed that if the accumulated file objects are large, this
> takes a lot of memory.
> *e.g.*: if each DataFile object is about *1 KB* and the total number of
> accumulated files (additions plus deletions) is *1 million*, then the total
> memory consumed by *RewriteFiles* will be around *1 GB*.
>
> Such a dataset can arise for the following reasons:
>
>    1. The table is very wide, with say 1000 columns.
>    2. Most of the columns are of String data type, which takes more space
>    to store the lower and upper bounds.
>    3. The table has billions of records across millions of data files.
>    4. It is running data compaction procedures/jobs for the first time.
>    5. Or, the table was un-partitioned and later evolved with new
>    partition columns.
>    6. Now it is trying to compact the table.
>
> Attaching a heap dump from one of these datasets, taken while using the API:
>
>> RewriteFiles rewriteFiles(
>>     Set<DataFile> removedDataFiles,
>>     Set<DeleteFile> removedDeleteFiles,
>>     Set<DataFile> addedDataFiles,
>>     Set<DeleteFile> addedDeleteFiles)
>>
>>
> [image: Screenshot 2024-01-11 at 10.01.54 PM.png]
> We do have properties like PARTIAL_PROGRESS_ENABLED_DEFAULT
> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L45C11-L45C43>,
> which helps create smaller file groups and multiple commits, together with
> PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT
> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L53C7-L53C43>.
> Currently, engines like Spark can follow this strategy. But since Spark
> runs all the compaction jobs concurrently, there is a chance that many jobs
> land on the same machines and accumulate high memory usage.
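>
> For illustration, these knobs are set roughly like this on the Spark
> action (the option values here are just placeholders):
>
>    SparkActions.get(spark)  // spark: an existing SparkSession
>        .rewriteDataFiles(table)
>        .option("partial-progress.enabled", "true")
>        .option("partial-progress.max-commits", "10")
>        .option("max-concurrent-file-group-rewrites", "5")
>        .execute();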
>
> My question is: can we improve these implementations
> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L53C7-L53C43>
> to avoid this heap pressure? Also, has anyone encountered similar issues,
> and if so, how did they fix them?
>
> Regards,
> Naveen Kumar
>
>
