Re: [DISCUSS] Handling of existing statistics files

Gábor Kaszab Fri, 03 Jul 2026 04:39:04 -0700

One correction on the breaking change part:
Actually, we'd somewhat break the contract offered by the Java API here
<https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/UpdateStatistics.java#L38>.
It explicitly says that setStatistics is supposed to replace existing stats
for the snapshot.


Additional thought:
When running the compute_table_stats procedure 100 times through Spark, the
expectation is that we end up with a single stat file for the given
snapshot, not 100. Changing this logic is not that straightforward:
how would we know which stat file to keep and which one to replace?
This makes implementing the suggested approach even more complicated.
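
To make the difference concrete, here is a minimal sketch of "replace" versus
"append" semantics. It uses a hypothetical in-memory model, not the real
Iceberg classes: StatsEntry, setReplacing, setAppending and firstFor are
made-up names for illustration, and the snapshot ID and file paths are
arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

public class StatsSemantics {
    // Hypothetical stand-in for one entry of the table metadata 'statistics' list.
    static final class StatsEntry {
        final long snapshotId;
        final String path;

        StatsEntry(long snapshotId, String path) {
            this.snapshotId = snapshotId;
            this.path = path;
        }
    }

    // "Replace" semantics: drop any entry for the same snapshot, then add the new one.
    static void setReplacing(List<StatsEntry> stats, StatsEntry entry) {
        stats.removeIf(e -> e.snapshotId == entry.snapshotId);
        stats.add(entry);
    }

    // "Append" semantics: keep whatever is already in the list.
    static void setAppending(List<StatsEntry> stats, StatsEntry entry) {
        stats.add(entry);
    }

    // A reader that takes the first entry matching the snapshot ID, or null if none.
    static StatsEntry firstFor(List<StatsEntry> stats, long snapshotId) {
        for (StatsEntry e : stats) {
            if (e.snapshotId == snapshotId) {
                return e;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<StatsEntry> replaced = new ArrayList<>();
        List<StatsEntry> appended = new ArrayList<>();
        // Simulate computing stats 100 times for the same snapshot.
        for (int run = 1; run <= 100; run++) {
            StatsEntry entry = new StatsEntry(42L, "stats-run-" + run + ".puffin");
            setReplacing(replaced, entry);
            setAppending(appended, entry);
        }
        System.out.println(replaced.size());               // 1
        System.out.println(appended.size());               // 100
        System.out.println(firstFor(replaced, 42L).path);  // stats-run-100.puffin
        System.out.println(firstFor(appended, 42L).path);  // stats-run-1.puffin
    }
}
```

In this toy model, appending leaves 100 entries for the snapshot, and a reader
that picks the first match ends up with the oldest file unless something
decides which entries to drop.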

Gabor

Gábor Kaszab <[email protected]> wrote (on Fri, Jul 3, 2026,
13:30):

> Hi Dzeri,
>
> I entertained the idea of having multiple statistics files for the same
> snapshot ID and I did some reading on this to figure out what it would take
> to implement. Here are my findings:
>
> *Spec*
> I checked the spec first to see if there are any restrictions with the
> physical representation and apparently there is a good amount of
> flexibility on this front. In table metadata
> <https://iceberg.apache.org/spec/#table-metadata-fields> 'statistics' is
> a list of table statistics, where table statistics
> <https://iceberg.apache.org/spec/#table-statistics> in turn keeps the
> snapshot ID and all the other metadata fields, e.g. path to Puffin file,
> etc.
> There is no constraint preventing multiple entries in the list
> with the same snapshot ID.
>
> *Read path in reference implementation (Java)*
> We use these stats in SparkScan.estimateStatistics(snapshot)
> <https://github.com/apache/iceberg/blob/main/spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L230>
> where we find the first element in the list with the given snapshot ID.
> This would continue to work fine. An issue might occur in ecosystems where
> proprietary writers add proprietary stat objects to this list, so that the
> regular stats Spark expects might not be the first match in the list. In
> that case Spark won't use stats, and a proprietary implementation within
> Spark might be required to find the one it needs. But that's the problem
> of those environments I guess, open source should be fine :)
>
> *Write path in reference implementation (Java)*
> TableMetadata as an in-memory representation keeps a List<Statistics>
> similarly to the recommendation of the spec.
> What prevents this from working today is the Builder for TableMetadata
> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1375>,
> which is stricter than the spec: every time we add new statistics to a
> snapshot, it wipes out the existing ones. Loosening this part would do
> the trick to allow multiple stats files for the same snapshot ID.
>
> *The downsides of such design:*
> 1) It's more complicated to overwrite existing stats
> Currently, you only have to set a new stat to the snapshot
> (UpdateStatistics.setStatistics(statForSnapshotId)) to overwrite the
> existing one, but with the change this would be a 2-step process: first
> UpdateStatistics.removeStatistics(snapshotId).commit() and then
> UpdateStatistics.setStatistics(statForSnapshotId).commit(). It's not even
> possible to do it in one go currently (although it would be feasible to
> implement).
> 2) Number of tracked files growing
> 3) Is it a breaking change to alter the behaviour from "overwrite" to
> "append" when adding stats?
> We wouldn't break anything in the spec, though, just loosen the strictness
> of a reference implementation.
>
> These are my takes, I'd like to hear what others think on this.
>
> Best Regards,
> Gabor Kaszab
>
> dzeri96 <[email protected]> wrote (on Thu, Jul 2, 2026, 13:32):
>
>> Alright, both of you raised some valid points.
>>
>> Gabor drew my attention to the list of "supported" blob types in the spec
>> and it made me realize that the people who implemented deletion vectors
>> kind of had the same problem we're discussing. They had puffin files which
>> could not share the same lifecycle as regular statistics files as it is
>> now, so they stored them in manifest files. This is perfectly reasonable,
>> but it just takes some time to get the mental model right and separate
>> statistics files from the iceberg puffin spec. Even still, this distinction
>> stays blurry because the supported blob types are defined at a lower level
>> than they should be in my opinion. Reading Datasketches and deletion vector
>> data happens at two separate points of the read process. In my opinion,
>> it's unlikely that someone is writing a single instance of an "iceberg
>> puffin reader", so blob types should be defined at their respective place
>> of access. For example, the deletion vector type should be defined together
>> with manifest files.
>>
>> The reason why I'm saying this is not just to re-organize the
>> documentation, but to make a case that statistics files should be handled
>> more like large properties files with arbitrary information, than a fixed
>> spec that has to be interpretable by every reader. And what do we do with
>> custom properties currently? We carry them over, no matter the change. I'm
>> not the only person that understood statistics files like this. Dremio, for
>> example, had an article (
>> https://www.dremio.com/blog/extending-apache-iceberg-best-practices-for-storing-and-discovering-custom-metadata/)
>> on this, and it coincides with my view. The way they solved our problem is
>> by adding a pointer to their latest puffin file in the table properties. I
>> don't like this because it doesn't work with time travel and breaks file
>> cleanup tasks.
>>
>> Going back to your feedback, I think a realistic approach to achieving
>> what I suggested is to indeed have multiple puffin files per snapshot. That
>> way, we solve the write amplification and the commit problem (though I
>> believe there still are edge cases, just as there are with custom
>> properties). The way we handle proprietary metadata is either by going
>> through the blob types in order to find a supported file, or by adding a new
>> file-level field that saves the file's spec. One of these specs would be
>> the "iceberg statistics spec" with datasketches, and everything else is
>> left to the reader for interpretation. Versions would be handled by the
>> respective spec, just like iceberg does now.
>>
>> Overall, I think it's important to reaffirm that my goal is not for
>> everyone to understand each other. It's just to prevent systems from
>> deleting each other's metadata. An added benefit would be having
>> snapshot-specific custom properties. I think that's a pretty nice bonus.
>>
>> Let me know what you think,
>> Dzeri
>>
>> On Wednesday, July 1st, 2026 at 1:01 PM, Gábor Kaszab <
>> [email protected]> wrote:
>>
>> > Hi Dzeri and Tamas,
>> > Thank you for raising this and sharing your opinion! I'm not entirely
>> sure about the overall conclusions on the proposed way forward here,
>> though. Let me reflect on some of the details:
>> >
>> > Create new snapshot when computing statistics
>> > This is different from the model we use now. A new snapshot is created
>> when data changes within the table. When adding stats, the data is intact
>> but additional metadata is added, so following the model, we don't create a
>> new snapshot. Also, statistics files are attached to a particular snapshot,
>> so it wouldn't be that intuitive to say that in snapshot X we added stats
>> for snapshot Y.
>> > Also, I'm not entirely sure I understand how this would help with
>> engine_A overwriting stats written by engine_B. Would there be snapshot_X
>> that contains stats for snapshot_Y written by engine_Z? Not something we'd
>> want.
>> >
>> > Allow for multiple statistics files to be bound to a snapshot
>> > With this design, how would we match stat files with engines? Would
>> there be a special string ID that describes the engine? Would there be a
>> different ID for different versions of the engine? Would these IDs be part
>> of the Iceberg spec, or would we rely on each engine knowing its own ID?
>> How would readers know which stat file to read? Would Impala version X.Y
>> know if it can read stat files written by Spark version A.B? I'm not sure
>> there is a good way of implementing this.
>> > Not convinced this is the way forward.
>> >
>> > Enforce carry-over of unknown blob data into new puffin files
>> > From the reader's perspective this should be fine, they can pick the
>> blobs they understand by blob type.
>> > From the writer's perspective I'd dispute "Simple to implement".
>> It's simple today, as whenever a writer creates stats for a snapshot, it
>> commits the computed stats for the snapshot unconditionally. With the
>> carry-over approach there are a couple of extra steps and difficulties:
>> > - If there are existing stats for the snapshot, they have to be read
>> > - After writing the merged stats to storage (the stats the writer
>> computed together with the blobs the writer doesn't know about) conflict
>> detection has to be performed before commit. Without this, in case some
>> other writer wrote some proprietary stuff we are expected to carry over, we
>> could lose that information. This requires not only conflict checks, but
>> conflict resolution, retries, etc. that complicates a process that is
>> pretty simple today.
>> >
>> > But most importantly, I'm very hesitant to introduce support
>> (potentially into the spec too) of proprietary stuff we don't understand.
>> > - Iceberg is foremost a specification (with a number of reference
>> implementations) that is powerful for cross-engine compatibility. This
>> means, whatever is in the spec is expected to be understood by engines
>> following the spec.
>> > - Adding proprietary stuff to stats files helps a subset of proprietary
>> engines only. The design allows putting whatever proprietary stuff into
>> Puffin files, but once done, it's up to the proprietary writers to take
>> care of it.
>> > - Once proprietary stuff made it into the Puffin files, I don't think
>> the spec should mandate engines to carry them over.
>> >
>> > Ways forward:
>> > 1) Use the proprietary writer to calculate stats
>> > In this particular case I assume there are 2 writers, one that writes
>> proprietary stuff to Puffin, another that follows the spec and calculates
>> stats into the same Puffin.
>> > In your description you mention you ran ANALYZE TABLE on your table,
>> but I don't think that it's valid for Iceberg tables. For Iceberg tables,
>> the compute_table_stats procedure is for creating the table-level Puffin
>> files. You either avoid using this through Spark, or you run this first and
>> after this you run your proprietary writer (I think this is what you said
>> you were doing) and then your Puffin is as you expect.
>> >
>> > 2) Standardize proprietary stuff
>> > As I mentioned, Iceberg is powerful for cross-engine compatibility.
>> Your proprietary stuff doesn't help other engines, hence there is not much
>> point in adding support to the spec/table format for keeping it. However, I think we can
>> examine what exactly you'd like to store in the Puffin files one-by-one and
>> then discuss if the community shows support to add them to the spec as
>> officially supported blob types. WDYT?
>> >
>> > Best Regards,
>> > Gabor
>> >
>> > Tamás Máté <[email protected]> wrote (on Tue, Jun 30, 2026,
>> 19:41):
>> >
>> > > Hi Dzeri,
>> > >
>> > > Thanks for writing this up. I agree that the stats-only case is the
>> important one to separate from data-changing commits, and that replacement
>> is the tricky part.
>> > >
>> > > My mental model for the lifecycle is:
>> > >
>> > > 1. An engine analyzes an existing snapshot.
>> > > 2. It writes a statistics file for that snapshot.
>> > > 3. It commits a new table metadata version that references the
>> statistics file, without creating a new snapshot.
>> > > 4. That metadata entry is carried forward until the snapshot expires,
>> or until something explicitly replaces or removes it.
>> > >
>> > > A writer that does not understand a blob type has no basis to
>> validate it. Because of that, I think dropping an unknown blob is safer
>> than carrying it forward into a newly written statistics file, where it may
>> become obsolete or misleading. I think we should be aggressive about
>> replacing statistics, and users should be aware that running a
>> stats-producing operation may replace the statistics for that snapshot and
>> drop statistics written by another engine.
>> > >
>> > > I am also concerned about multiple statistics files per snapshot from
>> a planning-latency perspective. Spark's current column-statistics planning
>> path reads NDV from statistics file metadata, not by opening Puffin files
>> ([SparkScan.estimateStatistics](
>> https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L198-L245)).
>> If multiple files require Puffin reads during planning, that would
>> introduce a new planning-time I/O path. One thing to consider is whether
>> statistics loading should move to table load time instead. This is metadata
>> after all, and I would expect the relevant Puffin metadata to be small,
>> probably hundreds of KB at most.
>> > >
>> > > Overall, I think creating a snapshot when statistics are written is
>> the simplest model and makes the most sense to me. I would tie statistics
>> to snapshots so they can be expired with the snapshots they belong to.
>> Also, because a statistics file can already contain multiple blobs,
>> allowing multiple statistics files per snapshot feels similar to allowing
>> more blobs, but with extra file-level lifecycle and planning cost. Problems
>> could quickly arise if engine A drops stats X and Y, and then engine B,
>> which expects X, Y, and Z together, later finds only Z.
>> > >
>> > > What do you think?
>> > >
>> > > Best,
>> > > Tamas
>> > >
>> > > On Thu, 25 Jun 2026 at 12:10, dzeri96 <[email protected]> wrote:
>> > >
>> > > >
>> > > > Hi everyone,
>> > > >
>> > > > I've recently started a discussion on Slack and was advised to post
>> in the dev mailing list.
>> > > > As puffin/statistics files are starting to catch on, we are bound
>> to come across situations where one writer wants to create a new statistics
>> file while some data which it might not understand is already present in
>> the current snapshot's statistics file. I've come across this problem in
>> real life, when I ran `ANALYZE TABLE` in iceberg-spark, which created a new
>> metadata file and replaced my proprietary index data with its own.
>> > > > You could argue that a single type of writer is expected for a
>> table, but on the other hand, the spirit of Iceberg is portability. We
>> can't know who's accessing the table and possibly corrupting its
>> (statistics-)data.
>> > > >
>> > > > Before I get into the proposed solutions, I think it's important to
>> distinguish two scenarios in which statistics files are being written:
>> data-changing and non-data-changing.
>> > > > For data-changing scenarios, I think it's reasonable to assume that
>> old statistics files are no longer valid, and are therefore OK to replace.
>> In the rest of this email, I will focus on scenarios where statistics are
>> being generated and attached to the current snapshot via a new metadata
>> file, as these are the problematic ones.
>> > > >
>> > > > After a short discussion in Slack, we roughly see three possible
>> solutions. I think all of them require a change to the iceberg spec, but
>> with varying gravity:
>> > > >
>> > > > 1. Enforce carry-over of unknown blob data into new puffin files.
>> > > > Pros:
>> > > > - Backwards-compatible reads, not only in terms of the iceberg
>> spec, but also in terms of statistics files semantics.
>> > > > - Simple to implement because blob-level metadata is already
>> available.
>> > > > - One reader could potentially understand statistics blobs
>> calculated by different writers.
>> > > > Cons:
>> > > > - Write amplification.
>> > > > - Conflict resolution might require re-writing the whole file again.
>> > > >
>> > > > 2. Allow for multiple statistics files to be bound to a snapshot.
>> > > > Pros:
>> > > > - Avoids write amplification.
>> > > > - Each writer cares only about its own statistics file.
>> > > > - Finding relevant statistics files is easy thanks to file-level
>> metadata.
>> > > > - One reader could understand statistics files written by different
>> writers.
>> > > > Cons:
>> > > > - Backwards-incompatible reads.
>> > > >
>> > > > 3. Create new snapshot when computing statistics.
>> > > > Pros:
>> > > > - Avoids write amplification.
>> > > > - Each writer cares only about its own statistics files.
>> > > > Cons:
>> > > > - Requires readers to iterate over past snapshots in order to find
>> last valid entry written by a compatible writer.
>> > > >
>> > > > I've definitely left some pros and cons out, but you can roughly
>> map these cases to ways we handle existing file types (metadata, manifest
>> lists, manifests). I'm sure people who have spent time designing the spec
>> can more easily list out the possible pitfalls. In my humble opinion, #3
>> might be the most straightforward, but #2 is what I initially expected from
>> the spec. We are doing #1 internally because it's the only thing we can do
>> in the current situation.
>> > > >
>> > > > Let me know what you think.
>> > > > Cheers,
>> > > > Dzeri
>> > > >
>
>
