+1 to Anoop's point 1, we ran some experiments (it's in the AMT doc appendix <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.d3kqot1t51f3#heading=h.1htfq3n4iktn>) and generic compression algorithms do help with run lengths, and improve the on disk size (up to a certain bitmap size, at which point it's diminishing returns but that bitmap size would be well above what an average manifest size would typically be)
On Wed, Apr 22, 2026 at 2:58 PM Anoop Johnson <[email protected]> wrote: > Hi, Guy. > > You're correct that a run-length encoding will make the bitmap much more > compact for clustered deletes. But it adds some complexity to the mumbling > spec which is currently very simple. > > A couple of considerations: > > 1. The bitmap is stored in the root manifests, which are Parquet in v4 and > typically zstd compressed. Generic compression algorithms typically handle > run lengths very well, so we may not see much difference in the on-disk > size of root manifests. > 2. In-memory footprint during query planning: readers typically will > stream through the root manifests instead of holding it all in-memory, so > it may not save much query engine memory either. > > Best, > Anoop > > On Wed, Apr 22, 2026 at 10:28 AM Guy Khazma <[email protected]> wrote: > >> Hi Ryan, >> >> Considering deletes can be clustered together due to sort order or >> clustering I wonder whether this encoding can also make use of run >> containers like roaring bitmap already do: >> https://arxiv.org/pdf/1603.06549 >> >> The same idea can be adapted to Mumbling 256 bit container size: >> 1. The descriptor will encode run_count - 1. >> 2. Within the container we will store 2*run_count bytes to represent the >> start and (length - 1) of each run. >> 3. the runs will be stored ascending by start and non overlapping (as >> well as contained within the block). >> >> This encoding is limited to 32 runs and on avg they have to be bigger >> than 2 but the clustering may be worth that. >> >> Thanks, >> Guy >> >> On 2026/04/20 23:02:04 Ryan Blue wrote: >> > Hi everyone, >> > >> > For the v4 adaptive metadata tree work, we are planning on embedding >> > bitmaps in the root manifest that act as metadata/manifest deletion >> vectors >> > (MDVs). Amogh looked into how much space this would take in the >> manifests >> > and we found that the Roaring format is pretty large at the scale we're >> > targeting. When we compare it to raw bitmaps, we would be storing an >> extra >> > 500-2,000 bytes per bitmap. As a result, I tried to see if we could use >> the >> > ideas from Roaring, but with smaller containers to fit better with our >> more >> > limited use case: manifests that contain roughly 50,000 entries (a >> single >> > Roaring container). Since it is like Roaring but smaller, I've been >> calling >> > the new format Mumbling. >> > >> > You can view the results comparing Roaring, raw bitmaps, and Mumbling >> > < >> https://docs.google.com/spreadsheets/d/1jOCMS6LYxqFcSKBa848tos2vhRBzZJQN3RqyUa0ElSY/edit?gid=0#gid=0>. >> >> > The results look promising: compressed sizes track much more closely to >> the >> > raw bitmap and the format has smaller overhead in memory than even >> Roaring >> > because of the more granular containers. >> > >> > The next steps are to discuss whether we want to use this format. To do >> > that, I've written up a Mumbling spec >> > <https://gist.github.com/rdblue/20ff3fd91df03483ed84e6de26583743> >> document >> > so that it is clear what exactly the format is doing. That should help >> us >> > evaluate the design choices >> > < >> https://gist.github.com/rdblue/20ff3fd91df03483ed84e6de26583743#design-choices> >> >> > and the cost of implementing this. >> > >> > I think that we should move forward with this bitmap format. It would >> save >> > quite a bit of space in the root manifest and it is a fairly simple >> spec. >> > My size tests used an implementation in Rust >> > <https://gist.github.com/rdblue/c989c8c2fecee951528973cbcc183a9b> that >> is >> > fairly compact so it is not a huge amount of work. I've also reached >> out >> > and we may be able to partner with the Roaring community to make this a >> > part of the larger standard. >> > >> > Please take a look and discuss. Thanks, >> > >> > Ryan >> > >> >
