Hi Ryan,

Considering deletes can be clustered together due to sort order or
clustering I wonder whether this encoding can also make use of run
containers like roaring bitmap already do:
https://arxiv.org/pdf/1603.06549

The same idea can be adapted to Mumbling 256 bit container size:
1. The descriptor will encode run_count - 1.
2. Within the container we will store 2*run_count bytes to represent the
start and (length - 1) of each run.
3. the runs will be stored ascending by start and non overlapping (as well
as contained within the block).

This encoding is limited to 32 runs and on avg they have to be bigger than
2 but the clustering may be worth that.

Thanks,
Guy

On 2026/04/20 23:02:04 Ryan Blue wrote:
> Hi everyone,
>
> For the v4 adaptive metadata tree work, we are planning on embedding
> bitmaps in the root manifest that act as metadata/manifest deletion
vectors
> (MDVs). Amogh looked into how much space this would take in the manifests
> and we found that the Roaring format is pretty large at the scale we're
> targeting. When we compare it to raw bitmaps, we would be storing an
extra
> 500-2,000 bytes per bitmap. As a result, I tried to see if we could use
the
> ideas from Roaring, but with smaller containers to fit better with our
more
> limited use case: manifests that contain roughly 50,000 entries (a single
> Roaring container). Since it is like Roaring but smaller, I've been
calling
> the new format Mumbling.
>
> You can view the results comparing Roaring, raw bitmaps, and Mumbling
> <
https://docs.google.com/spreadsheets/d/1jOCMS6LYxqFcSKBa848tos2vhRBzZJQN3RqyUa0ElSY/edit?gid=0#gid=0>.

> The results look promising: compressed sizes track much more closely to
the
> raw bitmap and the format has smaller overhead in memory than even
Roaring
> because of the more granular containers.
>
> The next steps are to discuss whether we want to use this format. To do
> that, I've written up a Mumbling spec
> <https://gist.github.com/rdblue/20ff3fd91df03483ed84e6de26583743>
document
> so that it is clear what exactly the format is doing. That should help us
> evaluate the design choices
> <
https://gist.github.com/rdblue/20ff3fd91df03483ed84e6de26583743#design-choices>

> and the cost of implementing this.
>
> I think that we should move forward with this bitmap format. It would
save
> quite a bit of space in the root manifest and it is a fairly simple spec.
> My size tests used an implementation in Rust
> <https://gist.github.com/rdblue/c989c8c2fecee951528973cbcc183a9b> that is
> fairly compact so it is not a huge amount of work. I've also reached out
> and we may be able to partner with the Roaring community to make this a
> part of the larger standard.
>
> Please take a look and discuss. Thanks,
>
> Ryan
>

Reply via email to