On 6/15/2018 1:31 PM, Jeff King wrote:
On Fri, Jun 15, 2018 at 09:41:46AM -0700, Junio C Hamano wrote:

Derrick Stolee <sto...@gmail.com> writes:

On 6/14/2018 11:44 PM, Jeff King wrote:
The return value of ewah_read_mmap() is now an ssize_t,
since we could (in theory) process up to 32GB of data. This
would never happen in practice, but a corrupt or malicious
.bitmap or index file could convince us to do so.
Aside: Why a 32GB limit? Usually I see 32-bit issues limiting by 2
(signed) or 4 GB (unsigned). Is there something specifically in the
bitmap format that stops at this size?
The proposed log message for 1/3 has this bit

   - check the size for integer overflow (this should be
     impossible on 64-bit, as the size is given as a 32-bit
     count of 8-byte words, but is possible on a 32-bit
     system)

4 Giga 8-byte words makes 32 Giga bytes, I'd guess.
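
To spell out that arithmetic: the header gives a 32-bit count of 8-byte
words, so the payload can in principle reach 4Gi * 8 bytes = 32GiB, which
does not fit in a 32-bit size_t. A minimal sketch of the kind of check
being described (not the actual ewah_read_mmap() code, and the helper
name here is made up):

    #include <stdint.h>
    #include <stddef.h>

    /*
     * Sketch only: do the byte-length math in 64 bits and reject
     * anything that does not fit in size_t, which can only happen
     * on a 32-bit system.
     */
    static int ewah_byte_len(uint32_t word_count, size_t *len_out)
    {
        uint64_t bytes = (uint64_t)word_count * 8;

        if (bytes > SIZE_MAX)
            return -1;
        *len_out = (size_t)bytes;
        return 0;
    }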
Yes, exactly. It's definitely an odd size. I think using the same
varints we use in the packfile would be more efficient (and they're
already variable-length records), though we tend to have few enough of
these that it probably doesn't matter.

I think a more useful change for the bitmap format would be some kind of
index. IIRC, right now readers have to linearly scan the whole file in
order to use a single bitmap.
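
Purely as a hypothetical sketch of that idea (nothing like this exists in
the current .bitmap format), such an index could be a table of entries,
sorted by commit position, that a reader binary-searches instead of
walking every stored bitmap:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical index entry; made up here for illustration. */
    struct bitmap_idx_entry {
        uint32_t commit_pos;  /* commit's position in the pack order */
        uint64_t offset;      /* byte offset of its bitmap in the file */
    };

    /* Binary search over entries sorted by commit_pos. */
    static const struct bitmap_idx_entry *
    bitmap_idx_find(const struct bitmap_idx_entry *idx, size_t nr,
                    uint32_t commit_pos)
    {
        size_t lo = 0, hi = nr;

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            if (idx[mid].commit_pos == commit_pos)
                return &idx[mid];
            if (idx[mid].commit_pos < commit_pos)
                lo = mid + 1;
            else
                hi = mid;
        }
        return NULL;
    }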

With Stolee's commit-metadata files, we could potentially move to
storing reachability bitmaps as a data element there. But there are two
complications:

  - the bitmaps themselves are dependent on having an ordered list of
    objects (one per bit) to talk about. And that's why they're
    integrated with packfiles, since that provides a constrained universe
    with a set ordering.

  - the existing storage isn't entirely independent between bitmaps. Some
    of them are basically "deltas" off of nearby bitmaps.
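
As I recall, those deltas are XOR-encoded against an earlier bitmap, so
materializing one bitmap can mean reconstructing its base first and then
XOR-ing. Roughly (a sketch, not Git's actual code):

    #include <stdint.h>
    #include <stddef.h>

    /*
     * A stored "delta" bitmap is the XOR of the real bitmap against an
     * earlier one, so the base must already be materialized.
     */
    static void bitmap_apply_xor(uint64_t *out, const uint64_t *delta,
                                 const uint64_t *base, size_t words)
    {
        size_t i;

        for (i = 0; i < words; i++)
            out[i] = delta[i] ^ (base ? base[i] : 0);
    }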

The VSTS reachability bitmap is different from the Git bitmap in a few ways.

1. It uses Roaring+Run bitmaps [1] instead of EWAH bitmaps. This format is simpler (in my opinion) in several ways, especially in how it splits the bitmap into 16-bit address ranges and compresses each range on its own. This makes set containment queries much faster, as we can navigate directly to the chunk that matters and then binary search within it, whether that chunk is an "array" or a "run" chunk (see the toy sketch after this list). I say this is simpler because I can explain the entire compression format to you in five minutes with a whiteboard. The paper [2] is a simple enough read (start at the "Roaring Bitmap" section at the end of page 4).

2. Our file uses a chunk-based approach similar to the commit-graph file: one chunk is simply a list of the commits that have bitmaps. Another chunk is the raw binary data of all bitmaps concatenated together. The last chunk ties the other two together by providing, for each commit in the list, a) an offset into the binary data where that commit's bitmap starts, and b) the commit that serves as the delta base for the bitmap. A rough layout sketch follows below.
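
To make the Roaring lookup path in point 1 concrete, here is a toy
illustration of the idea from the paper, written from memory rather than
taken from the VSTS code; the layout and names are simplified, and real
Roaring also has a plain bitmap container type that I leave out:

    #include <stdint.h>
    #include <stddef.h>

    enum chunk_kind { CHUNK_ARRAY, CHUNK_RUN };

    struct run { uint16_t start, len; };   /* covers start..start+len */

    struct chunk {
        uint16_t key;             /* high 16 bits of the values it holds */
        enum chunk_kind kind;
        size_t nr;                /* number of values or runs */
        const uint16_t *values;   /* sorted low 16 bits (CHUNK_ARRAY) */
        const struct run *runs;   /* sorted runs (CHUNK_RUN) */
    };

    /* Containment: binary search for the chunk, then search inside it. */
    static int roaring_contains(const struct chunk *chunks, size_t nr_chunks,
                                uint32_t value)
    {
        uint16_t hi = value >> 16, lo16 = value & 0xffff;
        size_t l = 0, r = nr_chunks;

        while (l < r) {
            size_t m = l + (r - l) / 2;
            const struct chunk *c = &chunks[m];

            if (c->key == hi) {
                size_t a = 0, b = c->nr, i;

                if (c->kind == CHUNK_ARRAY) {
                    while (a < b) {          /* binary search the array */
                        size_t mid = a + (b - a) / 2;
                        if (c->values[mid] == lo16)
                            return 1;
                        if (c->values[mid] < lo16)
                            a = mid + 1;
                        else
                            b = mid;
                    }
                    return 0;
                }
                for (i = 0; i < c->nr; i++)  /* check each run */
                    if (lo16 >= c->runs[i].start &&
                        lo16 <= c->runs[i].start + c->runs[i].len)
                        return 1;
                return 0;
            }
            if (c->key < hi)
                l = m + 1;
            else
                r = m;
        }
        return 0;
    }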
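
And a rough picture of the three chunks in point 2; the names and field
widths here are made up for illustration and are not the actual on-disk
format:

    #include <stdint.h>

    /*
     * Illustrative only. Chunk A: list of commits that have bitmaps.
     * Chunk B: all compressed bitmaps, concatenated. Chunk C: one entry
     * per commit in chunk A, tying A and B together.
     */
    struct bitmap_chunk_entry {
        uint64_t data_offset;  /* where this commit's bitmap starts in
                                  the concatenated-data chunk */
        uint32_t base_pos;     /* position (in the commit-list chunk) of
                                  the commit used as the delta base */
    };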

If we are considering changing the reachability bitmap, then I'm very intrigued. I think the number one thing to consider is using the multi-pack-index as a reference point (with a stable object order) so the objects can be repacked independently of the reachability bitmap computation. If we are changing the model at that level, then it is worth thinking about other questions, like how we index the file or how we compress the bitmaps.

Thanks,
-Stolee

[1] https://roaringbitmap.org/about/

[2] https://arxiv.org/abs/1603.06549
    Consistently faster and smaller compressed bitmaps with Roaring
