On 6/15/2018 1:31 PM, Jeff King wrote:
On Fri, Jun 15, 2018 at 09:41:46AM -0700, Junio C Hamano wrote:

Derrick Stolee <sto...@gmail.com> writes:

On 6/14/2018 11:44 PM, Jeff King wrote:
The return value of ewah_read_mmap() is now an ssize_t,
since we could (in theory) process up to 32GB of data. This
would never happen in practice, but a corrupt or malicious
.bitmap or index file could convince us to do so.
Aside: Why a 32GB limit? Usually I see 32-bit issues limiting by 2
(signed) or 4 GB (unsigned). Is there something specifically in the
bitmap format that stops at this size?
The proposed log message for 1/3 has this bit

   - check the size for integer overflow (this should be
     impossible on 64-bit, as the size is given as a 32-bit
     count of 8-byte words, but is possible on a 32-bit
     system)

4 Giga 8-byte words makes 32 Giga bytes, I'd guess.
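
To spell out that arithmetic: the header gives a 32-bit count of 8-byte
words, so the payload can in principle reach 4Gi * 8 bytes = 32GiB, which
does not fit in a 32-bit size_t. A minimal sketch of the kind of check
being described (not the actual ewah_read_mmap() code, and the helper
name here is made up):

    #include <stdint.h>
    #include <stddef.h>

    /*
     * Sketch only: do the byte-length math in 64 bits and reject
     * anything that does not fit in size_t, which can only happen
     * on a 32-bit system.
     */
    static int ewah_byte_len(uint32_t word_count, size_t *len_out)
    {
        uint64_t bytes = (uint64_t)word_count * 8;

        if (bytes > SIZE_MAX)
            return -1;
        *len_out = (size_t)bytes;
        return 0;
    }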
Yes, exactly. It's definitely an odd size. I think using the same
varints we use in the packfile would be more efficient (and they're
already variable-length records), though we tend to have few enough of
these that it probably doesn't matter.

I think a more useful change for the bitmap format would be some kind of
index. IIRC, right now readers have to linearly scan the whole file in
order to use a single bitmap.
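
Purely as a hypothetical sketch of that idea (nothing like this exists in
the current .bitmap format), such an index could be a table of entries,
sorted by commit position, that a reader binary-searches instead of
walking every stored bitmap:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical index entry; made up here for illustration. */
    struct bitmap_idx_entry {
        uint32_t commit_pos;  /* commit's position in the pack order */
        uint64_t offset;      /* byte offset of its bitmap in the file */
    };

    /* Binary search over entries sorted by commit_pos. */
    static const struct bitmap_idx_entry *
    bitmap_idx_find(const struct bitmap_idx_entry *idx, size_t nr,
                    uint32_t commit_pos)
    {
        size_t lo = 0, hi = nr;

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            if (idx[mid].commit_pos == commit_pos)
                return &idx[mid];
            if (idx[mid].commit_pos < commit_pos)
                lo = mid + 1;
            else
                hi = mid;
        }
        return NULL;
    }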

With Stolee's commit-metadata files, we could potentially move to
storing reachability bitmaps as a data element there. But there are two
complications:

  - the bitmaps themselves are dependent on having an ordered list of
    objects (one per bit) to talk about. And that's why they're
    integrated with packfiles, since that provides a constrained universe
    with a set ordering.

  - the existing storage isn't entirely independent between bitmaps. Some
    of them are basically "deltas" off of nearby bitmaps.
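
As I recall, those deltas are XOR-encoded against an earlier bitmap, so
materializing one bitmap can mean reconstructing its base first and then
XOR-ing. Roughly (a sketch, not Git's actual code):

    #include <stdint.h>
    #include <stddef.h>

    /*
     * A stored "delta" bitmap is the XOR of the real bitmap against an
     * earlier one, so the base must already be materialized.
     */
    static void bitmap_apply_xor(uint64_t *out, const uint64_t *delta,
                                 const uint64_t *base, size_t words)
    {
        size_t i;

        for (i = 0; i < words; i++)
            out[i] = delta[i] ^ (base ? base[i] : 0);
    }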

The VSTS reachability bitmap is different from the Git bitmap in a few ways.

1. It uses Roaring+Run bitmaps [1] instead of EWAH bitmaps. This format is simpler (in my opinion) in several ways, especially in how it splits the bitmap into 16-bit address ranges and compresses each range on its own. This makes set containment queries much faster, as we can navigate directly to the chunk that matters and then binary search within it, whether that chunk is an "array" or a "run" chunk (see the toy sketch after this list). I say this is simpler because I can explain the entire compression format to you in five minutes with a whiteboard. The paper [2] is a simple enough read (start at the "Roaring Bitmap" section at the end of page 4).

2. Our file uses a chunk-based approach similar to the commit-graph file: one chunk is simply a list of the commits that have bitmaps. Another chunk is the raw binary data of all bitmaps concatenated together. The last chunk ties the other two together by providing, for each commit in the list, a) an offset into the binary data where that commit's bitmap starts, and b) the commit that serves as the delta base for the bitmap. A rough layout sketch follows below.
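
To make the Roaring lookup path in point 1 concrete, here is a toy
illustration of the idea from the paper, written from memory rather than
taken from the VSTS code; the layout and names are simplified, and real
Roaring also has a plain bitmap container type that I leave out:

    #include <stdint.h>
    #include <stddef.h>

    enum chunk_kind { CHUNK_ARRAY, CHUNK_RUN };

    struct run { uint16_t start, len; };   /* covers start..start+len */

    struct chunk {
        uint16_t key;             /* high 16 bits of the values it holds */
        enum chunk_kind kind;
        size_t nr;                /* number of values or runs */
        const uint16_t *values;   /* sorted low 16 bits (CHUNK_ARRAY) */
        const struct run *runs;   /* sorted runs (CHUNK_RUN) */
    };

    /* Containment: binary search for the chunk, then search inside it. */
    static int roaring_contains(const struct chunk *chunks, size_t nr_chunks,
                                uint32_t value)
    {
        uint16_t hi = value >> 16, lo16 = value & 0xffff;
        size_t l = 0, r = nr_chunks;

        while (l < r) {
            size_t m = l + (r - l) / 2;
            const struct chunk *c = &chunks[m];

            if (c->key == hi) {
                size_t a = 0, b = c->nr, i;

                if (c->kind == CHUNK_ARRAY) {
                    while (a < b) {          /* binary search the array */
                        size_t mid = a + (b - a) / 2;
                        if (c->values[mid] == lo16)
                            return 1;
                        if (c->values[mid] < lo16)
                            a = mid + 1;
                        else
                            b = mid;
                    }
                    return 0;
                }
                for (i = 0; i < c->nr; i++)  /* check each run */
                    if (lo16 >= c->runs[i].start &&
                        lo16 <= c->runs[i].start + c->runs[i].len)
                        return 1;
                return 0;
            }
            if (c->key < hi)
                l = m + 1;
            else
                r = m;
        }
        return 0;
    }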
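
And a rough picture of the three chunks in point 2; the names and field
widths here are made up for illustration and are not the actual on-disk
format:

    #include <stdint.h>

    /*
     * Illustrative only. Chunk A: list of commits that have bitmaps.
     * Chunk B: all compressed bitmaps, concatenated. Chunk C: one entry
     * per commit in chunk A, tying A and B together.
     */
    struct bitmap_chunk_entry {
        uint64_t data_offset;  /* where this commit's bitmap starts in
                                  the concatenated-data chunk */
        uint32_t base_pos;     /* position (in the commit-list chunk) of
                                  the commit used as the delta base */
    };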

If we are considering changing the reachability bitmap, then I'm very intrigued. I think the number one thing to consider is using the multi-pack-index as a reference point (with a stable object order) so the objects can be repacked independently of the reachability bitmap computation. If we are changing the model at that level, then it is worth thinking about other questions, like how we index the file or how we compress the bitmaps.

Thanks,
-Stolee

[1] https://roaringbitmap.org/about/

[2] https://arxiv.org/abs/1603.06549
    Consistently faster and smaller compressed bitmaps with Roaring
