On Tue, Aug 1, 2017 at 7:38 PM, Shawn Pearce <spea...@spearce.org> wrote: > On Tue, Aug 1, 2017 at 6:51 PM, Michael Haggerty <mhag...@alum.mit.edu> wrote: >> On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce <spea...@spearce.org> wrote: >>> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty <mhag...@alum.mit.edu> >>> wrote: >>>> [...] >>> A couple of other notes about your contrasting design: >>> >>>> elif chunk_type == INDEX { >>>> chunk_length : varint >>> >>> Using a varint for the chunk length made for a complicated reader. >>> JGit doesn't have the luxury of mmap to access the file, so we have to >>> allocate a byte[] and read data from a file descriptor to do anything >>> fancy like decoding a varint. For my experiment I wound up just >>> hardcoding the IO to read 1k or 4k from whatever address. >>> >>> A "real" implementation would likely prefer to read a fixed width >>> field here such that chunks have a 3 byte header (1 byte chunk_type, 2 >>> byte chunk_length), and then issue a second read to acquire the rest >>> of the chunk. Given that encoding a chunk length of 1024 or 4096 both >>> requires 2 bytes of varint, its always going to be 2 bytes in your >>> design anyway. With the way chunks are scanned, I don't think you want >>> chunks as large as 16k, which would have caused the varint to go to 3 >>> bytes (but still fits in a fixed 2-byte chunk_length). >> >> That's a good point for INDEX and OBJS_INDEX blocks. Though for REFS >> blocks that include reflogs, the block size has to be large enough to >> hold the whole reflog for a reference, which can be arbitrarily large. >> (Maybe this is a weakness of the design?) > > Gah. I missed that part about the whole reflog needing to fit in the > same chunk when I wrote the quoted text above. That is a pretty big > downside for very busy refs. > > Even when you isolate logs into their own file, the reflog for a busy > ref could be huge. A naive reader would want to "page in" the entire > chunk_length before parsing. That isn't tenable if the reflog for a > busy ref was say 50 MiB. It complicates the reader more. My reftable > proposal deals with this by breaking the log up with its special key > structure. > > Requiring the entire reflog of a single ref to fit into a single chunk > does seem to have its downsides.
I was assuming that readers would uncompress the data streamily, in which case I don't think that it would be much of a problem: reflogs are usually read either in their entirety, or just the most recent few entries are read, either of which could be done efficiently despite the whole reflog being in a single zlib-compressed blob. If streamy reading is thought to be too complicated for readers, it wouldn't be a big deal to add a `log_type` of `REFLOG_COMPRESSED_SEGMENTED` and put the entries into multiple, smaller zlib-compressed chunks. >> OBJS blocks can also be >> unbounded in size if very many references point at the same object, >> thought that is perhaps only a theoretical problem. > > Gah, I missed that in reftable. The block id pointer list could cause > a single object id to exceed what fits in a block, and that will cause > the writer to fail unless its caller sets the block size larger. I > basically assumed this overflow condition is very unlikely, as its not > common to have a huge number of refs pointing to the same object. Given what Peff pointed out, let's just leave this as a varint for OBJS blocks. >>>> elif chunk_type == OBJS_INDEX { >>>> chunk_length : varint >>>> >>>> # The offset, relative to the start of this chunk, of the >>>> # chunk containing the next level of the obj index, for each >>>> # of the possible "next" bytes in the SHA-1, or zero if there >>>> # are no references with the given next byte. >>>> child_offset : varint * 256 >>> >>> This is space saving and cute, but kind of annoying. If it was fixed >>> width 32 bit you can address up to 4G away from this chunk's address, >>> and you can directly jump to the byte of interest. By being varints >>> you do save a little space, as most files will probably only need 3 >>> byte varints, and the 0s do collapse down to 1 byte, but you have to >>> linearly walk the list to find any specific byte. >> >> The reason I used varint here was mostly because I expect the lowest >> level of the OBJS_INDEX to be placed close to the OBJS chunks that it >> refers to; hopefully within a 2-byte varint. Let me consider more >> carefully whether that is realistic... > > Ah, ok. So you were expecting the writer to interleave OBJS and > OBJS_INDEX chunks, with the OBJS_INDEX appearing every ~256 OBJS > chunks. And then storing the next level of OBJS_INDEX consecutively. > > I think your math is still wrong about the lowest level OBJS_INDEX > needing only 2-byte varint for child_offset. Given a 1k chunk size, > you still have to address backwards about 256k, which requires a > 3-byte varint. I think that you would want the lowest OBJS_INDEX to be mostly populated, meaning that the next level nodes in the tree would mostly only have zero or one entry. The true distribution [1] would be close to a Poisson distribution with a mean of `λ ≲ 1` (where λ is basically the filling factor), which looks like P(λ, n) = λⁿ exp(-λ) / n! For `λ = 1`, that looks like P(1, 0) = 36.8% P(1, 1) = 36.8% P(1, 2) = 18.4% P(1, 3) = 06.1% P(1, 4) = 01.5% P(1, 5) = 00.3% P(1, 6) = 00.1% so almost all OBJS chunks would have fewer than five entries. And in fact the mean number of entries would be `λ`. So the total size of the OBJS chunks pointed at by a given OBJS_INDEX chunk should be something like 256 * λ * sizeof(OBJS chunk) ≲ 1800 bytes. [1] This is assuming randomness of SHA-1s, which AFAIK is normally a good approximation. It is of course possible that somebody would try to grief the system by creating a repository with lots of objects that have similar SHA-1 prefixes, but I think the worst result would be that some operations would scale like O(number of references) rather than O(256), which isn't all that pathological, and the distinct pattern would be clear evidence of malice that would justify banning the user. > Given the uniform distribution of SHA-1, I think you still wind up > with most of the OBJS_INDEX populated. E.g. in my 866k ref/865k obj > example the unique object abbreviation is 6 raw bytes. The OBJS chunk > storing 2 bytes means we have 4 levels of OBJS_INDEX to cover the > first 4 bytes. That leaves ~844 obj_records in each OBJS chunk. If > those are 7 bytes/obj_record, the OBJS chunk is ~5.7 KiB. > > Forcing the OBJS chunk to stay under say 4 KiB will absolutely cause a > lot more holes in the OBJS_INDEX levels. I was thinking more about how to store the objects lookup table. Let's assume that we combine the lowest-level OBJS_INDEX chunk along with the OBJS chunks that it points to into a new OBJS_LEAF chunk. I think the goals are as follows: * The lowest OBJS_LEAF nodes should have a filling factor of approximately `λ = 1`. * It should be possible to adjust the size of the lowest OBJS_LEAF nodes based on the preferred read size. (This might mean that all of the OBJS_LEAF nodes referred to by the next-higher node in the tree fit together with it in a single 64k block.) * That higher-level nodes also can be packed efficiently in blocks of roughly the preferred read size. * That false positives be the exception. If we ask a higher-level reftable whether it might contain a reference to $sha1, it would be nice to be able to get a "no" answer without having to read actual references. I think that is an argument why we *don't* want to store only the largest unique prefix of the contained SHA-1s. I think we want to *always* store enough bits of SHA-1 prefix to make the referred-to SHA-1s represent only a small fraction (say, 1/256th) of the total prefix space. E.g., if references in a particular repository point to 2ⁿ distinct SHA-1s, then we'd want the prefix to include something like n+8 bits. To achieve these goals, I think we it would be best to make the number of bits used in each level of the tree be variable rather than hard-coded to be 8. By changing the radix of different levels of the tree, I think you could arrange for block sizes that are optimal for your filesystem at the same time that you ensure that the OBJS_LEAF nodes have the desired filling factor. If anybody's interested, I can flesh out these ideas more next week when I have time. >> Peff and I discussed off-list whether the lookup-by-SHA-1 feature is >> so important in the first place. Currently, all references must be >> scanned for the advertisement anyway, > > Not really. You can hide refs and allow-tip-sha1 so clients can fetch > a ref even if it wasn't in the advertisement. We really want to use > that wire protocol capability with Gerrit Code Review to hide the > refs/changes/ namespace from the advertisement, but allow clients to > fetch any of those refs if they send its current SHA-1 in a want line > anyway. > > So a server could scan only the refs/{heads,tags}/ prefixes for the > advertisement, and then leverage the lookup-by-SHA1 to verify other > SHA-1s sent by the client. > >> so avoiding a second scan to vet >> SHA-1s received from the client is at best going to reduce the effort >> by a constant factor. Do you have numbers showing that this >> optimization is worth it? > > No, but I don't think I need to do much to prove it. My 866k ref > example advertisement right now is >62 MiB. If we do what I'm > suggesting in the paragraphs above, the advertisement is ~51 KiB. > >> OTOH a mythical protocol v2 might reduce the need to scan the >> references for advertisement, so maybe this optimization will be more >> helpful in the future? > > Yes, I'm hopeful we can get a v2 protocol built on the work Jonathan > Tan is doing, and switch the advertisement around to "client speaks > first", so that we can be smarter on the server about which refs are > read and sent. That is a long way off, lets say 3-5 years before its > really common in clients. My takeaway from the discussion about whether object lookup is an important feature is that it should very much be optional, and by extension we should probably leave room for other optional extensions in the file format. > [...] >> That's what I was thinking. But I've yet to hear anybody complain >> about missing reflogs for symrefs if the underlying reference is >> updated directly, so maybe we should just leave that "problem" >> unsolved. It is certainly simpler and less brittle not to have to keep >> backreferences like these in sync with the forward references. > > Yup, that is my take on it, and why I didn't try to put this into > reftable drafts, even though it was discussed between us on the list > in earlier messages. Yes, let's forget the idea of including backreferences from refs to the symrefs that refer to them. Michael