On Sun, Jul 16, 2017 at 12:43 PM, Shawn Pearce <spea...@spearce.org> wrote:
> On Sun, Jul 16, 2017 at 10:33 AM, Michael Haggerty <mhag...@alum.mit.edu> 
> wrote:
>
>> * The tuning parameter number_of_restarts currently trades off space
>> (for the full refnames and the restart_offsets) against the need to
>> read and parse more ref_records to get the full refnames. ISTM that
>> this tradeoff could be made less painful by defining a blockwide
>> prefix that is omitted from the refnames as used in the restarts. So
>> the full refname would change from
>>
>>       this_name = prior_name[0..prefix_length] + suffix
>>
>>   to
>>
>>       this_name = block_prefix + prior_name[0..prefix_length] + suffix
>>
>>   I would expect this to allow more frequent restarts at lower space
>> cost.
>
> I've been on the fence about the value of this. It makes the search
> with restarts more difficult to implement, but does allow shrinking a
> handful of very popular prefixes like "refs/" and "refs/pulls/" in
> some blocks.
>
> An older format of reftable used only a block_prefix, and could not
> get nearly as good compression as too many blocks contained references
> with different prefixes.


I ran an experiment on my 866k ref data set. Using a block_prefix gets
less compression, and doesn't improve packing in the file. Given the
additional code complexity, it really isn't worth it:

format           |  size    |  blocks |  avg ref/blk
------------------|----------|-----------|----------------
original          | 28 M   |   443    |  1955
block_prefix  |  29 M  |   464    | 1867

:-(

Reply via email to