On Fri, Mar 14, 2014 at 10:29 PM, Jeff King <p...@peff.net> wrote:
>> If an object is reused, we already know its compressed size. If it's
>> not reused and is a loose object, we could use the on-disk size. It's a
>> lot harder to estimate a not-reused, deltified object. All we have is
>> the uncompressed size and the size of each delta in the delta chain.
>> Neither gives a good hint of what the compressed size would be.
>
> Hmm. I think we do have the compressed delta size after having run the
> compression phase (because that is ultimately what we compare to find
> the best delta).

There are cases where we skip the delta search entirely (large blobs,
files that are too small, or the -delta attribute). The large blob
case is especially interesting because the progress bar crawls while
we write these objects.
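
Just to sketch what I mean by a per-object estimate before the write
phase (the struct and field names below are invented for illustration;
this is not the real object_entry):

/* Hypothetical sketch: estimate of the compressed bytes we will write
 * for one object.  The struct and its fields are illustrative only,
 * not git's actual data structures. */
#include <sys/types.h>

struct obj_est {
	off_t in_pack_size;   /* compressed size in the source pack, if reused */
	off_t z_delta_size;   /* compressed size of the chosen delta, if any */
	off_t on_disk_size;   /* loose object file size, if loose */
	off_t expanded_size;  /* uncompressed size */
	unsigned reuse:1;
};

static off_t estimate_object_bytes(const struct obj_est *e)
{
	if (e->reuse)
		return e->in_pack_size;  /* exact: the bytes are copied as-is */
	if (e->z_delta_size)
		return e->z_delta_size;  /* close: deltas are stored deflated */
	if (e->on_disk_size)
		return e->on_disk_size;  /* close, unless the compression level differs */
	/* not reused, not deltified, not loose (e.g. a large blob we never
	 * tried to delta against): all we have is the uncompressed size */
	return e->expanded_size;
}

Summing something like that over the packing list would at least keep
the meter moving sensibly, even in the large-blob case above.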

> Loose objects are probably the hardest here, as we
> actually recompress them (IIRC, because packfiles encode the type/size
> info outside of the compressed bit, whereas it is inside for loose
> objects; the "experimental loose" format harmonized this, but it never
> caught on).
>
> Without doing that recompression, any value you came up with would be an
> estimate, though it would be pretty close (not off by more than a few
> bytes per object).

That's my hope. Although if the user tweaks the compression level, the
estimate could be off (gzip -9 and gzip -1 produce a big difference in
size).
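
For what it's worth, a standalone toy (plain zlib, nothing from git;
gzip is the same deflate underneath) shows the gap, though how big it
is obviously depends on the input:

/* Standalone toy, not git code: deflate the same buffer at zlib
 * levels 1 and 9 and compare the output sizes.  Build with -lz. */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int main(void)
{
	uLong srclen = 1 << 20;
	unsigned char *src = malloc(srclen);
	uLong bound = compressBound(srclen);
	unsigned char *dst = malloc(bound);
	int levels[] = { 1, 9 };
	uLong i;
	int j;

	if (!src || !dst)
		return 1;
	/* moderately compressible input so the levels actually differ */
	for (i = 0; i < srclen; i++)
		src[i] = (unsigned char)((i % 64) + (i / 4096));

	for (j = 0; j < 2; j++) {
		uLong dstlen = bound;
		if (compress2(dst, &dstlen, src, srclen, levels[j]) != Z_OK)
			return 1;
		printf("level %d: %lu bytes\n", levels[j], (unsigned long)dstlen);
	}
	free(src);
	free(dst);
	return 0;
}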

> However, you can't just run through the packing list
> and add up the object sizes; you'd need to do a real "dry-run" through
> the writing phase. There are probably more I'm missing, but you need at
> least to figure out:
>
>   1. The actual compressed size of a full loose object, as described
>      above.
>
>   2. The variable-length headers for each object based on its type and
>      size.

We could run through a "typical" repo, calculate the average header
length, then use it for all objects?
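
Or just compute it per object, since the encoding is cheap. A sketch
following the pack entry header format (not the actual helper in
pack-objects):

/* Sketch only: how many header bytes a pack entry needs for an object
 * of the given uncompressed size.  The first byte carries the type and
 * the low 4 bits of the size; each continuation byte, flagged by its
 * high bit, carries 7 more size bits. */
static unsigned pack_entry_header_len(unsigned long size)
{
	unsigned len = 1;

	size >>= 4;
	while (size) {
		size >>= 7;
		len++;
	}
	return len;
}

Deltas then add the base reference on top of that (an offset varint
for OFS_DELTA, a full object name for REF_DELTA), but that part is
just as easy to account for.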

>
>   3. The final form that the object will take based on what has come
>      before. For example, if there is a max pack size, we may split an
>      object from its delta base, in which case we have to throw away the
>      delta. We don't know where those breaks will be until we walk
>      through the whole list.

Ah, this one could probably be avoided. Max pack size does not apply
to streaming pack-objects, which is where the progress bar is most
often shown. Falling back to counting objects in that case does not
sound too bad.

>
>   4. If an object we attempt to reuse turns out to be corrupted, we
>      fall back to the non-reuse code path, which will have a different
>      size. So you'd need to actually check the reused object CRCs during
>      the dry-run (and for local repacks, not transfers, we actually
>      inflate and check the zlib, too, for safety).

Ugh..

>
> So I think it's _possible_. But it's definitely not trivial. For now, I
> think it makes sense to go with something like the patch I posted
> earlier (which I'll re-roll in a few minutes). That fixes what is IMHO a
> regression in the bitmaps case. And it does not make it any harder for
> somebody to later convert us to a true byte-counter (i.e., it is the
> easy half already).

Agreed.
-- 
Duy