On Fri, Sep 29, 2017 at 12:04:56PM -0700, Jonathan Tan wrote:
> > So depending how you count it, we're wasting between 28% (sha1 and no
> > extra hash) and 16% (sha256 plus reusing hashmap). That's not great, but
> > it's probably not breaking the bank.
>
> Hmm...how did you get the 16% figure? The values, as I see it, are:
> - 32 for the sha256 hash itself
> - 8 for the "next" chain pointer
> - 8 for the padded hash
> - 8 for the "util" pointer
>
> For an oidset, the padded hash and the "util" pointer are wasted, which is
> 16/56=0.286. (If you assume, say, 8 bytes of malloc bookkeeping overhead, then
> 16/64=0.25.)
Sorry to be unclear. I was just counting the waste for the "util"
pointer, not the extra padded hash bit (which AFAIK is already wasted in
oidset). So I computed 48 bytes without the util pointer, which means we
waste an additional 16% to add it.
Anyway, my point was mostly to say that this is a fractional percentage
of the total memory. So it falls into the category of "this might help
in tight situations" and less "this will blow up in our faces".
> In a 100-million-object monorepo, we will probably end up only operating on
> the
> "frontier" objects (objects that we do not have but we know about because some
> object we have reference them) at the worst. I don't have numbers but I think
> that that is at least an order of magnitude less than 100M.
Agreed.
> > So I think we may be better off going with the solution here that's
> > simpler and requires introducing less code. If it does turn out to be a
> > memory problem in the future, this is a _really_ easy thing to optimize
> > after the fact, because we have these nice abstractions.
>
> Optimizing away the padded hash should be straightforward. Optimizing away the
> "util" pointer after we have code that uses it is (slightly?) less
> straightforward, but still doable.
I was thinking here of just oidset. It has a limited API, so swapping
out the implementation for one that does not depend on oidmap and waste
the "util" pointer would be the "easy" part.
> I still lean towards saving memory by eliminating the padded hash and
> not using the "util" pointer, because the complication is contained
> within only two files (oidmap.{c,h}), and because the constant factors
> in memory savings might end up mattering. But I don't feel strongly
> about that - I think any of the oidmaps that we have discussed is an
> improvement over what we have now.
My main concern is not so much about complexity bleeding out of the
oidmap code, but that now we have two parallel near-identical hashmap
implementations. When one gets an optimization, bugfix, or new feature,
we have to port it over manually to the other.
-Peff