Re: What's so special about objects/17/ ?

2018-10-10 Thread Stefan Beller
On Tue, Oct 9, 2018 at 6:10 PM Junio C Hamano  wrote:
>
> Stefan Beller  writes:
> >> Oh, I think I misled you by saying "more important".
> >> ...
> > I do challenge the decision to take a hardcoded value, though, ...
>
> I do not find any reason why you need to say "though" here.

I caught myself using lots of filler-words lately.

  Though, however, I think, I would guess, IMHO
  fills a lot of space without saying much.

I'll reduce that.

>  If you
> understood the message you are responding to that use of hardcoded
> value was chosen not to help the end-user experience, it should have
> been clear that we are in agreement.

We are, but for different reasons.

> I also sometimes find certain people here are unnecessarily
> combative in their discussion.  It this just some language issue?

certain people? ;-)
I have issues with ambiguity in communication directed towards me,
which is why I sometimes try to be very direct and blunt.
Other times I strive on ambiguity as well (mostly in my reviews).


Re: What's so special about objects/17/ ?

2018-10-09 Thread Junio C Hamano
Stefan Beller  writes:
>> Oh, I think I misled you by saying "more important".
>> ...
> I do challenge the decision to take a hardcoded value, though, ...

I do not find any reason why you need to say "though" here.  If you
understood the message you are responding to that use of hardcoded
value was chosen not to help the end-user experience, it should have
been clear that we are in agreement.

I also sometimes find certain people here are unnecessarily
combative in their discussion.  It this just some language issue?




Re: What's so special about objects/17/ ?

2018-10-09 Thread Stefan Beller
On Mon, Oct 8, 2018 at 6:07 PM Junio C Hamano  wrote:
>
> Ævar Arnfjörð Bjarmason  writes:
>
> > Depending on how we're counting there's at least two.
>
> I thought you were asking "why the special sentinel is not 0{40}?"
> You counted the number of reasons why 0{40} is used to stand in for
> a real value, but that was the number I didn't find interesting in
> the scope of this discussion, i.e. "why the special sample is 17?"
>
> I vaguely recall we also used 0{39}1 for something else long time
> ago; I offhand do not recall if we still do, or we got rid of it.

gitk still shows changes added to the index as 0{39}1, whereas
changes not added yet are marked as 0{40}.


Re: What's so special about objects/17/ ?

2018-10-09 Thread Stefan Beller
On Mon, Oct 8, 2018 at 6:03 PM Junio C Hamano  wrote:
>
> Stefan Beller  writes:
>
> > On Sun, Oct 7, 2018 at 1:07 PM Junio C Hamano  wrote:
> >>
> >> Junio C Hamano  writes:
> >
> >> > ...
> >> > by general public and I do not have to explain the choice to the
> >> > general public ;-)
> >>
> >> One thing that is more important than "why not 00 but 17?" to answer
> >> is why a hardcoded number rather than a runtime random.  It is for
> >> repeatability.
> >
> > Let's talk about repeatability vs statistics for a second. ;-)
>
> Oh, I think I misled you by saying "more important".
>
> I didn't mean that it is more important to stick to the "use
> hardcoded value" design decision than sticking to "use 17".  I've
> made sure that everybody would understnd choosing any arbitrary byte
> value other than "17" does not make the resulting Git any better nor
> worse.

Yes, I totally get that. We could have chosen 42 just because.


>  But discussing the design decision to use hardcoded value is
> "more important", as that affects the balance between the end-user
> experience and debuggability, and I tried to help those who do not
> know the history by giving the fact that choice was made for the
> latter and not for other hidden reasons, that those who would
> propose to change the system may have to keep in mind.

>From an end users point of view, the auto gc kicks in at random.
(Maybe it's just me, but I don't keep track of the loose object count ;-)

For debuggability, we could design a system that allows for debugging,
e.g. "When GIT_AUTO_GC_BIN is set, use the number as set, otherwise
take a random slot".

> Sorry if you mistook it as if I were saying that it is important to
> keep the design to use a hardcoded byte value.  That wasn't what the
> message was about.

I understood very well that the choice of value was arbitrary and you
do not have a convincing story as to why 17 (and not say 23, but such
a story is not required, as all slots are equal from a design perspective).

I do challenge the decision to take a hardcoded value, though, as it
yields better properties for the end users IMHO, whereas debugging
this specific case does not seem to be important to me.


Re: What's so special about objects/17/ ?

2018-10-08 Thread Junio C Hamano
Ævar Arnfjörð Bjarmason  writes:

> Depending on how we're counting there's at least two.

I thought you were asking "why the special sentinel is not 0{40}?"
You counted the number of reasons why 0{40} is used to stand in for
a real value, but that was the number I didn't find interesting in
the scope of this discussion, i.e. "why the special sample is 17?"

I vaguely recall we also used 0{39}1 for something else long time
ago; I offhand do not recall if we still do, or we got rid of it.


Re: What's so special about objects/17/ ?

2018-10-08 Thread Junio C Hamano
Stefan Beller  writes:

> On Sun, Oct 7, 2018 at 1:07 PM Junio C Hamano  wrote:
>>
>> Junio C Hamano  writes:
>
>> > ...
>> > by general public and I do not have to explain the choice to the
>> > general public ;-)
>>
>> One thing that is more important than "why not 00 but 17?" to answer
>> is why a hardcoded number rather than a runtime random.  It is for
>> repeatability.
>
> Let's talk about repeatability vs statistics for a second. ;-)

Oh, I think I misled you by saying "more important".  

I didn't mean that it is more important to stick to the "use
hardcoded value" design decision than sticking to "use 17".  I've
made sure that everybody would understnd choosing any arbitrary byte
value other than "17" does not make the resulting Git any better nor
worse.  But discussing the design decision to use hardcoded value is
"more important", as that affects the balance between the end-user
experience and debuggability, and I tried to help those who do not
know the history by giving the fact that choice was made for the
latter and not for other hidden reasons, that those who would
propose to change the system may have to keep in mind.

Sorry if you mistook it as if I were saying that it is important to
keep the design to use a hardcoded byte value.  That wasn't what the
message was about.


Re: What's so special about objects/17/ ?

2018-10-08 Thread Stefan Beller
On Sun, Oct 7, 2018 at 1:07 PM Junio C Hamano  wrote:
>
> Junio C Hamano  writes:

> > ...
> > by general public and I do not have to explain the choice to the
> > general public ;-)
>
> One thing that is more important than "why not 00 but 17?" to answer
> is why a hardcoded number rather than a runtime random.  It is for
> repeatability.

Let's talk about repeatability vs statistics for a second. ;-)

If I am a user and I were really into optimizing my loose object count
for some reason, so I would want to choose a low number of
gc.auto. Let's say I go with 128.

At the low end of loose objects the approximation is yielding
some high relative errors. This is because of the granularity, i.e.
gc would implicitly estimate the loose objects to be 0 or 256 or 512, (or more)
if there is 0, 1, 2 (or more) loose objects in the objects/17.

As each object can be viewed following an unfair coin flip
(With a chance of 1/256 it is in objects/17), the distribution in
objects/17 (and hence any other objects/XX bin) follows the
Bernoulli distribution.

If I do have say about 157 loose objects (and having auto.gc
configured anywhere in 1..255), then the probability to not
gc is 54% (as that is the probability to have 0 objects in /17,
following probability mass function of the Bernoulli distribution,
(i.e. Pr(0 objects) = (157 over 0) x (1/256)^0 x (255/256)^157))

As it is repeatable (by picking the same /17 every time), I can run
"gc --auto" multiple times and still have 157 loose objects, despite
wanting to have only 128 loose objects at a 54% chance.

If we'd roll the 256 dice every time to pick a different bin,
then we might hit another bin and gc in the second or third
gc, which would be more precise on average.

By having repeatability we allow for these numbers to be far off
more often when configuring small numbers.

I think that is the right choice, as we probably do not care about the
exactness of auto-gc for small numbers, as it is a performance
thing anyway. Although documenting it properly might be a challenge.

The current wording of auto.gc seems to suggest that we are right
for the number as we compute it via the implying the expected value,
(i.e. we pick a bin and multiply the fullness of the bin by the number
of bins to estimate the whole fullness, see the mean=n p on [1])
I think a user would be far more interested in giving an upper bound,
i.e. expressing something like "I will have at most $auto.gc objects
before gc kicks in" or "The likelihood to exceed the $auto.gc number
of loose objects by $this much is less than 5%", for which the math
would be more complicated, but easier to document with the words of
statistics.

[1]  https://en.wikipedia.org/wiki/Binomial_distribution

Stefan


Re: What's so special about objects/17/ ?

2018-10-08 Thread Ævar Arnfjörð Bjarmason


On Sun, Oct 07 2018, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  writes:
>
>> 1. We still have this check of objects/17/ in builtin/gc.c today. Why
>>objects/17/ and not e.g. objects/00/ to go with other 000* magic such
>>as the  SHA-1?d  Statistically
>>it doesn't matter, but 17 seems like an odd thing to pick at random
>>out of 00..ff, does it have any significance?
>
> There is no "other 000* magic such as ...". There is only one 0{40}
> magic and that one must be memorable and explainable.

Depending on how we're counting there's at least two. We also use
 as a placeholder for "couldn't
read a ref" in addition or "this is a placeholder for an invalid ref" in
addition to how it's used to signify creation/deletion to the in the
likes of the pre-receive hook:

$ echo hello > .git/refs/something
$ git fsck
[...]
error: refs/something: invalid sha1 pointer 

$ > .git/refs/something
$ git fsck
[...]
error: refs/something: invalid sha1 pointer 


This is because the refs backend will memzero the oid struct, and if we
fail to read things it'll still be zero'd out.

This manifests e.g. in this confusing fsck output, due to a bug where
GitLab will write empty refs/keep-around/* refs sometimes:
https://gitlab.com/gitlab-org/gitlab-ce/issues/44431

> The 1/256 sample can be any one among 256.  Just like the date
> string on the first line of the output to be used as the /etc/magic
> signature by format-patch, it was an arbitrary choice, rather than a
> random choice, and unlike 0{40} this does not have to be memorable
> by general public and I do not have to explain the choice to the
> general public ;-)

I wanted to elaborate on the explanation for "gc.auto" in
git-config. Now we just say "approximately 6700". Since this behavior
has been really stable for a long time we could say we sample 1/256 of
the .git/objects/?? dirs, and this explains any perceived discrepancies
between the 6700 number and $(find .git/objects/?? -type f | wc -l).

>> 2. It seems overly paranoid to be checking that the files in
>>   .git/objects/17/ look like a SHA-1.
>
> There is no other reason than futureproofing.  We were paying cost
> to open and scan the directory anyway, and checking that we only
> count the loose object files was (and still is) a sensible thing to
> do to allow us not even worry about the other kind of things we
> might end up creating there.

Makes sense. Just wanted to ask if it was that or some workaround for
historical files being there.


Re: What's so special about objects/17/ ?

2018-10-07 Thread Junio C Hamano
Johannes Sixt  writes:

> Am 07.10.18 um 21:06 schrieb Ævar Arnfjörð Bjarmason:
>> Picking any one number is explained in the comment. I'm asking why 17 in
>> particular not for correctness reasons but as a bit of historical lore,
>> and because my ulterior is to improve the GC docs.
>>
>> The number in that comic is 4 (and no datestamp on when it was
>> published). Are you saying Junio's patch is somehow a reference to that
>> xkcd in particular, or that it's just a funny reference in this context?
>
> No lore, AFAIR. It's just a random number, determined by a fair dice
> roll or something ;)

As I already said, I did not pick the number randomly, but rather
arbitrarily, and it is not 00 because the chosen number (unlike the
0{40} magic we use elsewhere) does not have to be memorable, and the
choice does not have to be explainable.

So people will not get any further explanation as to the reason
behind that arbitrary choice, but it was not random.


Re: What's so special about objects/17/ ?

2018-10-07 Thread Johannes Sixt

Am 07.10.18 um 21:06 schrieb Ævar Arnfjörð Bjarmason:

Picking any one number is explained in the comment. I'm asking why 17 in
particular not for correctness reasons but as a bit of historical lore,
and because my ulterior is to improve the GC docs.

The number in that comic is 4 (and no datestamp on when it was
published). Are you saying Junio's patch is somehow a reference to that
xkcd in particular, or that it's just a funny reference in this context?


No lore, AFAIR. It's just a random number, determined by a fair dice 
roll or something ;)


-- Hannes


Re: What's so special about objects/17/ ?

2018-10-07 Thread Junio C Hamano
Junio C Hamano  writes:

> Ævar Arnfjörð Bjarmason  writes:
>
>> 1. We still have this check of objects/17/ in builtin/gc.c today. Why
>>objects/17/ and not e.g. objects/00/ to go with other 000* magic such
>>as the  SHA-1?d  Statistically
>>it doesn't matter, but 17 seems like an odd thing to pick at random
>>out of 00..ff, does it have any significance?
>
> ...
> by general public and I do not have to explain the choice to the
> general public ;-)

One thing that is more important than "why not 00 but 17?" to answer
is why a hardcoded number rather than a runtime random.  It is for
repeatability.



Re: What's so special about objects/17/ ?

2018-10-07 Thread Junio C Hamano
Ævar Arnfjörð Bjarmason  writes:

> 1. We still have this check of objects/17/ in builtin/gc.c today. Why
>objects/17/ and not e.g. objects/00/ to go with other 000* magic such
>as the  SHA-1?d  Statistically
>it doesn't matter, but 17 seems like an odd thing to pick at random
>out of 00..ff, does it have any significance?

There is no "other 000* magic such as ...". There is only one 0{40}
magic and that one must be memorable and explainable.

The 1/256 sample can be any one among 256.  Just like the date
string on the first line of the output to be used as the /etc/magic
signature by format-patch, it was an arbitrary choice, rather than a
random choice, and unlike 0{40} this does not have to be memorable
by general public and I do not have to explain the choice to the
general public ;-)

> 2. It seems overly paranoid to be checking that the files in
>   .git/objects/17/ look like a SHA-1.

There is no other reason than futureproofing.  We were paying cost
to open and scan the directory anyway, and checking that we only
count the loose object files was (and still is) a sensible thing to
do to allow us not even worry about the other kind of things we
might end up creating there.


Re: What's so special about objects/17/ ?

2018-10-07 Thread Ævar Arnfjörð Bjarmason


On Sun, Oct 07 2018, Johannes Sixt wrote:

> Am 07.10.18 um 20:28 schrieb Ævar Arnfjörð Bjarmason:
>> In 2007 Junio wrote
>> (https://public-inbox.org/git/7vr6lcj2zi@gitster.siamese.dyndns.org/):
>>
>>  +static int need_to_gc(void)
>>  +{
>>  +   /*
>>  +* Quickly check if a "gc" is needed, by estimating how
>>  +* many loose objects there are.  Because SHA-1 is evenly
>>  +* distributed, we can check only one and get a reasonable
>>  +* estimate.
>>  +*/
>
>> 1. We still have this check of objects/17/ in builtin/gc.c today. Why
>> objects/17/ and not e.g. objects/00/ to go with other 000* magic such
>> as the  SHA-1?  Statistically
>> it doesn't matter, but 17 seems like an odd thing to pick at random
>> out of 00..ff, does it have any significance?
>
> The reason is explained in the comment. And, BTW, you do know about
> this one: https://xkcd.com/221/ don't you? (TLDR: the title is "Random
> Number")

Picking any one number is explained in the comment. I'm asking why 17 in
particular not for correctness reasons but as a bit of historical lore,
and because my ulterior is to improve the GC docs.

The number in that comic is 4 (and no datestamp on when it was
published). Are you saying Junio's patch is somehow a reference to that
xkcd in particular, or that it's just a funny reference in this context?

>> 2. It seems overly paranoid to be checking that the files in
>>.git/objects/17/ look like a SHA-1. If we have stuff not generated by
>>git in .git/objects/??/ we probably have bigger problems than
>>prematurely triggering auto gc, can this just be removed as
>>redundant. Was this some check e.g. expecting that this would need to
>>deal with tempfiles in these directories that we created at the time
>>(but no longer do?)?
>
> It's not about that there are SHA-1s in there, it's about how many
> there are.

Right, I'm wondering if it couldn't be replaced by some general path.c
"number_of_files_in_dir" helper. I.e. why this code is being paranoid
about ignoring the likes of
.git/objects/17/{foo,bar,some-other-garbage}. A number_of_files_in_dir()
would obviously need to ignore "." and "..".


Re: What's so special about objects/17/ ?

2018-10-07 Thread Johannes Sixt

Am 07.10.18 um 20:28 schrieb Ævar Arnfjörð Bjarmason:

In 2007 Junio wrote
(https://public-inbox.org/git/7vr6lcj2zi@gitster.siamese.dyndns.org/):

 +static int need_to_gc(void)
 +{
 +  /*
 +   * Quickly check if a "gc" is needed, by estimating how
 +   * many loose objects there are.  Because SHA-1 is evenly
 +   * distributed, we can check only one and get a reasonable
 +   * estimate.
 +   */



1. We still have this check of objects/17/ in builtin/gc.c today. Why
objects/17/ and not e.g. objects/00/ to go with other 000* magic such
as the  SHA-1?  Statistically
it doesn't matter, but 17 seems like an odd thing to pick at random
out of 00..ff, does it have any significance?


The reason is explained in the comment. And, BTW, you do know about this 
one: https://xkcd.com/221/ don't you? (TLDR: the title is "Random Number")



2. It seems overly paranoid to be checking that the files in
   .git/objects/17/ look like a SHA-1. If we have stuff not generated by
   git in .git/objects/??/ we probably have bigger problems than
   prematurely triggering auto gc, can this just be removed as
   redundant. Was this some check e.g. expecting that this would need to
   deal with tempfiles in these directories that we created at the time
   (but no longer do?)?


It's not about that there are SHA-1s in there, it's about how many there 
are.


-- Hannes


What's so special about objects/17/ ?

2018-10-07 Thread Ævar Arnfjörð Bjarmason
In 2007 Junio wrote
(https://public-inbox.org/git/7vr6lcj2zi@gitster.siamese.dyndns.org/):

+static int need_to_gc(void)
+{
+   /*
+* Quickly check if a "gc" is needed, by estimating how
+* many loose objects there are.  Because SHA-1 is evenly
+* distributed, we can check only one and get a reasonable
+* estimate.
+*/
+   char path[PATH_MAX];
+   const char *objdir = get_object_directory();
+   DIR *dir;
+   struct dirent *ent;
+   int auto_threshold;
+   int num_loose = 0;
+   int needed = 0;
+
+   if (sizeof(path) <= snprintf(path, sizeof(path), "%s/17", objdir)) {
+   warning("insanely long object directory %.*s", 50, objdir);
+   return 0;
+   }
+   dir = opendir(path);
+   if (!dir)
+   return 0;
+
+   auto_threshold = (gc_auto_threshold + 255) / 256;
+   while ((ent = readdir(dir)) != NULL) {
+   if (strspn(ent->d_name, "0123456789abcdef") != 38 ||
+   ent->d_name[38] != '\0')
+   continue;
+   if (++num_loose > auto_threshold) {
+   needed = 1;
+   break;
+   }
+   }

A couple of questions about this patch, which is in git.git as
2c3c439947 ("Implement git gc --auto", 2007-09-05)

1. We still have this check of objects/17/ in builtin/gc.c today. Why
   objects/17/ and not e.g. objects/00/ to go with other 000* magic such
   as the  SHA-1?  Statistically
   it doesn't matter, but 17 seems like an odd thing to pick at random
   out of 00..ff, does it have any significance?

2. It seems overly paranoid to be checking that the files in
  .git/objects/17/ look like a SHA-1. If we have stuff not generated by
  git in .git/objects/??/ we probably have bigger problems than
  prematurely triggering auto gc, can this just be removed as
  redundant. Was this some check e.g. expecting that this would need to
  deal with tempfiles in these directories that we created at the time
  (but no longer do?)?