Re: What's so special about objects/17/ ?
On Tue, Oct 9, 2018 at 6:10 PM Junio C Hamano wrote: > > Stefan Beller writes: > >> Oh, I think I misled you by saying "more important". > >> ... > > I do challenge the decision to take a hardcoded value, though, ... > > I do not find any reason why you need to say "though" here. I caught myself using lots of filler-words lately. Though, however, I think, I would guess, IMHO fills a lot of space without saying much. I'll reduce that. > If you > understood the message you are responding to that use of hardcoded > value was chosen not to help the end-user experience, it should have > been clear that we are in agreement. We are, but for different reasons. > I also sometimes find certain people here are unnecessarily > combative in their discussion. It this just some language issue? certain people? ;-) I have issues with ambiguity in communication directed towards me, which is why I sometimes try to be very direct and blunt. Other times I strive on ambiguity as well (mostly in my reviews).
Re: What's so special about objects/17/ ?
Stefan Beller writes: >> Oh, I think I misled you by saying "more important". >> ... > I do challenge the decision to take a hardcoded value, though, ... I do not find any reason why you need to say "though" here. If you understood the message you are responding to that use of hardcoded value was chosen not to help the end-user experience, it should have been clear that we are in agreement. I also sometimes find certain people here are unnecessarily combative in their discussion. It this just some language issue?
Re: What's so special about objects/17/ ?
On Mon, Oct 8, 2018 at 6:07 PM Junio C Hamano wrote: > > Ævar Arnfjörð Bjarmason writes: > > > Depending on how we're counting there's at least two. > > I thought you were asking "why the special sentinel is not 0{40}?" > You counted the number of reasons why 0{40} is used to stand in for > a real value, but that was the number I didn't find interesting in > the scope of this discussion, i.e. "why the special sample is 17?" > > I vaguely recall we also used 0{39}1 for something else long time > ago; I offhand do not recall if we still do, or we got rid of it. gitk still shows changes added to the index as 0{39}1, whereas changes not added yet are marked as 0{40}.
Re: What's so special about objects/17/ ?
On Mon, Oct 8, 2018 at 6:03 PM Junio C Hamano wrote: > > Stefan Beller writes: > > > On Sun, Oct 7, 2018 at 1:07 PM Junio C Hamano wrote: > >> > >> Junio C Hamano writes: > > > >> > ... > >> > by general public and I do not have to explain the choice to the > >> > general public ;-) > >> > >> One thing that is more important than "why not 00 but 17?" to answer > >> is why a hardcoded number rather than a runtime random. It is for > >> repeatability. > > > > Let's talk about repeatability vs statistics for a second. ;-) > > Oh, I think I misled you by saying "more important". > > I didn't mean that it is more important to stick to the "use > hardcoded value" design decision than sticking to "use 17". I've > made sure that everybody would understnd choosing any arbitrary byte > value other than "17" does not make the resulting Git any better nor > worse. Yes, I totally get that. We could have chosen 42 just because. > But discussing the design decision to use hardcoded value is > "more important", as that affects the balance between the end-user > experience and debuggability, and I tried to help those who do not > know the history by giving the fact that choice was made for the > latter and not for other hidden reasons, that those who would > propose to change the system may have to keep in mind. >From an end users point of view, the auto gc kicks in at random. (Maybe it's just me, but I don't keep track of the loose object count ;-) For debuggability, we could design a system that allows for debugging, e.g. "When GIT_AUTO_GC_BIN is set, use the number as set, otherwise take a random slot". > Sorry if you mistook it as if I were saying that it is important to > keep the design to use a hardcoded byte value. That wasn't what the > message was about. I understood very well that the choice of value was arbitrary and you do not have a convincing story as to why 17 (and not say 23, but such a story is not required, as all slots are equal from a design perspective). I do challenge the decision to take a hardcoded value, though, as it yields better properties for the end users IMHO, whereas debugging this specific case does not seem to be important to me.
Re: What's so special about objects/17/ ?
Ævar Arnfjörð Bjarmason writes: > Depending on how we're counting there's at least two. I thought you were asking "why the special sentinel is not 0{40}?" You counted the number of reasons why 0{40} is used to stand in for a real value, but that was the number I didn't find interesting in the scope of this discussion, i.e. "why the special sample is 17?" I vaguely recall we also used 0{39}1 for something else long time ago; I offhand do not recall if we still do, or we got rid of it.
Re: What's so special about objects/17/ ?
Stefan Beller writes: > On Sun, Oct 7, 2018 at 1:07 PM Junio C Hamano wrote: >> >> Junio C Hamano writes: > >> > ... >> > by general public and I do not have to explain the choice to the >> > general public ;-) >> >> One thing that is more important than "why not 00 but 17?" to answer >> is why a hardcoded number rather than a runtime random. It is for >> repeatability. > > Let's talk about repeatability vs statistics for a second. ;-) Oh, I think I misled you by saying "more important". I didn't mean that it is more important to stick to the "use hardcoded value" design decision than sticking to "use 17". I've made sure that everybody would understnd choosing any arbitrary byte value other than "17" does not make the resulting Git any better nor worse. But discussing the design decision to use hardcoded value is "more important", as that affects the balance between the end-user experience and debuggability, and I tried to help those who do not know the history by giving the fact that choice was made for the latter and not for other hidden reasons, that those who would propose to change the system may have to keep in mind. Sorry if you mistook it as if I were saying that it is important to keep the design to use a hardcoded byte value. That wasn't what the message was about.
Re: What's so special about objects/17/ ?
On Sun, Oct 7, 2018 at 1:07 PM Junio C Hamano wrote: > > Junio C Hamano writes: > > ... > > by general public and I do not have to explain the choice to the > > general public ;-) > > One thing that is more important than "why not 00 but 17?" to answer > is why a hardcoded number rather than a runtime random. It is for > repeatability. Let's talk about repeatability vs statistics for a second. ;-) If I am a user and I were really into optimizing my loose object count for some reason, so I would want to choose a low number of gc.auto. Let's say I go with 128. At the low end of loose objects the approximation is yielding some high relative errors. This is because of the granularity, i.e. gc would implicitly estimate the loose objects to be 0 or 256 or 512, (or more) if there is 0, 1, 2 (or more) loose objects in the objects/17. As each object can be viewed following an unfair coin flip (With a chance of 1/256 it is in objects/17), the distribution in objects/17 (and hence any other objects/XX bin) follows the Bernoulli distribution. If I do have say about 157 loose objects (and having auto.gc configured anywhere in 1..255), then the probability to not gc is 54% (as that is the probability to have 0 objects in /17, following probability mass function of the Bernoulli distribution, (i.e. Pr(0 objects) = (157 over 0) x (1/256)^0 x (255/256)^157)) As it is repeatable (by picking the same /17 every time), I can run "gc --auto" multiple times and still have 157 loose objects, despite wanting to have only 128 loose objects at a 54% chance. If we'd roll the 256 dice every time to pick a different bin, then we might hit another bin and gc in the second or third gc, which would be more precise on average. By having repeatability we allow for these numbers to be far off more often when configuring small numbers. I think that is the right choice, as we probably do not care about the exactness of auto-gc for small numbers, as it is a performance thing anyway. Although documenting it properly might be a challenge. The current wording of auto.gc seems to suggest that we are right for the number as we compute it via the implying the expected value, (i.e. we pick a bin and multiply the fullness of the bin by the number of bins to estimate the whole fullness, see the mean=n p on [1]) I think a user would be far more interested in giving an upper bound, i.e. expressing something like "I will have at most $auto.gc objects before gc kicks in" or "The likelihood to exceed the $auto.gc number of loose objects by $this much is less than 5%", for which the math would be more complicated, but easier to document with the words of statistics. [1] https://en.wikipedia.org/wiki/Binomial_distribution Stefan
Re: What's so special about objects/17/ ?
On Sun, Oct 07 2018, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason writes: > >> 1. We still have this check of objects/17/ in builtin/gc.c today. Why >>objects/17/ and not e.g. objects/00/ to go with other 000* magic such >>as the SHA-1?d Statistically >>it doesn't matter, but 17 seems like an odd thing to pick at random >>out of 00..ff, does it have any significance? > > There is no "other 000* magic such as ...". There is only one 0{40} > magic and that one must be memorable and explainable. Depending on how we're counting there's at least two. We also use as a placeholder for "couldn't read a ref" in addition or "this is a placeholder for an invalid ref" in addition to how it's used to signify creation/deletion to the in the likes of the pre-receive hook: $ echo hello > .git/refs/something $ git fsck [...] error: refs/something: invalid sha1 pointer $ > .git/refs/something $ git fsck [...] error: refs/something: invalid sha1 pointer This is because the refs backend will memzero the oid struct, and if we fail to read things it'll still be zero'd out. This manifests e.g. in this confusing fsck output, due to a bug where GitLab will write empty refs/keep-around/* refs sometimes: https://gitlab.com/gitlab-org/gitlab-ce/issues/44431 > The 1/256 sample can be any one among 256. Just like the date > string on the first line of the output to be used as the /etc/magic > signature by format-patch, it was an arbitrary choice, rather than a > random choice, and unlike 0{40} this does not have to be memorable > by general public and I do not have to explain the choice to the > general public ;-) I wanted to elaborate on the explanation for "gc.auto" in git-config. Now we just say "approximately 6700". Since this behavior has been really stable for a long time we could say we sample 1/256 of the .git/objects/?? dirs, and this explains any perceived discrepancies between the 6700 number and $(find .git/objects/?? -type f | wc -l). >> 2. It seems overly paranoid to be checking that the files in >> .git/objects/17/ look like a SHA-1. > > There is no other reason than futureproofing. We were paying cost > to open and scan the directory anyway, and checking that we only > count the loose object files was (and still is) a sensible thing to > do to allow us not even worry about the other kind of things we > might end up creating there. Makes sense. Just wanted to ask if it was that or some workaround for historical files being there.
Re: What's so special about objects/17/ ?
Johannes Sixt writes: > Am 07.10.18 um 21:06 schrieb Ævar Arnfjörð Bjarmason: >> Picking any one number is explained in the comment. I'm asking why 17 in >> particular not for correctness reasons but as a bit of historical lore, >> and because my ulterior is to improve the GC docs. >> >> The number in that comic is 4 (and no datestamp on when it was >> published). Are you saying Junio's patch is somehow a reference to that >> xkcd in particular, or that it's just a funny reference in this context? > > No lore, AFAIR. It's just a random number, determined by a fair dice > roll or something ;) As I already said, I did not pick the number randomly, but rather arbitrarily, and it is not 00 because the chosen number (unlike the 0{40} magic we use elsewhere) does not have to be memorable, and the choice does not have to be explainable. So people will not get any further explanation as to the reason behind that arbitrary choice, but it was not random.
Re: What's so special about objects/17/ ?
Am 07.10.18 um 21:06 schrieb Ævar Arnfjörð Bjarmason: Picking any one number is explained in the comment. I'm asking why 17 in particular not for correctness reasons but as a bit of historical lore, and because my ulterior is to improve the GC docs. The number in that comic is 4 (and no datestamp on when it was published). Are you saying Junio's patch is somehow a reference to that xkcd in particular, or that it's just a funny reference in this context? No lore, AFAIR. It's just a random number, determined by a fair dice roll or something ;) -- Hannes
Re: What's so special about objects/17/ ?
Junio C Hamano writes: > Ævar Arnfjörð Bjarmason writes: > >> 1. We still have this check of objects/17/ in builtin/gc.c today. Why >>objects/17/ and not e.g. objects/00/ to go with other 000* magic such >>as the SHA-1?d Statistically >>it doesn't matter, but 17 seems like an odd thing to pick at random >>out of 00..ff, does it have any significance? > > ... > by general public and I do not have to explain the choice to the > general public ;-) One thing that is more important than "why not 00 but 17?" to answer is why a hardcoded number rather than a runtime random. It is for repeatability.
Re: What's so special about objects/17/ ?
Ævar Arnfjörð Bjarmason writes: > 1. We still have this check of objects/17/ in builtin/gc.c today. Why >objects/17/ and not e.g. objects/00/ to go with other 000* magic such >as the SHA-1?d Statistically >it doesn't matter, but 17 seems like an odd thing to pick at random >out of 00..ff, does it have any significance? There is no "other 000* magic such as ...". There is only one 0{40} magic and that one must be memorable and explainable. The 1/256 sample can be any one among 256. Just like the date string on the first line of the output to be used as the /etc/magic signature by format-patch, it was an arbitrary choice, rather than a random choice, and unlike 0{40} this does not have to be memorable by general public and I do not have to explain the choice to the general public ;-) > 2. It seems overly paranoid to be checking that the files in > .git/objects/17/ look like a SHA-1. There is no other reason than futureproofing. We were paying cost to open and scan the directory anyway, and checking that we only count the loose object files was (and still is) a sensible thing to do to allow us not even worry about the other kind of things we might end up creating there.
Re: What's so special about objects/17/ ?
On Sun, Oct 07 2018, Johannes Sixt wrote: > Am 07.10.18 um 20:28 schrieb Ævar Arnfjörð Bjarmason: >> In 2007 Junio wrote >> (https://public-inbox.org/git/7vr6lcj2zi@gitster.siamese.dyndns.org/): >> >> +static int need_to_gc(void) >> +{ >> + /* >> +* Quickly check if a "gc" is needed, by estimating how >> +* many loose objects there are. Because SHA-1 is evenly >> +* distributed, we can check only one and get a reasonable >> +* estimate. >> +*/ > >> 1. We still have this check of objects/17/ in builtin/gc.c today. Why >> objects/17/ and not e.g. objects/00/ to go with other 000* magic such >> as the SHA-1? Statistically >> it doesn't matter, but 17 seems like an odd thing to pick at random >> out of 00..ff, does it have any significance? > > The reason is explained in the comment. And, BTW, you do know about > this one: https://xkcd.com/221/ don't you? (TLDR: the title is "Random > Number") Picking any one number is explained in the comment. I'm asking why 17 in particular not for correctness reasons but as a bit of historical lore, and because my ulterior is to improve the GC docs. The number in that comic is 4 (and no datestamp on when it was published). Are you saying Junio's patch is somehow a reference to that xkcd in particular, or that it's just a funny reference in this context? >> 2. It seems overly paranoid to be checking that the files in >>.git/objects/17/ look like a SHA-1. If we have stuff not generated by >>git in .git/objects/??/ we probably have bigger problems than >>prematurely triggering auto gc, can this just be removed as >>redundant. Was this some check e.g. expecting that this would need to >>deal with tempfiles in these directories that we created at the time >>(but no longer do?)? > > It's not about that there are SHA-1s in there, it's about how many > there are. Right, I'm wondering if it couldn't be replaced by some general path.c "number_of_files_in_dir" helper. I.e. why this code is being paranoid about ignoring the likes of .git/objects/17/{foo,bar,some-other-garbage}. A number_of_files_in_dir() would obviously need to ignore "." and "..".
Re: What's so special about objects/17/ ?
Am 07.10.18 um 20:28 schrieb Ævar Arnfjörð Bjarmason: In 2007 Junio wrote (https://public-inbox.org/git/7vr6lcj2zi@gitster.siamese.dyndns.org/): +static int need_to_gc(void) +{ + /* + * Quickly check if a "gc" is needed, by estimating how + * many loose objects there are. Because SHA-1 is evenly + * distributed, we can check only one and get a reasonable + * estimate. + */ 1. We still have this check of objects/17/ in builtin/gc.c today. Why objects/17/ and not e.g. objects/00/ to go with other 000* magic such as the SHA-1? Statistically it doesn't matter, but 17 seems like an odd thing to pick at random out of 00..ff, does it have any significance? The reason is explained in the comment. And, BTW, you do know about this one: https://xkcd.com/221/ don't you? (TLDR: the title is "Random Number") 2. It seems overly paranoid to be checking that the files in .git/objects/17/ look like a SHA-1. If we have stuff not generated by git in .git/objects/??/ we probably have bigger problems than prematurely triggering auto gc, can this just be removed as redundant. Was this some check e.g. expecting that this would need to deal with tempfiles in these directories that we created at the time (but no longer do?)? It's not about that there are SHA-1s in there, it's about how many there are. -- Hannes
What's so special about objects/17/ ?
In 2007 Junio wrote (https://public-inbox.org/git/7vr6lcj2zi@gitster.siamese.dyndns.org/): +static int need_to_gc(void) +{ + /* +* Quickly check if a "gc" is needed, by estimating how +* many loose objects there are. Because SHA-1 is evenly +* distributed, we can check only one and get a reasonable +* estimate. +*/ + char path[PATH_MAX]; + const char *objdir = get_object_directory(); + DIR *dir; + struct dirent *ent; + int auto_threshold; + int num_loose = 0; + int needed = 0; + + if (sizeof(path) <= snprintf(path, sizeof(path), "%s/17", objdir)) { + warning("insanely long object directory %.*s", 50, objdir); + return 0; + } + dir = opendir(path); + if (!dir) + return 0; + + auto_threshold = (gc_auto_threshold + 255) / 256; + while ((ent = readdir(dir)) != NULL) { + if (strspn(ent->d_name, "0123456789abcdef") != 38 || + ent->d_name[38] != '\0') + continue; + if (++num_loose > auto_threshold) { + needed = 1; + break; + } + } A couple of questions about this patch, which is in git.git as 2c3c439947 ("Implement git gc --auto", 2007-09-05) 1. We still have this check of objects/17/ in builtin/gc.c today. Why objects/17/ and not e.g. objects/00/ to go with other 000* magic such as the SHA-1? Statistically it doesn't matter, but 17 seems like an odd thing to pick at random out of 00..ff, does it have any significance? 2. It seems overly paranoid to be checking that the files in .git/objects/17/ look like a SHA-1. If we have stuff not generated by git in .git/objects/??/ we probably have bigger problems than prematurely triggering auto gc, can this just be removed as redundant. Was this some check e.g. expecting that this would need to deal with tempfiles in these directories that we created at the time (but no longer do?)?