Re: What's cooking in git.git (Nov 2018, #06; Wed, 21)
On Wed, Nov 21, 2018 at 6:08 PM Stephen P. Smith wrote:
>
> Picking up someone's stalled patch is one thing, picking up Linus' patch is
> in a different league.

No, I think it works the other way - my random patches are likely the _least_ important ones, simply because I can carry them around in my own tree anyway, and my workflows may be odd by most git standards.

If I'm the only user of the "human" date format, then Junio is absolutely right not to move it to the main branch.

I have some other private patches in my tree that have been mentioned on the list but that nobody else got excited about:

 - Add 'human' date mode

 - Make 'git gc --prune=now' mean 'prune after start of command'

 - ls-remote: add "--diff" option to show only refs that differ

none of which are in the least important, but that I find personally useful.

                  Linus
Re: Hash algorithm analysis
On Tue, Sep 18, 2018 at 8:18 AM Joan Daemen wrote:
>
> 3) The relatively large state in the sponge construction increases the
> generic strength against attacks when the input contains redundancy or
> has a certain form. For instance, if the input is restricted to be text in
> ASCII (such as source code), then the collision-resistance grows
> higher than the nominal 2^{c/2}. Such an effect does not exist with
> narrow-pipe Merkle-Damgård. (This may be what Linus had intuitively in mind.)

Answering just this part:

No, what I had in mind was literally just exactly the kind of attack that broke SHA1 - attacking the internal state vector directly, and not paying any penalty for it, because the state size is the same as the final hash size.

The length extension attack is just the simplest and most trivial version of that kind of attack - because the internal state vector *is* the result, and you just continue using it.

But that trivial length extension thing is not the real problem, it's just the absolutely simplest symptom of the real problem.

I think that the model where the internal state of the hash is the same width as the final result is simply broken. It was what broke SHA1, and that problem is shared with SHA2.

"Length extension" is just the simplest way to say "broken by design", imho.

Because the length extension attack is just the most trivial attack, but it isn't the fundamental problem. It was just the first and the cheapest attack found, but it was also the most special-cased and least interesting.

You need to have a very special case (with that secret at the beginning etc) to make the pure length extension attack interesting. And git has no secrets, so in that sense "length extension" by itself is totally immaterial. But the basic problem of internal hash size obviously wasn't.

So I would say that length extension is a direct result of the _real_ problem, which is that the hash exposes _all_ of the internal data.

That is what makes length extension possible - because you can just continue from a known state, and there is absolutely nothing hidden - and yes, that's a really easy special case where you don't even need to actually break the hash at all.

But I argue that it's _also_ one big part of what made SHAttered practical, and I think the underlying problem is exactly the same. When the internal state is the same size as the hash, you can attack the internal state itself for basically the same cost as attacking the whole hash. So you can pick and choose the weakest point.

Which is basically exactly what SHAttered did. No, it wasn't the trivial "just add to the end", but it used the exact same underlying weakness as one part of the attack.

*This* is why I dislike SHA2. It has basically the exact same basic weakness that we already know SHA1 fell for. The hashing details are different, and hopefully that means that there aren't the same kind of patterns that can be generated to do the "attack the internal hash state" part, but I don't understand why people seem to ignore that other fundamental issue.

Something like SHA-512/256 would have been better, but I think almost nobody does that in hardware, which was one of the big advantages of plain SHA2.

The main reason I think SHA2 is acceptable is simply that 256 bits is a lot. So even if somebody comes up with a shortcut that weakens it by tens of bits, nobody really cares. Plus I'm obviously not a cryptographer, so I didn't feel like I was going to fight it a lot.
But yes, I'd have probably gone with any of the other alternatives, because I think it's a bit silly that we're switching hashes to another hash that has (at least in part) the *exact* same issue as the one people call broken.

(And yes, the hashing details are different, so it's "exactly the same" only wrt that internal state part - not the bit-pattern finding part that made the attack on the internal state much cheaper. Real cryptographers obviously found the "figure out the weakness of the hashing" part to be the more interesting and novel work, over the trivial internal hash size part.)

That said.. The real reason I think SHA2 is the right choice was simply that there needs to be a decision, and none of the choices were *wrong*.

Sometimes just the _act_ of making a decision is more important than _what_ the decision is.

And hey, it is also likely that the reason _I_ get hung up on just the size of the internal state is exactly that, because I am _not_ a cryptographer, that kind of high-level stuff is the part I understand. When you start talking about why the exact rules of Merkle–Damgård constructions work, my eyes just glaze over.

So I'm probably - no, certainly - myopic and looking at only one part of the issue to begin with.

The end result is that I argued for more bits in the internal state (and apparently "wide" vs "narrow" pipe is the technical term), and I would have seen parallel algorithms as a bonus for the large-file case. None of which argued for SHA2. But
Re: "less -F" is broken
On Thu, Aug 16, 2018 at 9:50 AM Mark Nudelman wrote:
>
> So I'm not sure what the best solution is. Linus's proposal to disable
> the line counting stuff if -X is set seems reasonable. I will look into
> that and see if there are any issues with it.

One option that I didn't try to go for - because I just don't know the less code base well enough - is to basically make the behavior of '-F' be something like this:

 - as long as all the lines are short and well-behaved, and we haven't seen
   enough lines to fill the screen, act like 'cat' and just feed them through

 - when you fill the screen (or when you hit some other condition that makes
   you go "now I won't exit" - that could be a long line, but maybe it could
   also be the user giving keyboard input for a less command?) you send the
   init sequence and just redraw the whole screen.

That sounds like the best of both worlds.

In fact, right now "less -F" is in my opinion a bit broken in other ways that my patch doesn't fix. Do this:

    (echo 1; sleep 10; echo 2) | LESS=FX less

and with my patch it will show "1" immediately. So far so good.

But let's say that that was all the user was interested in, and the user presses 'q' to quit less. That doesn't work at all - it will wait for that full ten seconds.

That actually happens even without -F too. Wouldn't it be good to react to things like searches to highlight something (and to 'quit' for the 'never mind, already got it' case) even if there isn't enough data to fill the whole screen yet?

That said, ^C works, and this is not new behavior, so I'm just throwing this out as a "maybe a different approach would fix _both_ the -F behavior _and_ the above traditional issue"?

             Linus
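[ Editor's illustration: a minimal sketch of the streaming '-F' behaviour described in the two bullets above. This is not less's actual code; the fixed screen height, line buffering and the particular init escape sequence are all simplifying assumptions. ]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SCREEN_ROWS 24			/* assumption: fixed screen height */

/* stand-in for "send the init sequence and redraw the whole screen" */
static void become_pager(char **held, int nheld)
{
	fputs("\033[?1049h", stdout);	/* switch to the alternate screen */
	for (int i = 0; i < nheld; i++)
		fputs(held[i], stdout);
}

int main(void)
{
	char *held[SCREEN_ROWS];
	char buf[4096];
	int rows = 0, paging = 0;

	while (fgets(buf, sizeof(buf), stdin)) {
		if (paging) {
			/* real less would page interactively from here on */
			fputs(buf, stdout);
			continue;
		}
		/* screen not full yet: behave like 'cat' and show it right away */
		fputs(buf, stdout);
		fflush(stdout);
		held[rows++] = strdup(buf);
		if (rows >= SCREEN_ROWS) {
			paging = 1;
			become_pager(held, rows);
		}
	}
	return 0;	/* everything fit on one screen: just exit, like -F */
}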
Re: "less -F" is broken
On Wed, Aug 15, 2018 at 2:29 PM Ævar Arnfjörð Bjarmason wrote:
>
> Downloading & trying versions of it locally reveals that it's as of
> version 520, not 530. The last version before 520 is 487. Presumably
> it's covered by this item in the changelog:
>
>     Don't output terminal init sequence if using -F and file fits on one
>     screen[1]

Side note: that's sad, because we already use X in the default exactly for that reason. So apparently "less" was broken for us to fix something that we already had long solved.

The code basically tried to do "automatic X when F is set". And all that line_count stuff (which is what breaks) is pointless when -X is already given.

That does give a possible fix: just stop doing the line_count thing if no_init is set. So "-F" would continue to be broken, but "-FX" would work.

Something like the attached patch, perhaps?

             Linus

 main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/main.c b/main.c
index 179bd78..961a9db 100644
--- a/main.c
+++ b/main.c
@@ -59,6 +59,7 @@ extern int missing_cap;
 extern int	know_dumb;
 extern int	pr_type;
 extern int	quit_if_one_screen;
+extern int	no_init;
 
 /*
@@ -274,7 +275,7 @@ main(argc, argv)
 	{
 		if (edit_stdin())  /* Edit standard input */
 			quit(QUIT_ERROR);
-		if (quit_if_one_screen)
+		if (quit_if_one_screen && !no_init)
 			line_count = get_line_count();
 	} else
 	{
Re: "less -F" is broken
On Wed, Aug 15, 2018 at 2:29 PM Ævar Arnfjörð Bjarmason wrote:
>
> FWIW they're not linked from
> http://www.greenwoodsoftware.com/less/download.html but you can just URL
> hack and see releases http://www.greenwoodsoftware.com/less/ and change
> links like http://www.greenwoodsoftware.com/less/less-530.tar.gz to
> http://www.greenwoodsoftware.com/less/less-520.tar.gz

I should have just tried that. I just downloaded the ones linked to, made a git archive of the history, and started bisecting. Which was all pointless extra work, since it was in the last release, but whatever.

> > I've reported it as a bug in less, but I'm not sure what the reaction
> > will be, the less releases seem to be very random.
>
> Via bug-l...@gnu.org?

Heh. Another thing I didn't actually find. No, I just emailed Mark Nudelman directly, because that's what the FAQ says to do:

    "There is a list of known bugs here. If you find a bug that is not in
    the list, please send email to the author. Describe the bug in as much
    detail as possible, and I'll do what I can to help resolve the problem."

and it doesn't mention any mailing list.

> Is this report available online somewhere?

It was not all that different from the email to the git list - just giving the trivial test-case and my (limited) bisection result. The data you dug up is much more useful.

             Linus
"less -F" is broken
Sadly, as of less-530, the behavior of "less -F" is broken enough that I think git needs to potentially think about changing the defaults for the pager, or people should at least be aware of it.

Older versions of less (at least up to less-487 - March 2017) do not have this bug. There were apparently 520, 527 and 529 releases in 2017 too, but I couldn't find their sources to verify if they were already broken - but 530 (February 2018) has the problem.

The breakage is easy to see without git:

    (echo "hello"; sleep 5; echo "bye bye") | less -F

which will result in no output at all for five seconds, and then you get both lines at once as "less" exits.

It's not always obvious when using git, because when the terminal fills up, less also starts outputting, but the default options with -F are really horrible if you are looking for something uncommon, and "git log" doesn't respond at all.

On the kernel tree, this is easy to see with something like

    git log --oneline --grep="The most important one is the mpt3sas fix"

which takes a bit over 7 seconds before it shows the commit I was looking for.

In contrast, if you do

    LESS=-RX git log --oneline --grep="The most important one is the mpt3sas fix"

that (recent) commit is found and shown immediately. It still takes 7s for git to go through all history and decide "that was it", but at least you don't need to wait for the intermediate results.

I've reported it as a bug in less, but I'm not sure what the reaction will be, the less releases seem to be very random.

             Linus
Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash
On Fri, Aug 3, 2018 at 9:40 AM Junio C Hamano wrote:
>
> > [...]
> >> -  - 20-byte NewHash checksum of all of the above.
> >> +  - 20-byte SHA-256 checksum of all of the above.
> >
> > Likewise.
>
> Hmph, I've always assumed since NewHash plan was written that this
> part was not about tamper resistance but was about bit-flipping
> detection and it was deliberate to stick to 20-byte sum, truncating
> as necessary.

Yeah, in fact, since this was one area where people actually had performance issues with the hash, it might be worth considering _weakening_ the hash.

Things like the pack index files (and just the regular file index) are entirely local, and the SHA1 in those is really just a fancy CRC. It's really just "good protection against disk corruption" (it happens), and in fact you cannot use it as protection against active tampering, since there's no secret there and any active attacker that has access to your local filesystem could just recompute the hash anyway.

And because they are local anyway and aren't really transported (modulo "shared repositories" where you use them across users or legacy rsync-like synchronization), they can be handled separately from any hashing changes. The pack and index file formats have in fact been changed before.

It might make sense to either keep it as SHA1 (just to minimize any changes) or, if there are still issues with index file performance, it could even be made to use something fast-but-not-cryptographic like just making it use crc32().

Remember: one of the original core git design requirements was "corruption detection". For normal hashed objects, that came naturally, and the normal object store additionally has active tamper protection thanks to the interconnected nature of the hashes and the distribution of the objects.

But for things like packfiles and the file index, it is just a separate checksum. There is no tamper protection anyway, since anybody who can modify them directly can just recompute the hash checksum.

The fact that both of these things used SHA1 was more of a convenience issue than anything else, and the pack/index file checksum is fundamentally not cryptographic (but a cryptographic hash is obviously by definition also a very good corruption detector).

             Linus
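[ Editor's illustration of the "fancy CRC" point, not git code: a trailing checksum over a local file only needs to catch disk corruption, so something as cheap as zlib's crc32() would do. Like an unkeyed SHA-1, it offers no tamper resistance, since anyone with filesystem access can simply recompute it. ]

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* compute a crc32 trailer over a buffer, the way a pack-style file might */
static unsigned long trailer_crc(const unsigned char *data, size_t len)
{
	unsigned long crc = crc32(0L, Z_NULL, 0);	/* zlib's standard init */
	return crc32(crc, data, (uInt)len);
}

int main(void)
{
	unsigned char payload[] = "pretend this is pack index data";
	unsigned long good = trailer_crc(payload, sizeof(payload));

	payload[5] ^= 0x01;				/* simulate a flipped bit on disk */
	unsigned long bad = trailer_crc(payload, sizeof(payload));

	printf("clean: %08lx  corrupted: %08lx -> %s\n", good, bad,
	       good == bad ? "corruption missed" : "corruption detected");
	return 0;
}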
Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash
On Fri, Aug 3, 2018 at 12:20 AM Jonathan Nieder wrote:
>
> Here I'd want to put a pile of acks, e.g.:
>
>   Acked-by: Linus Torvalds
>   ..

Sure, feel free to add my Ack for this.

             Linus
Re: [RFC PATCH v2] Add 'human' date format
On Tue, Jul 24, 2018 at 2:49 PM Junio C Hamano wrote:
>
> But lack of TZ does not give me enough hint about in which context it
> happened. The fact that this was done late at night on weekend is
> indeed interesting, and I may not care what time it locally was for
> me, so perhaps this is an intended behaviour.

I'm not sure you could call it "intended". The TZ hiding did change as I was playing with this.

The first version of the patch only hid the timezone if it matched the current one. Because, as you say, the time zone can be interesting not so much because you care about *when* the commit happened, but because you care about *where* the commit happened. And that can be true even if the commit is very old. So the timezone data in some sense isn't necessarily about the date at all.

When I used it a bit more (I still have the "--date=auto" as my default for the kernel), I decided I really don't much care about the timezone. In any _individual_ case, the timezone looks fine, but when you look at many different dates, it looks really odd how it sometimes shows, and sometimes does not, particularly for old dates when it really doesn't matter for the *time* reason.

So I decided that it's better to not show the timezone at all when you show a real date.

But honestly, I don't claim to have a really strong argument. It's just a choice. Nothing says it's the absolute right choice.

I pointed out that you can use "--date=human-local" to get an even denser representation that gives you the human date without ever having a TZ. But we don't have the reverse of "-local", which would explicitly show the timezones. Again, I think this is really because the timezone is about something other than just the time. I think the "do we care *where* it was done or not?" question in many ways is entirely independent of the time question.

So right now the patch says

	hide.tz |= !hide.date;

which ends up being good for the "times are roughly the same size" (which I decided was a good thing - again, I don't really have a hugely strong argument for it, it was a matter of me playing with options).

But it would make equally much sense to say

	hide.tz |= hide.time;

and just say that the timezone is hidden if it matches the current one, or if the commit is just so old that we don't show the time at all.

OR you could just say "timezone is always interesting, because you want to know _where_ it was done regardless of _when_ it was done", and just not hide the timezone at all.

I think all are "technically valid" choices to make. The one I made was just a random personal preference, not necessarily the right one.

Could we extend on the "decorations" (like the "-local" thing)? Absolutely. I'm not sure it's worth doing, but it would certainly solve the "different people have different preferences" issue. I think that *if* we want to extend on the decorations, that would probably still be a separate patch from the basic patch.

             Linus
Re: Hash algorithm analysis
On Tue, Jul 24, 2018 at 12:01 PM Edward Thomson wrote:
>
> Switching gears, if I look at this from the perspective of the libgit2
> project, I would also prefer SHA-256 or SHA3 over blake2b. To support
> blake2b, we'd have to include - and support - that code ourselves. But
> to support SHA-256, we would simply use the system's crypto libraries
> that we already take a dependency on (OpenSSL, mbedTLS, CryptoNG, or
> SecureTransport).

I think this is probably the single strongest argument for sha256. "It's just there".

The hardware acceleration hasn't become nearly as ubiquitous as I would have hoped, and honestly, sha256 _needs_ hw acceleration more than some of the alternatives in the first place. But sha256 does have the big advantage of just having been around and existing in pretty much every single crypto library.

So I'm not a huge fan of sha256, partly because of my disappointment in the lack of hw acceleration in relevant markets (sure, it's fairly common in ARM, but nobody sane uses ARM for development because of _other_ reasons). And partly because I don't like how the internal data size is the same as the final hash.

But that second issue is an annoyance with it, not a real issue - in the absence of weaknesses it's a non-issue, and any future weaknesses might affect any other choice too.

So hey, if people are actually at the point where the lack of choice holds up development, we should just pick one. And despite what I've said in this discussion, sha256 would have been my first choice, just because it's the "obvious" choice. The exact same way that SHA1 was the obvious choice (for pretty much the same infrastructure reasons) back in 2005.

And maybe the hw acceleration landscape will actually improve. I think AMD actually does do the SHA extensions in Zen/TR.

So I think Junio should just pick one. And I'll stand up and say

  "Let's just pick one. And sha256 is certainly the safe choice in that it
   won't strike anybody as being the _wrong_ choice per se, even if not
   everybody will necessarily agree it's the _best_ choice".

But in the end I think Junio should be the final arbiter. I think all of the discussed choices are perfectly fine in practice.

Btw, the one thing I *would* suggest is that the git community also just says that the current hash is not SHA1, but SHA1DC. Support for "plain" SHA1 should be removed entirely. If we add a lot of new infrastructure to support a new more secure hash, we should not have the old fallback for the known-weak one. Just make SHA1DC the only one git can be built with.

             Linus
Re: Hash algorithm analysis
On Mon, Jul 23, 2018 at 5:48 AM Sitaram Chamarty wrote:
>
> I would suggest (a) hash size of 256 bits and (b) choice of any hash
> function that can produce such a hash. If people feel strongly that 256
> bits may also turn out to be too small (really?) then a choice of 256 or
> 512, but not arbitrary sizes.

Honestly, what's the expected point of 512-bit hashes?

The _only_ point of a 512-bit hash is that it's going to grow objects in incompressible ways, and use more memory. Just don't do it.

If somebody can break a 256-bit hash, you have two choices:

 (a) the hash function itself was broken, and 512 bits isn't the solution
     to it anyway, even if it can certainly hide the problem

 (b) you had some "new math" kind of unexpected breakthrough, which means
     that 512 bits might not be much better either.

Honestly, the number of particles in the observable universe is on the order of 2**256. It's a really really big number.

Don't make the code base more complex than it needs to be. Make an informed technical decision, and say "256 bits is a *lot*".

The difference between engineering and theory is that engineering makes trade-offs. Good software is well *engineered*, not theorized.

Also, I would suggest that git default to "abbrev-commit=40", so that nobody actually *sees* the new bits by default. So the perl scripts etc that use "[0-9a-f]{40}" as a hash pattern would just silently continue to work.

Because backwards compatibility is important (*)

             Linus

(*) And 2**160 is still a big big number, and hasn't really been a practical problem, and SHA1DC is likely a good hash for the next decade or longer.
Re: Hash algorithm analysis
On Sat, Jul 21, 2018 at 3:39 PM Johannes Schindelin wrote:
>
> Do you really want to value contributors' opinion more than
> cryptographers'? I mean, that's exactly what got us into this hard-coded
> SHA-1 mess in the first place.

Don't be silly.

Other real cryptographers consider SHA256 to be a problem. Really. It's amenable to the same hack on the internal hash that made for the SHAttered break.

So your argument that "cryptographers prefer SHA256" is simply not true. Your real argument is that you know at least one cryptographer that you work with who prefers it.

Don't try to make that into some generic "cryptographers prefer it". It's not like cryptographers have issues with blake2b either, afaik. And blake2b really _does_ have very real advantages.

If you can actually point to some "a large percentage of cryptographers prefer it", you'd have a point. But as it is, you don't have data, you have an anecdote, and you try to use that anecdote to put down other people's opinions.

Intellectually dishonest, that is.

             Linus
Re: [RFC PATCH v2] Add 'human' date format
On Wed, Jul 11, 2018 at 2:24 PM Ævar Arnfjörð Bjarmason wrote:
>
> I think that's true for the likes of linux.git & git.git, but a lot of
> users of git say work in some corporate setting entirely or mostly in
> the same timezone.
>
> In that case, knowing if some commit whose sole message was "fix"[1] was
> made at 3am or in the afternoon, even if it's really old, is really
> useful information, even years later.

Heh. Maybe. But if you care about that kind of information, would you actually want to use the "human" date? Wouldn't you want to use the strftime thing instead, which gets you whatever field you care about, and gets it consistently regardless of how old the data is?

That said, I do acknowledge that the "human" format may be a bit inflexible and ad-hoc. Of course some more generic way that allowed arbitrary rules might be better for some uses.

I'll just explain the cases that made me zero in on what that last patch did:

 (a) I do like the "relative" date for recent stuff. Quite often, I look at
     how recent the commits are, for example, and then I really like seeing
     "2 hours ago" rather than a time with a timezone (which is a lot harder
     for me to mentally parse).

     This was the primary impetus for my original "auto" patch many years
     ago, that was (rightly) not merged. It really boiled down to just
     "default or relative, depending on how recent it was".

 (b) I noticed that I was distracted by dates that were *too* terse.

     My first patch had _just_ the time when it was today and the same
     timezone (but older than two hours, so the original relative logic
     didn't trigger). That initially sounded great to me, which is why it
     was like that the first time. But after _using_ it for a while, I
     actually found that it didn't have enough context for me (visually)
     to really trigger my date parsing at all. So "five hours ago" actually
     parsed better than just "9:48" to me. I didn't think it would do that,
     but it did.

     Which was why I changed the "relative" time to trigger every time if
     it was the exact same date (and in the past) - just to avoid the
     really terse model.

 (c) when I played around with other commands than just "git log", I also
     noticed that a consistent length mattered.

     Again, my first version was more along the lines of "if it's long ago,
     just use the full format, exactly like the default date". It wasn't
     *quite* that, because it would always skip the seconds, but it was
     close. And with "git log", that worked fine, because dates were fairly
     uniformly increasing, so the date format would slowly get longer, and
     that was fine.

     But then when I played with "git blame -C --date=human", I noticed
     that not only did the human date actually make sense there too, it
     actually made it easier for me to read - and that in particular, the
     "extra" info was just annoying.

So now I find that shortened "only show the date" format to be really good _particularly_ for "git blame". You can see very clearly whether it's something recent or something old.

Maybe my use of "git blame" is unusual, but I don't think so. I tend to do "git blame -C" when I'm looking for a bug, and then seeing something like this:

    ...
    Apr 16 2005        437)
    Apr 16 2005        438)
    Jan 14 2016        439)
    Apr 16 2005        440)
    Apr 16 2005        441)
    Apr 16 2005        442)
    Thu Jun 14 15:26   443)
    Thu Jun 14 15:26   444)
    Thu Jun 14 15:26   445)
    Thu Jun 14 15:26   446)
    Thu Jun 14 15:26   447)
    Thu Jun 14 15:26   448)
    Thu Jun 14 15:26   449)
    Thu Jun 14 15:26   450)
    Apr 16 2005        451)
    Jul 30 2012        452)
    Jul 30 2012        453)
    Feb 13 2012        454)
    Apr 16 2005        455)
    Apr 16 2005        456)

in that date field (yeah, that happens to be "kernel/fork.c" in the current kernel - I just edited out all the other stuff than time and line number) is actually very visually easy to see what parts are old, and which ones are recent, because it changes the format pretty clearly and unambiguously, without changing the size of that field _dramatically_.

(Sure, the size changes, but it's not a radical difference, it's a fairly small variation, and the variation only highlights the different time range, without making it completely unbalanced.)

Anyway, enough excuses. I'm just trying to explain some of the things that I noticed simply _while_ making some of the decisions I made.

Are they the "right" decisions? I don't know. But I've been running with that

    git config --add log.date auto

in my kernel repo since I posted the patches, and so far I'm still liking it.

             Linus
Re: [RFC PATCH v2] Add 'human' date format
[ Trying to come up with crazy special cases ]

On Wed, Jul 11, 2018 at 1:49 PM Linus Torvalds wrote:
>
> But it could be anything else invalid, of course. It could be MAX_INT
> or something like that.

That might be better. A timezone of -1 isn't actually a valid timezone, but I guess you could create a commit by hand that had "-0001" as the timezone. You can't do that with something like MAX_INT, without fsck complaining - since it has to be exactly four digits.

> The clearing of "human_tm" is done for a similar reason: the code does
>
>         hide.year = tm->tm_year == human_tm->tm_year;
>
> (and then later just checks "if (human_tm->tm_year)") knowing that a
> non-zero tm_year will only ever happen for human_tz (and that 1900 is
> not a valid git date, even though I guess in theory you could do it).

Actually, the 1900 should be safe, because 'timestamp_t' is unsigned. So a valid timestamp really can't be before 1970.

Of course, you can probably try to mess with it by giving values that don't actually fit, because sometimes we do convert mindlessly from 'timestamp_t' to 'time_t'. In particular, if you use the "default-local" time, it will use that

	static struct tm *time_to_tm_local(timestamp_t time)
	{
		time_t t = time;
		return localtime(&t);
	}

and not check the range of the timestamp.

But other proper time stamp functions will actually do range checking with "date_overflows()", so in general that whole assumption of "a real git date cannot be in the year 1900" is valid.

             Linus
Re: [RFC PATCH v2] Add 'human' date format
On Wed, Jul 11, 2018 at 1:34 PM Andrei Rybak wrote:
>
> > +	int human_tz = -1;
>
> Is -1 an OK initial value for timezone if local_time_tzoffset returns
> negative values as well? It looks like it doesn't matter from a functional

The value was intentionally picked to *not* be a valid timezone value, so that the comparison of "human_tz == tz" would always fail if DATE_HUMAN is not selected.

But it could be anything else invalid, of course. It could be MAX_INT or something like that.

By picking something that isn't possibly a real timezone value, later code can do things like

	hide.tz = local || tz == human_tz;

without worrying about whether it's really DATE_HUMAN or not.

The clearing of "human_tm" is done for a similar reason: the code does

	hide.year = tm->tm_year == human_tm->tm_year;

(and then later just checks "if (human_tm->tm_year)") knowing that a non-zero tm_year will only ever happen for human_tz (and that 1900 is not a valid git date, even though I guess in theory you could do it).

             Linus
[RFC PATCH v2] Add 'human' date format
From: Linus Torvalds

This adds --date=human, which skips the timezone if it matches the current time-zone, and doesn't print the whole date if that matches (ie skip printing year for dates that are "this year", but also skip the whole date itself if it's in the last few days and we can just say what weekday it was).

For really recent dates (same day), use the relative date stamp, while for old dates (year doesn't match), don't bother with time and timezone.

Also add 'auto' date mode, which defaults to human if we're using the pager. So you can do

    git config --add log.date auto

and your "git log" commands will show the human-legible format unless you're scripting things.

Note that this time format still shows the timezone for recent enough events (but not so recent that they show up as relative dates). You can combine it with the "-local" suffix to never show timezones for an even more simplified view.

Signed-off-by: Linus Torvalds
---

Slightly updated version after playing with this more.

This tries to make the length somewhat more consistent (and shorter), which came about when looking at this in "git blame" output. Once you're talking "last year" patches, you don't tend to care about time of day or timezone.

So the longest date is basically "Thu Oct 19 16:00", because if you show the year (four characters), you don't show the time (five characters). And the timezone (five characters) is only shown if not showing the date (5-6 characters).

Also, because the relative time is now handled entirely inside the show_date_normal() function, I could undo some of the changes to show_date() that were updating date->mode and date->local. So the patch has actually shrunk a bit, I think.

 builtin/blame.c |   4 ++
 cache.h         |   1 +
 date.c          | 130
 3 files changed, 115 insertions(+), 20 deletions(-)

diff --git a/builtin/blame.c b/builtin/blame.c
index 5a0388aae..7b6235321 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -917,6 +917,10 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
 		 */
 		blame_date_width = utf8_strwidth(_("4 years, 11 months ago")) + 1; /* add the null */
 		break;
+	case DATE_HUMAN:
+		/* If the year is shown, no time is shown */
+		blame_date_width = sizeof("Thu Oct 19 16:00");
+		break;
 	case DATE_NORMAL:
 		blame_date_width = sizeof("Thu Oct 19 16:00:04 2006 -0700");
 		break;
diff --git a/cache.h b/cache.h
index d49092d94..8a6810ee6 100644
--- a/cache.h
+++ b/cache.h
@@ -1428,6 +1428,7 @@ extern struct object *peel_to_type(const char *name, int namelen,
 struct date_mode {
 	enum date_mode_type {
 		DATE_NORMAL = 0,
+		DATE_HUMAN,
 		DATE_RELATIVE,
 		DATE_SHORT,
 		DATE_ISO8601,
diff --git a/date.c b/date.c
index 49f943e25..4486c028a 100644
--- a/date.c
+++ b/date.c
@@ -77,22 +77,16 @@ static struct tm *time_to_tm_local(timestamp_t time)
 }
 
 /*
- * What value of "tz" was in effect back then at "time" in the
- * local timezone?
+ * Fill in the localtime 'struct tm' for the supplied time,
+ * and return the local tz.
  */
-static int local_tzoffset(timestamp_t time)
+static int local_time_tzoffset(time_t t, struct tm *tm)
 {
-	time_t t, t_local;
-	struct tm tm;
+	time_t t_local;
 	int offset, eastwest;
 
-	if (date_overflows(time))
-		die("Timestamp too large for this system: %"PRItime, time);
-
-	t = (time_t)time;
-	localtime_r(&t, &tm);
-	t_local = tm_to_time_t(&tm);
-
+	localtime_r(&t, tm);
+	t_local = tm_to_time_t(tm);
 	if (t_local == -1)
 		return 0; /* error; just use + */
 	if (t_local < t) {
@@ -107,6 +101,20 @@ static int local_tzoffset(timestamp_t time)
 	return offset * eastwest;
 }
 
+/*
+ * What value of "tz" was in effect back then at "time" in the
+ * local timezone?
+ */
+static int local_tzoffset(timestamp_t time)
+{
+	struct tm tm;
+
+	if (date_overflows(time))
+		die("Timestamp too large for this system: %"PRItime, time);
+
+	return local_time_tzoffset((time_t)time, &tm);
+}
+
 void show_date_relative(timestamp_t time, int tz,
 			const struct timeval *now,
 			struct strbuf *timebuf)
@@ -191,9 +199,80 @@ struct date_mode *date_mode_from_type(enum date_mode_type type)
 	return &mode;
 }
 
+static void show_date_normal(struct strbuf *buf, timestamp_t time, struct tm *tm, int tz, struct tm *human_tm, int human_tz, int local)
+{
+	struct {
+		unsigned int	year:1,
+				date:1,
Re: [RFC PATCH] Add 'human' date format
On Sat, Jul 7, 2018 at 12:58 PM Linus Torvalds wrote:
>
> I'm playing with making all "today" dates just use the relative
> format.

Here's the incremental patch for that if people want to compare the output.

With this, you never get the "just time" case, because that will turn into "2 hours ago" or similar. But you will get "Fri 19:45" for something that happened yesterday.

So examples from my kernel logs look something like this:

    2 hours ago
    Fri 19:45
    Fri 10:44 +1000
    Fri Jun 22 15:46
    Tue Jun 19 15:41 -0600
    Thu Jun 15 12:57 2017 +0300

depending on how long ago they were and whether they were in the same timezone etc.

             Linus

 date.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/date.c b/date.c
index 9809ac334..de0b03cf4 100644
--- a/date.c
+++ b/date.c
@@ -199,7 +199,7 @@ struct date_mode *date_mode_from_type(enum date_mode_type type)
 	return &mode;
 }
 
-static void show_date_normal(struct strbuf *buf, struct tm *tm, int tz, struct tm *human_tm, int human_tz, int local)
+static void show_date_normal(struct strbuf *buf, timestamp_t time, struct tm *tm, int tz, struct tm *human_tm, int human_tz, int local)
 {
 	struct {
 		unsigned int	year:1,
@@ -225,6 +225,14 @@ static void show_date_normal(struct strbuf *buf, struct tm *tm, int tz, struct t
 	}
 	}
 
+	/* Show "today" times as just relative times */
+	if (hide.wday) {
+		struct timeval now;
+		gettimeofday(&now, NULL);
+		show_date_relative(time, tz, &now, buf);
+		return;
+	}
+
 	/* Always hide seconds for human-readable */
 	hide.seconds = human_tm->tm_year > 0;
 
@@ -268,10 +276,6 @@ const char *show_date(timestamp_t time, int tz, const struct date_mode *mode)
 	/* Fill in the data for "current time" in human_tz and human_tm */
 	human_tz = local_time_tzoffset(now.tv_sec, &human_tm);
 
-	/* Special case: if it's less than an hour ago, use relative time */
-	if (time - now.tv_sec < 60 * 60)
-		type = DATE_RELATIVE;
-
 	/* Don't print timezone if it matches */
 	if (tz == human_tz)
 		local = 1;
@@ -333,7 +337,7 @@ const char *show_date(timestamp_t time, int tz, const struct date_mode *mode)
 		strbuf_addftime(&timebuf, mode->strftime_fmt, tm, tz, !local);
 	else
-		show_date_normal(&timebuf, tm, tz, &human_tm, human_tz, local);
+		show_date_normal(&timebuf, time, tm, tz, &human_tm, human_tz, local);
 
 	return timebuf.buf;
 }
Re: [RFC PATCH] Add 'human' date format
On Sat, Jul 7, 2018 at 12:39 PM Linus Torvalds wrote:
>
> to me, but with "--date=human", right now it just says
>
>     Date:   12:21

Side note: this is probably my least favorite of the formats.

I'm playing with making all "today" dates just use the relative format, and then the "a couple of days ago" dates would then have the

>     Date:   Fri 19:45

format.

But since it's _explicitly_ about a "human legible" format, I think the format could be a bit fluid, and things like that might be tweaked later. Anybody who would script this would be crazy.

I'm more looking for "no, that's just stupid" comments, or "no, you're not the only one who has wanted this" kinds of replies.

             Linus
[RFC PATCH] Add 'human' date format
From: Linus Torvalds

This adds --date=human, which skips the timezone if it matches the current time-zone, and doesn't print the whole date if that matches (ie skip printing year for dates that are "this year", but also skip the whole date itself if it's in the last few days and we can just say what weekday it was).

For really recent dates (within the last hour), use the relative date stamp.

Also add 'auto' date mode, which defaults to human if we're using the pager. So you can do

    git config --add log.date auto

and your "git log" commands will show the human-legible format unless you're scripting things.

Signed-off-by: Linus Torvalds
---

Ok, I tried something like this long long ago, but that was a fairly nasty patch that just defaulted to "--date=relative" for recent dates, and then did the usual default for anything else.

But the issue kept bugging me, and this is a slightly more sane model (?).

It keeps the notion of "let's use --date=relative for very recent dates", but for anything that is more than an hour old, it just uses a simplified date.

For example, for time stamps that are "today", just show the time. And if the year or the time zone matches the current year or timezone, skip them. And don't bother showing seconds, because humans won't care.

The end result is a slightly simplified view. So for example, the date for this commit right now looks like

    Date:   Sat Jul 7 12:21:26 2018 -0700

to me, but with "--date=human", right now it just says

    Date:   12:21

(and maybe for a US locale, I should make it do the AM/PM thing).

This is marked RFC because

 (a) maybe I'm the only one who has ever wanted the simplified dates

 (b) the simplification rules themselves might be worthy of discussion.

For example, I also simplified the "a couple of days ago" case, so that it shows

    Date:   Fri 19:45

for one of my kernel commits that happened yesterday. The date didn't match _exactly_, but it's the same month, and just a few days ago, so it just shows the weekday.

Is that easier to read? Maybe. I kind of like it.

 builtin/blame.c |   1 +
 cache.h         |   1 +
 date.c          | 139 +---
 3 files changed, 110 insertions(+), 31 deletions(-)

diff --git a/builtin/blame.c b/builtin/blame.c
index 5a0388aae..27c64b1c8 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -917,6 +917,7 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
 		 */
 		blame_date_width = utf8_strwidth(_("4 years, 11 months ago")) + 1; /* add the null */
 		break;
+	case DATE_HUMAN:
 	case DATE_NORMAL:
 		blame_date_width = sizeof("Thu Oct 19 16:00:04 2006 -0700");
 		break;
diff --git a/cache.h b/cache.h
index d49092d94..8a6810ee6 100644
--- a/cache.h
+++ b/cache.h
@@ -1428,6 +1428,7 @@ extern struct object *peel_to_type(const char *name, int namelen,
 struct date_mode {
 	enum date_mode_type {
 		DATE_NORMAL = 0,
+		DATE_HUMAN,
 		DATE_RELATIVE,
 		DATE_SHORT,
 		DATE_ISO8601,
diff --git a/date.c b/date.c
index 49f943e25..9809ac334 100644
--- a/date.c
+++ b/date.c
@@ -77,22 +77,16 @@ static struct tm *time_to_tm_local(timestamp_t time)
 }
 
 /*
- * What value of "tz" was in effect back then at "time" in the
- * local timezone?
+ * Fill in the localtime 'struct tm' for the supplied time,
+ * and return the local tz.
  */
-static int local_tzoffset(timestamp_t time)
+static int local_time_tzoffset(time_t t, struct tm *tm)
 {
-	time_t t, t_local;
-	struct tm tm;
+	time_t t_local;
 	int offset, eastwest;
 
-	if (date_overflows(time))
-		die("Timestamp too large for this system: %"PRItime, time);
-
-	t = (time_t)time;
-	localtime_r(&t, &tm);
-	t_local = tm_to_time_t(&tm);
-
+	localtime_r(&t, tm);
+	t_local = tm_to_time_t(tm);
 	if (t_local == -1)
 		return 0; /* error; just use + */
 	if (t_local < t) {
@@ -107,6 +101,20 @@ static int local_tzoffset(timestamp_t time)
 	return offset * eastwest;
 }
 
+/*
+ * What value of "tz" was in effect back then at "time" in the
+ * local timezone?
+ */
+static int local_tzoffset(timestamp_t time)
+{
+	struct tm tm;
+
+	if (date_overflows(time))
+		die("Timestamp too large for this system: %"PRItime, time);
+
+	return local_time_tzoffset((time_t)time, &tm);
+}
+
 void show_date_relative(timestamp_t time, int tz,
 			const struct timeval *now,
 			struct strbuf *timebuf)
@@ -191,27 +199,94 @@ struct date_mode *date_mode_from_type(enum date_mode_type type)
 	return &mode;
 }
 
+static void show_date_no
Re: Hash algorithm analysis
On Mon, Jun 11, 2018 at 4:27 PM Ævar Arnfjörð Bjarmason wrote:
>
> > And no, I'm not a cryptographer. But honestly, length extension
> > attacks were how both md5 and sha1 were broken in practice, so I'm
> > just going "why would we go with a crypto choice that has that known
> > weakness? That's just crazy".
>
> What do you think about Johannes's summary of this being a non-issue for
> Git in
> https://public-inbox.org/git/alpine.DEB.2.21.1.1706151122180.4200@virtualbox/
> ?

I agree that the fact that git internal data is structured and all meaningful (and doesn't really have ignored state) makes it *much* harder to attack the basic git objects, since you not only have to generate a good hash, the end result has to also *parse* and there is not really any hidden non-parsed data that you can use to hide the attack.

And *if* you are using git for source code, the same is pretty much true even for the blob objects - an attacking object will stand out like a sore thumb in "diff" etc.

So I don't disagree with Johannes in that sense: I think git does fundamentally tend to have some extra validation in place, and there's a reason why the examples for both the md5 and the sha1 attack were pdf files.

That said, even if git internal ("metadata") objects like trees and commits tend to not have opaque parts to them and are thus pretty hard to attack, the blob objects are still an attack vector for projects that use git for non-source-code (and even source projects do embed binary files - including pdf files - even though they might not be "as interesting" to attack). So you do want to protect those too.

And hey, protecting the metadata objects is good just to protect against annoyances. Sure, you should always sanity check the object at receive time anyway, but even so, if somebody is able to generate a blob object that hashes to the same hash as a metadata object (ie tree or commit), that really could be pretty damn annoying.

And the whole "intermediate hashed state is same size as final hash state" just _fundamentally_ means that if you find a weakness in the hash, you can now attack that weakness without having to worry about the attack being fundamentally more expensive.

That's essentially what SHAttered relied on. It didn't rely on a secret and a hash and length extension, but it *did* rely on the same mechanism that a length extension attack relies on, where you can basically attack the state in the middle with no extra cost.

Maybe some people don't consider it a length extension attack for that reason, but it boils down to much the same basic situation where you can attack the internal hash state and cause a state collision. And you can try to find the patterns that then cause that state collision when you've found a weakness in the hash.

With SHA3 or k12, you can obviously _also_ try to attack the hash state and cause a collision, but because the intermediate state is much bigger than the final hash, you're just making things *way* harder for yourself if you try that.

             Linus
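[ Editor's illustration: a toy mixing function, not any real hash and not git code. It only shows the structural point made above - when the chaining state is exactly as wide as the published digest, the digest *is* a resumable internal state. Real Merkle-Damgård hashes add length padding that an extender has to account for; the toy skips padding to keep the point visible. ]

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define STATE_WORDS 4	/* 128-bit state == 128-bit digest: a "narrow pipe" */

/* toy stand-in for a compression function; NOT cryptographically meaningful */
static void toy_compress(uint32_t state[STATE_WORDS], const uint8_t block[16])
{
	for (int i = 0; i < STATE_WORDS; i++) {
		uint32_t w;
		memcpy(&w, block + 4 * i, 4);
		state[i] = (state[i] ^ w) * 2654435761u + (uint32_t)i;
	}
}

/* hash whole 16-byte blocks only; the digest is simply the final state */
static void toy_hash(const uint8_t *msg, size_t len, uint32_t out[STATE_WORDS])
{
	uint32_t state[STATE_WORDS] = { 1, 2, 3, 4 };	/* fixed IV */

	for (; len >= 16; msg += 16, len -= 16)
		toy_compress(state, msg);
	memcpy(out, state, sizeof(uint32_t) * STATE_WORDS);
}

int main(void)
{
	const uint8_t prefix[16] = "0123456789abcde";
	const uint8_t suffix[16] = "EXTENSION_BLOCK";
	uint8_t both[32];
	uint32_t digest[STATE_WORDS], honest[STATE_WORDS], resumed[STATE_WORDS];

	toy_hash(prefix, sizeof(prefix), digest);	/* published digest */

	/* honest hash of prefix || suffix */
	memcpy(both, prefix, 16);
	memcpy(both + 16, suffix, 16);
	toy_hash(both, sizeof(both), honest);

	/* "attacker": resume from the digest alone, never seeing the prefix */
	memcpy(resumed, digest, sizeof(resumed));
	toy_compress(resumed, suffix);

	printf("resumed state matches honest hash: %s\n",
	       memcmp(resumed, honest, sizeof(resumed)) ? "no" : "yes");
	return 0;
}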
Re: Hash algorithm analysis
On Mon, Jun 11, 2018 at 12:29 PM Jonathan Nieder wrote:
>
> Yves Orton and Linus Torvalds prefer[5] SHA3 over SHA2 because of how
> it is constructed.

Yeah, I really think that it's a mistake to switch to something that has the same problem SHA1 had.

That doesn't necessarily mean SHA3, but it does mean "bigger intermediate hash state" (so no length extension attack), which could be SHA3, but also SHA-512/256 or K12.

Honestly, git has effectively already moved from SHA1 to SHA1DC. So the actual known attack and weakness of SHA1 should simply not be part of the discussion for the next hash. You can basically say "we're _already_ on the second hash, we just picked one that was so compatible with SHA1 that nobody even really noticed".

The reason to switch is

 (a) 160 bits may not be enough

 (b) maybe there are other weaknesses in SHA1 that SHA1DC doesn't catch.

 (c) others?

Obviously all of the choices address (a).

But at least for me, (b) makes me go "well, SHA2 has the exact same weak inter-block state attack, so if there are unknown weaknesses in SHA1, then what about unknown weaknesses in SHA2"?

And no, I'm not a cryptographer. But honestly, length extension attacks were how both md5 and sha1 were broken in practice, so I'm just going "why would we go with a crypto choice that has that known weakness? That's just crazy".

From a performance standpoint, I have to say (once more) that crypto performance actually mattered a lot less than I originally thought it would. Yes, there are phases that do care, but they are rare.

For example, I think SHA1 performance has probably mattered most for the index and pack-file, where it's really only used as a fancy CRC. For most individual object cases, it is almost never an issue.

From a performance angle, I think the whole "256-bit hashes are bigger" is going to be the more noticeable performance issue, just because things like delta compression and zlib - both of which are very *real* and present performance issues - will have more data that they need to work on. The performance difference between different hashing functions is likely not all that noticeable in most common cases as long as we're not talking orders of magnitude.

And yes, I guess we're in the "approaching an order of magnitude" performance difference, but we should actually compare not to OpenSSL SHA1, but to SHA1DC. See above.

Personally, the fact that the Keccak people would suggest K12 makes me think that should be a front-runner, but whatever. I don't think the 128-bit preimage case is an issue, since 128 bits is the brute-force cost for any 256-bit hash.

But hey, I picked sha1 to begin with, so take any input from me with that historical pinch of salt in mind ;)

             Linus
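[ Editor's reference, for the hashes named above. State and output widths are from the published specifications; the narrow/wide framing is the one used in these messages. ]

    hash          internal state      digest      pipe
    SHA-1         160 bits            160 bits    narrow
    SHA-256       256 bits            256 bits    narrow
    SHA-512/256   512 bits            256 bits    wide (truncated output)
    SHA3-256      1600-bit sponge     256 bits    wide (capacity 512 bits)
    K12           1600-bit sponge     256 bits*   wide (12-round Keccak-p)

    (*) KangarooTwelve is an extendable-output function; 256 bits is a
        typical digest length rather than a fixed one.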
Re: [RFC] git gc "--prune=now" semantics considered harmful
On Fri, Jun 1, 2018 at 2:04 AM Jeff King wrote:
>
> We'd also accept relative times like "5.minutes.ago" (in fact, the
> default is a relative 2.weeks.ago, though it's long enough that the
> difference between "2 weeks" and "2 weeks plus 5 minutes" may not matter
> much). So we probably ought to just normalize _everything_ without even
> bothering to match "now". It's a noop for non-relative times, but that's
> OK.

Heh. That's what I tried to do at first. It's surprisingly hard.

You can't normalize it as a date, because we have a few times that aren't expressible as dates because they are just the maximum value (ie "all").

And then I tried to just express it just as a standard numerical value, which we accept on input. But we *only* accept that if it's more than eight digits.

And regardless, you need to special-case "now", since parse_expiry_date() special-cases it. Or you'd need to just make another version of parse_expiry_date() entirely.

End result: I only special-cased "now".

> There are still possibilities for a race, even with the grace period.

Agreed. But you *really* have to work at it.

             Linus
Re: [RFC] git gc "--prune=now" semantics considered harmful
On Sat, May 26, 2018 at 4:31 PM Junio C Hamano wrote:
>
> *That* is something I don't do. After all, I am fully aware that I
> have started end-of-day ritual by that time, so I won't even look at
> a new patch (or a pull request for that matter).

Sounds like you're more organized about the end-of-day ritual than I am. For me the gc is not quite so structured.

> I however have to wonder if there are opposite "oops" end-user
> operation we also need to worry about, i.e. we are doing a large-ish
> fetch, and get bored and run a gc from another terminal. Perhaps
> *that* is a bit too stupid to worry about? Auto-gc deliberately
> does not use 'now' because it wants to leave a grace period to avoid
> exactly that kind of race.

For me, a "pull" never takes that long. Sure, any manual merging and the writing of the commit message might take a while, but it's "foreground" activity for me, I'd not start a gc in the middle of it.

So at least to me, doing "git fsck --full" and "git gc --prune=now" are somewhat special because they take a while and tend to be background things that I "start and forget" about (the same way I sometimes start and forget a kernel build).

Which is why that current "git gc --prune=now" behavior seems a bit dangerous.

             Linus
[RFC] git gc "--prune=now" semantics considered harmful
So this is an RFC patch. I'm not sure how much people really care, but I find the current behavior of "git gc --prune=now" to be unnecessarily dangerous.

There's two issues with it:

 (a) parse_expiry_date() considers "now" to be special, and it actually
     doesn't mean "now" at all, it means "everything".

 (b) the date parsing isn't actually done "now", it's done *after* gc has
     already run, and we run "git prune --expire". So even if (a) wasn't
     true, "--prune=now" wouldn't actually mean "now" when the user expects
     it to happen, but "after doing repacking".

I actually think that the "parse_expiry_date()" behavior makes sense within the context of "git prune --expire", so I'm not really complaining about (a) per se. I just think that what makes sense within the context of "git prune" does *not* necessarily make sense within the context of "git gc".

Why do I care?

I end up doing lots of random things in my local repository, and I also end up wanting to keep my repository fairly clean, so I tend to do "git gc --prune=now" to just make sure everything is packed and I've gotten rid of all the temporary stuff that so often happens when doing lots of git merges (which is what I do).

You won't see those temporary objects for the usual trivial merges, but whenever you have a real recursive merge with automated conflict resolution, there will be things like those temporary merge-only objects for the 3-way base merge state.

Does my use pattern of "git gc --prune=now" make sense? Maybe not. But it's what I've gotten used to, and it's at least not entirely insane.

But at least once now, I've done that "git gc" at the end of the day, and a new pull request comes in, so I do the "git pull" without even thinking about the fact that "git gc" is still running.

And then the "--prune=now" behavior is actually really pretty dangerous. Because it will prune *all* unreachable objects, even if they are only *currently* unreachable because they are in the process of being unpacked by the concurrent "git fetch" (and I didn't check - I might just have been unlucky, but I think "git prune" ignores FETCH_HEAD).

So I actually would much prefer that for git gc, "--prune=now" means

 (a) "now"

 (b) now at the _start_ of the "git gc" operation, not the time at the _end_
     of the operation when we've already spent a minute or two doing
     repacking and are now doing the final pruning.

Anyway, with that explanation in mind, I'm appending a patch that is pretty small and does that. It's a bit hacky, but I think it still makes sense.

Comments?

Note that this really isn't likely very noticeable on most projects. When I do "git gc" on a fairly well-packed repo of git itself, it takes under 4s for me. So the window for that whole "do git pull at the same time" is simply not much of an issue. For the kernel, "git gc" takes a minute and a half on the same machine (again, this is already a packed repo, it can be worse).
So there's a much bigger window there to do something stupid,

             Linus

---
 builtin/gc.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index c4777b244..98368c8b5 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -535,8 +535,12 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	if (argc > 0)
 		usage_with_options(builtin_gc_usage, builtin_gc_options);
 
-	if (prune_expire && parse_expiry_date(prune_expire, &dummy))
-		die(_("failed to parse prune expiry value %s"), prune_expire);
+	if (prune_expire) {
+		if (!strcmp(prune_expire, "now"))
+			prune_expire = show_date(time(NULL), 0, DATE_MODE(ISO8601));
+		if (parse_expiry_date(prune_expire, &dummy))
+			die(_("failed to parse prune expiry value %s"), prune_expire);
+	}
 
 	if (aggressive) {
 		argv_array_push(&repack, "-f");
Re: Silly "git gc" UI issue.
On Wed, Apr 18, 2018 at 7:16 PM, Junio C Hamano wrote:
> A few commands that parse --expire= command line option
> behaves silly when given nonsense input. For example

So this patch definitely improves on the error message.

But look at what happens for the kernel:

    [torvalds@i7 linux]$ time git gc --prune=npw
    Counting objects: 6006319, done.
    Delta compression using up to 8 threads.
    Compressing objects: 100% (912166/912166), done.
    Writing objects: 100% (6006319/6006319), done.
    Total 6006319 (delta 5050577), reused 6006319 (delta 5050577)
    fatal: malformed expiration date 'npw'
    error: failed to run prune

    real    1m4.376s
    user    0m59.963s
    sys     0m5.182s

Yes, I get that nice "malformed expiration date 'npw'" error, but I get it after 64 seconds has passed.

So I think builtin/gc.c should use this same parse_expiry_date() / parse_opt_expiry_date_cb() thing for its timestamp parsing. It does actually seem to do that for the gc_log_expire value that it loads from the config file.

Maybe something like the attached patch? Then I get:

    [torvalds@i7 linux]$ time git gc --prune=npw
    fatal: Failed to parse prune expiry value npw

    real    0m0.004s
    user    0m0.002s
    sys     0m0.002s

and you could smush it into your commit (if you want my sign-off, take it)

             Linus

 builtin/gc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/builtin/gc.c b/builtin/gc.c
index 3e67124ea..a4b20aaaf 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -354,6 +354,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	const char *name;
 	pid_t pid;
 	int daemonized = 0;
+	timestamp_t dummy;
 
 	struct option builtin_gc_options[] = {
 		OPT__QUIET(&quiet, N_("suppress progress reporting")),
@@ -392,6 +393,9 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	if (argc > 0)
 		usage_with_options(builtin_gc_usage, builtin_gc_options);
 
+	if (parse_expiry_date(prune_expire, &dummy))
+		die(_("Failed to parse prune expiry value %s"), prune_expire);
+
 	if (aggressive) {
 		argv_array_push(&repack, "-f");
 		if (aggressive_depth > 0)
Re: Silly "git gc" UI issue.
On Wed, Apr 18, 2018 at 6:52 PM, Junio C Hamano wrote:
>
> Regardless of your original "git gc" issue, we should make "prune"
> say something on this error. And when we do so, I would think that
> error message will come before the final "error: failed to run
> prune".

So to me, the real failure is the fact that it spent a lot of time packing my repository before it then failed the prune at the end.

I don't actually mind the quality of the error message too much - although it could be improved. I mind the "oh, goddamnit, you just spent over a minute of CPU time on my fairly high-end desktop, and _then_ you decided to tell me that I'm a moron and couldn't type 'now' correctly".

So to me, the big deal would be that builtin/gc.c should validate the date *before* it starts, instead of doing all that work, and then executing "git prune" with invalid arguments.

             Linus
Silly "git gc" UI issue.
Ok, this is ridiculous, but I've done it several times, so I thought I'd finally mention it to somebody on the git list that may care:

"My name is Linus, and I'm a klutz".

What does that have to do with anything? Now, imagine you're a klutz. Imagine you want to clean up your .git directory. Combine those things, and what do you get?

You get this:

    git gc --prune=npw

Yeah, that "npw" should be "now", which is where the klutz thing comes in.

It turns out that git reacts ridiculously badly to this. I'm just assuming that everybody else is scarily competent if I'm the first to have reported this.

             Linus
Re: Optimizing writes to unchanged files during merges?
On Sun, Apr 15, 2018 at 6:44 PM, Junio C Hamanowrote: > > I think Elijah's corrected was_tracked() also does not care "has > this been renamed". I'm perfectly happy with the slightly smarter patches. My patch was really just an RFC and because I had tried it out. > One thing that makes me curious is what happens (and what we want to > happen) when such a "we already have the changes the side branch > tries to bring in" path has local (i.e. not yet in the index) > changes. For a dirty file that trivially merges (e.g. a path we > modified since our histories forked, while the other side didn't do > anything, has local changes in the working tree), we try hard to > make the merge succeed while keeping the local changes, and we > should be able to do the same in this case, too. I think it might be nice, but probably not really worth it. I find the "you can merge even if some files are dirty" to be really convenient, because I often keep stupid test patches in my tree that I may not even intend to commit, and I then use the same tree for merging. For example, I sometimes end up editing the Makefile for the release version early, but I won't *commit* that until I actually cut the release. But if I pull some branch that has also changed the Makefile, it's not worth any complexity to try to be nice about the dirty state. If it's a file that actually *has* been changed in the branch I'm merging, and I'm more than happy to just stage the patch (or throw it away - I think it's about 50:50 for me). So I don't think it's a big deal, and I'd rather have the merge fail very early with "that file has seen changes in the branch you are merging" than add any real complexity to the merge logic. Linus
Re: Optimizing writes to unchanged files during merges?
On Fri, Apr 13, 2018 at 10:39 AM, Stefan Bellerwrote: > > Would s/read/xread/ make sense in working_tree_matches ? Makes sense, yes. That patch was really more of a RFD than anything that should be applied. I would like to see the "might be same" flag pushed down so that we don't do this comparison when it makes no sense, even if the stat info makes that less of an issue than it might otherwise be, And maybe that whole function should be taught to do the "verify CE against file", and do the proper full cache state check (ie time etc) instead of doing the read. So maybe the best option is a combination of the two patches: my "verify the working tree state" _together_ with the was_tracked() logic that looks up a cache entry. Again, the problem with "look up the index state" is about files possibly being renamed as part of the merge, but if we then check the index state against the actual contents of the working directory, that isn't an issue. Linus
Re: Optimizing writes to unchanged files during merges?
On Fri, Apr 13, 2018 at 12:02 AM, Elijah Newrenwrote: > > I hope you don't mind me barging into your conversation I was getting tired of my own rambling anyway.. > However, it turns out we have this awesome function called > "was_tracked(const char *path)" that was intended for answering this > exact question. So, assuming was_tracked() isn't buggy, the correct > patch for this problem would look like: Apparently that causes problems, for some odd reason. I like the notion of checking the index, but it's not clear that the index is reliable in the presence of renames either. > A big series > including that patch was merged to master two days ago, but > unfortunately that exact patch was the one that caused some > impressively awful fireworks[1]. Yeah, so this code is fragile. How about we take a completely different approach? Instead of relying on fragile (but clever) tests, why not rely on stupid brute force? Yeah, yeah, it's bad to be stupid, but sometimes simple and stupid really does work. See the attached patch. It gets rid of all the subtle "has this been renamed" tests entirely, and just _always_ does that final update_file(). But instead, it makes update_file() a bit smarter, and says "before we write this file out, let's see if it's already there and already has the expected contents"? Now, it really shouldn't be _quite_ as stupid as that: we should probably set a flag in the "hey, the oid matches, maybe it's worth checking", so that it doesn't do the check in the cases where we know the merge has done things. But it's actually *fairly* low cost, because before it reads the file it at least checks that file length matches the expected length (and that the user permission bits match the expected mode). So if the file doesn't match, most of the time the real cost will just be an extra 'open/fstat/close()' sequence. That's pretty cheap. So even the completely stupid approach is probably not too bad, and it could be made smarter. NOTE! I have *NOT* tested this on anything interesting. I tested the patch on my stupid test-case, but that'[s it. I didn't even bother re-doing the kernel merge that started this. Comments? Because considering the problems this code has had, maybe "stupid" really is the right approach... [ Ok, I lied. I just tested this on the kernel merge. 
It worked fine, and avoided modifying ] Linus merge-recursive.c | 42 +- 1 file changed, 29 insertions(+), 13 deletions(-) diff --git a/merge-recursive.c b/merge-recursive.c index 0c0d48624..ed2200065 100644 --- a/merge-recursive.c +++ b/merge-recursive.c @@ -815,6 +815,32 @@ static int make_room_for_path(struct merge_options *o, const char *path) return err(o, msg, path, _(": perhaps a D/F conflict?")); } +static int working_tree_matches(const char *path, const char *buf, unsigned long size, unsigned mode) +{ + int fd, matches; + struct stat st; + + fd = open(path, O_RDONLY); + if (fd < 0) + return 0; + matches = 0; + if (!fstat(fd, ) && st.st_size == size && S_ISREG(st.st_mode) && !(0700 & (st.st_mode ^ mode))) { + char tmpbuf[1024]; + while (size) { + int n = read(fd, tmpbuf, sizeof(tmpbuf)); + if (n <= 0 || n > size) +break; + if (memcmp(tmpbuf, buf, n)) +break; + buf += n; + size -= n; + } + matches = !size; + } + close(fd); + return matches; +} + static int update_file_flags(struct merge_options *o, const struct object_id *oid, unsigned mode, @@ -856,6 +882,8 @@ static int update_file_flags(struct merge_options *o, size = strbuf.len; buf = strbuf_detach(, NULL); } + if (working_tree_matches(path, buf, size, mode)) +goto free_buf; } if (make_room_for_path(o, path) < 0) { @@ -1782,20 +1810,8 @@ static int merge_content(struct merge_options *o, if (mfi.clean && !df_conflict_remains && oid_eq(, a_oid) && mfi.mode == a_mode) { - int path_renamed_outside_HEAD; output(o, 3, _("Skipped %s (merged same as existing)"), path); - /* - * The content merge resulted in the same file contents we - * already had. We can return early if those file contents - * are recorded at the correct path (which may not be true - * if the merge involves a rename). - */ - path_renamed_outside_HEAD = !path2 || !strcmp(path, path2); - if (!path_renamed_outside_HEAD) { - add_cacheinfo(o, mfi.mode, , path, - 0, (!o->call_depth), 0); - return mfi.clean; - } + /* We could set a flag here and pass it to "update_file()" */ } else output(o, 2, _("Auto-merging %s"), path);
Re: Optimizing writes to unchanged files during merges?
[ Still talking to myself. Very soothing. ] On Thu, Apr 12, 2018 at 4:55 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: > [ Talking to myself ] > > Did it perhaps mean to say > > path_renamed_outside_HEAD = path2 && !strcmp(path, path2); > > instead? Probably not correct, but making that change makes my test-case work. It probably breaks something important, though. I didn't look at why that code happened in the first place. But it's nice to know I'm at least looking at the right code while I'm talking to myself. Linus
Re: Optimizing writes to unchanged files during merges?
[ Talking to myself ] On Thu, Apr 12, 2018 at 4:41 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > Oddly, that *already* has the check: > > if (mfi.clean && !df_conflict_remains && > oid_eq(&mfi.oid, a_oid) && mfi.mode == a_mode) { > int path_renamed_outside_HEAD; > output(o, 3, _("Skipped %s (merged same as existing)"), path); > > but that doesn't seem to actually trigger for some reason. Actually, that check triggers just fine. > But the code really seems to have the _intention_ of skipping the case > where the result ended up the same as the source. > > Maybe I'm missing something. The later check that does /* * The content merge resulted in the same file contents we * already had. We can return early if those file contents * are recorded at the correct path (which may not be true * if the merge involves a rename). */ path_renamed_outside_HEAD = !path2 || !strcmp(path, path2); if (!path_renamed_outside_HEAD) { will see that 'path2' is NULL, and not trigger this early out case, and then this all falls back to the normal cases after all. So I think that 'path_renamed_outside_HEAD' logic is somehow broken. Did it perhaps mean to say path_renamed_outside_HEAD = path2 && !strcmp(path, path2); instead? See commit 5b448b853 ("merge-recursive: When we detect we can skip an update, actually skip it") which really implies we want to actually skip it (but then we don't anyway). Also see commit b2c8c0a76 ("merge-recursive: When we detect we can skip an update, actually skip it") which was an earlier version, and which *actually* skipped it, but it was reverted because of that rename issue. Adding Elijah Newren to the cc, because he was working on this back then, and is still around, and still working on merging ;) Linus
Re: Optimizing writes to unchanged files during merges?
On Thu, Apr 12, 2018 at 4:35 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > in process_entry(), and I think we could just there add a test for if > o_old,o_mod == a_oid,a_mode or something? Actually, not process_entry, but merge_content(). Oddly, that *already* has the check: if (mfi.clean && !df_conflict_remains && oid_eq(&mfi.oid, a_oid) && mfi.mode == a_mode) { int path_renamed_outside_HEAD; output(o, 3, _("Skipped %s (merged same as existing)"), path); but that doesn't seem to actually trigger for some reason. But the code really seems to have the _intention_ of skipping the case where the result ended up the same as the source. Maybe I'm missing something. Linus
Re: Optimizing writes to unchanged files during merges?
On Thu, Apr 12, 2018 at 4:17 PM, Junio C Hamanowrote: > > A bit of detour. "Change in side branch happened to be a subset of > the change in trunk and gets subsumed, but we end up writing the > same result" happens also with the simpler resolve strategy. > > Here is a fix. Heh. Except git-merge-one-file.sh is *only* used for that case, so this doesn't fix the normal case. I verified that the normal case situation is that "update_file_flags()" thing that checks out the end result. It's called by this case: } else if (a_oid && b_oid) { /* Case C: Added in both (check for same permissions) and */ /* case D: Modified in both, but differently. */ clean_merge = merge_content(o, path, o_oid, o_mode, a_oid, a_mode, b_oid, b_mode, NULL); in process_entry(), and I think we could just there add a test for if o_old,o_mod == a_oid,a_mode or something? Linus
Re: Optimizing writes to unchanged files during merges?
On Thu, Apr 12, 2018 at 2:46 PM, Junio C Hamanowrote: > > Thanks for a clear description of the issue. It does sound > interesting. I decided to show it with a simpler case that could be scripted and doesn't need the kernel. NOTE! This obviously doesn't happen for files with the trivial merge cases (ie the ones that don't require any real merging at all). There *is* a three-way merge going on, it's just that the end result is the same as the original file. Example script just appended here at the end. NOTE: The script uses "ls -l --full-time", which afaik is a GNU ls extension. So the script below is not portable. Linus --- # Create throw-away test repository mkdir merge-test cd merge-test git init # Create silly baseline file with 10 lines of numbers in it for i in $(seq 1 10); do echo $i; done > a git add a git commit -m"Original" # Make a branch that changes '9' to 'nine' git checkout -b branch sed -i 's/9/nine/' a git commit -m "Nine" a # On the master, change '2' to 'two' _and_ '9' to 'nine' git checkout master sed -i 's/9/nine/' a sed -i 's/2/two/' a git commit -m "Two and nine" a # sleep to show the time difference sleep 1 # show the date on 'a' and do the merge ls -l --full-time a git merge -m "Merge contents" branch # The merge didn't change the contents, but did rewrite the file ls -l --full-time a
Optimizing writes to unchanged files during merges?
So I just had an interesting experience that has happened before too, but this time I decided to try to figure out *why* it happened. I'm obviously in the latter part of the kernel merge window, and things are slowly calming down. I do the second XFS merge during this window, and it brings in updates just to the fs/xfs/ subdirectory, so I expect that my test build for the full kernel configuration should be about a minute. Instead it recompiles pretty much *everything*, and the test build takes 22 minutes. This happens occasionally, and I blame gremlins. But this time I decided to look at what the gremlins actually *are*. The diff that git shows for the pull was very clear: only fs/xfs/ changes. But doing ls -tr $(git ls-files '*.[chS]') | tail -10 gave the real reason: in between all the fs/xfs/xyz files was this: include/linux/mm.h and yeah, that rather core header file causes pretty much everything to be re-built. Now, the reason it was marked as changed is that the xfs branch _had_ in fact changed it, but the changes were already upstream and got merged away. But the file still got written out (with the same contents it had before the merge), and 'make' obviously only looks at modification time, so make rebuilt everything. Now, because it's still the merge window, I don't have much time to look into this, but I was hoping somebody in git land would like to give it a quick look. I'm sure I'm not the only one to have ever been hit by this, and I don't think the kernel is the only project to hit it either. Because it would be lovely if the merging logic would just notice "oh, that file doesn't change", and not even write out the end result. For testing, the merge that triggered this git introspection is kernel commit 80aa76bcd364 ("Merge tag 'xfs-4.17-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux"), which can act as a test-case. It's a clean merge, so no kernel knowledge necessary: just re-merge the parents and see if the modification time of include/linux/mm.h changes. I'm guessing some hack in update_file_flags() to say "if the new object matches the old one, don't do anything" might work. Although I didn't look if we perhaps end up writing to the working tree copy earlier already. Looking at the blame output of that function, most of it is really old, so this "problem" goes back a long long time. Just to clarify: this is absolutely not a correctness issue. It's not even a *git* performance issue. It's literally just a "not updating the working tree when things don't change results in better behavior for other tools". So if somebody gets interested in this problem, that would be great. And if not, I'll hopefully remember to come back to this next week when the merge window is over, and take a second look. Linus
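A hedged sketch of the test-case described above, using the merge commit id Linus gives (the branch name and the GNU "ls --full-time" flag are only for illustration):

    git checkout -b remerge-test 80aa76bcd364^1
    ls --full-time include/linux/mm.h
    git merge --no-edit 80aa76bcd364^2
    ls --full-time include/linux/mm.h   # the mtime changes even though the contents do not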
Re: [PATCH 3/3] Makefile: optionally symlink libexec/git-core binaries to bin/git
On Mon, Mar 19, 2018, 04:34 Johannes Schindelinwrote: > > This is a real problem. No it isn't. We already handle those special cases specially, and install them in the bin directory (as opposed to libexec). And it all works fine. Look into the bin directory some day. You'll find things like git-cvsserver gitk git-receive-pack git-shell git-upload-archive git-upload-pack there, and the fact that a couple of them happen to be built-ins is an IMPLEMENTATION DETAIL, not a "Oh we should have used just 'git' for them". The design of having separate programs is the *good* conceptual design. And we damn well should keep it for these things that are used for special purposes. The fact that two of them have become built-ins as part of the git binary is incidental. It shouldn't be visible in the names, because it really is just an internal implementation thing, not anything fundamental. > And it is our own darned fault because we let an > implementation detail bleed into a protocol. We could have designed that a > lot better. And by "we" you clearly mean "not you", and by "we could have designed that a lot better" you must mean "and it was very well designed by competent people who didn't use bad operating systems". > Of course we should fix this, though. There is literally no good reason Go away. We shouldn't fix it, it's all fine as-is, and there were tons of f*cking good reasons for why git did what it did. The main one being "it's a collection of scripts", which was what git _was_, for chrissake. And using spaces and running some idiotic and hard-to-verify script de-multiplexer is the WRONG THING for things like "git-shell" and "git-receive-pack" and friends. Right now you can actually verify exactly what "git-shell" does. Or you could replace - or remove - it entirely if you don't like it. And never have to worry about running "git" with some "shell" subcommand. And you know that it's not an alias, for example. Because "git-xyz" simply does not look up aliases. So really. Go away, Johannes. Your concerns are complete and utter BS. The real problem is that Windows is badly designed, but since it's easy to work around (by using hard-linking on Windows), nobody sane cares. The solution is simple, and was already suggested: use symlinks (like we used to!) on non-windows systems. End of story. And for the libexec thing, we might want to deprecate those names, if somebody wants to, but it's not like it actually hurts, and it gives backwards compatibility. Btw, real Windows people know all about backwards compatibility. Ask around competent people inside MS whether it's an important thing. So stop this idiotic "bad design" crap. Somebody working on Windows simply can't afford your attitude. Somebody who didn't design it in the first place can't afford your attitude. Linus
Re: [PATCH 3/3] Makefile: optionally symlink libexec/git-core binaries to bin/git
On Thu, Mar 15, 2018 at 10:05 AM, Johannes Schindelinwrote: > The most sensible thing, of course, would be to *not* link the builtins at > all. I mean, we deprecated the dashed form (which was a design mistake, > whether you admit it or not) a long time ago. That's probably not a bad idea for the builtin commands. At least as an option. We do end up still using the dashed form for certain things, but they are already special-cased (ie things like "git-receive-pack" and "git-shell" that very much get executed directly, and for fundamental reasons). As to it being a design mistake? No, not really. It made a lot of sense at the time. The fact is, the problem is Windows, not git. I know you have your hangups, but that's your problem. Linus
Re: [PATCH 3/3] Makefile: optionally symlink libexec/git-core binaries to bin/git
On Wed, Mar 14, 2018 at 3:14 AM, Ævar Arnfjörð Bjarmasonwrote: > On Wed, Mar 14 2018, Johannes Sixt jotted: >> >> It is important to leave the default at hard-linking the binaries, >> because on Windows symbolic links are second class citizens (they >> require special privileges and there is a distinction between link >> targets being files or directories). Hard links work well. > > Yeah makes sense. I just want to add this as an option, and think if > it's proven to be un-buggy we could probably turn it on by default on > the *nix's if people prefer that, but yeah, we'll definitely need the > uname detection. I definitely would prefer to make symlinks the default on unix. It's what we used to do (long long ago), and as you pointed out, it's a lot clearer what's going on too when you don't have to look at inode numbers and link counts. Forcing hardlinking everywhere by default just because Windows filesystems suck donkey ass through a straw is not the right thing either. Linus
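As a rough illustration of what "look at inode numbers and link counts" means in practice: with a hard-linked install the two paths below report the same inode, while a symlinked install shows the link target directly (paths resolved via standard git and shell helpers):

    ls -li "$(git --exec-path)/git-status" "$(command -v git)"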
Re: [PATCH] revision: drop --show-all option
On Wed, Feb 21, 2018 at 3:27 PM, Jeff Kingwrote: > > We'll skip the usual deprecation period because this was > explicitly a debugging aid that was never documented. Ack. I don't think I've used it since, and probably nobody else ever used it. Linus
Re: linux-next: unnecessary merge in the v4l-dvb tree
On Tue, Feb 13, 2018 at 9:18 AM, Junio C Hamano wrote: > > That makes me wonder if another heuristic I floated earlier is more > appropriate. When merging a tag object T, if refs/tags/T exists and > it is that tag object, then an updated "merge" would default to "--ff"; > otherwise, it would keep the current default of creating a merge even > when we could fast-forward, in order to record that tag T in the > resulting history. Oooh. Yes, that sounds like the right thing to do. So the "no fast-forward" logic would trigger only if the name we used for merging is one of the temporary ones (ie .git/{FETCH,MERGE}_HEAD), not if the mentioned tag is already a normal tag reference. Then it's very explicitly about "don't lose the signing information". I'd still have to teach people to use "--ff-only" if they don't do the "fetch and merge" model but literally just do "git pull upstream vX.Y", but at least the case Mauro describes would automatically just DTRT. Me likey. Linus
Re: linux-next: unnecessary merge in the v4l-dvb tree
On Mon, Feb 12, 2018 at 1:44 PM, Junio C Hamanowrote: > > But I wonder why "update to upstream" is merging a signed tag in the > first place. Wouldn't downstream's "try to keep up with" pull be > grabbing from branch tips, not tags? I'm actually encouraging maintainers to *not* start their work on some random "kernel of the day". Particularly during the kernel merge window, the upstream master branch can be pretty flaky. It's *not* a good point to start new development on, so if you're a maintainer, you really don't want to use that as the basis for your work for the next merge window. So I encourage people to use a stable point for new development, and particularly actual release kernels. The natural way to do that is obviously just to create a new branch: git checkout -b topicbranch v4.15 and now you have a good new clean branch for your development. But clearly we've got a few people who have gotten used to the whole "git pull" convenience of both fetching and updating. Some maintainers don't even use topic branches, because their main work is just merging work by subdevelepoers (that goes for my own tree: I use topic branches for merging patch queues and to occasionally track my own experimental patches, but 99% of the time I'm just on "master" and obviously pull other peoples branches). And some maintainers end up using multiple repositories as branches (the old _original_ git model). Again, you can just use "git fetch + git reset", of course, but that's a bit unsafe. In contrast, doing "git pull --ff-only" is a safe convenient operation that does both the fetch and the update to whatever state. But you do need that "--ff-only" to avoid the merge. Linus
Re: linux-next: unnecessary merge in the v4l-dvb tree
On Mon, Feb 12, 2018 at 1:15 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > The reasoning is to avoid losing the signature from the tag (when > merging a signed tag, the signature gets inserted into the merge > commit itself - use "git log --show-signature" to see them). I think the commit that actually introduced the behavior was fab47d057: merge: force edit and no-ff mode when merging a tag object back in 2011, so we've had this behavior for a long time. So it's probably not worth tweaking the behavior any more, and maybe we need to educate people to not update to other people's state with "git pull". Maybe we could just tell people to have something like git config --global alias.update "pull --ff-only" and use that for "try to update to upstream". Linus
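Spelled out as commands (note the quoting, so that --ff-only ends up inside the alias); the pull.ff knob is a standard config option shown only as an alternative, not something from this thread:

    git config --global alias.update "pull --ff-only"
    git update                          # fast-forwards, or errors out instead of creating a merge
    git config --global pull.ff only    # alternatively, make fast-forward-only the default for every pull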
Re: linux-next: unnecessary merge in the v4l-dvb tree
On Mon, Feb 12, 2018 at 1:00 PM, Stephen Rothwellwrote: > > Linus, this happens a bit after the merge window, so I am wondering > about the rational of not doing a fast forward merge when merging a > signed tag (I forget the reasoning). The reasoning is to avoid losing the signature from the tag (when merging a signed tag, the signature gets inserted into the merge commit itself - use "git log --show-signature" to see them). So when I merge a signed tag, I do *not* want to fast-forward to the top commit, because then I'd lose the signature from the tag. Thus the "merging signed tags are non-fast-forward by default" reasoning. But, yes, that reasoning is really only valid for proper merges of new features, not for back-merges. The problem, of course, is that since git is distributed, git doesn't know who is "upstream" and who is "downstream", so there's no _technical_ difference between merging a development tree, and a development tree doing a back-merge of the upstream tree. Maybe it was a mistake to make signed tag merges non-fast-forward, since they cause these kinds of issues with people who use "pull" to update their otherwise unmodified trees. I can always teach myself to just use --no-ff, since I end up doing things like verifying at the signatures anyway. Junio, comments? Linus
Left-over COMMIT_EDITMSG file in gitdir
This may be intentional, but if so, it's not obvious.. Back long long ago, the original "git commit" shell script got rewritten in C. In that rewrite, removing some temporary files seems to have been left out. At least one: .git/COMMIT_EDITMSG. In the original commit.sh shell script, we can find this: rm -f "$GIT_DIR/COMMIT_MSG" "$GIT_DIR/COMMIT_EDITMSG" "$GIT_DIR/SQUASH_MSG" after creating the commit. In the C implementation, we do have a number of "unlink(...)" calls: unlink(git_path_cherry_pick_head()); unlink(git_path_revert_head()); unlink(git_path_merge_head()); unlink(git_path_merge_msg()); unlink(git_path_merge_mode()); unlink(git_path_squash_msg()); but no unlink(git_path_commit_editmsg()); and that *seems* to be an oversight. Similarly, builtin/tag.c leaves a stale TAG_EDITMSG file behind. Again, that actually did exist in the original shell script, which used to do trap 'rm -f "$GIT_DIR"/TAG_TMP* "$GIT_DIR"/TAG_FINALMSG "$GIT_DIR"/TAG_EDITMSG' 0 which caused that file to be removed at exit. I guess I really don't care much, but those two files struck me when I was doing a "git gc --prune=now" and looked at what was still left in the .git directory.. If this is all intentional, never mind. Linus
Re: [PATCH] enable core.fsyncObjectFiles by default
On Wed, Jan 17, 2018 at 3:52 PM, Theodore Ts'owrote: > > Well, let's be fair; this is something *ext3* got wrong, and it was > the default file system back them. I'm pretty sure reiserfs and btrfs did too.. Linus
Re: [PATCH] enable core.fsyncObjectFiles by default
On Wed, Jan 17, 2018 at 3:16 PM, Ævar Arnfjörð Bjarmason wrote: > > Or does overall FS activity and raw throughput (e.g. with an ISO copy) > matter more than general FS contention? Traditionally, yes. Also note that none of this is about "throughput". It's about waiting for a second or two when you do a simple "git commit" or similar. Also, hardware has changed. I haven't used a rotational disk in about 10 years now, and I don't worry about latencies so much any more. The fact that you get ~100k iops indicates that you probably aren't using those stupid platters of rust either. So I doubt you can even trigger the bad cases. Linus
Re: [PATCH] enable core.fsyncObjectFiles by default
On Wed, Jan 17, 2018 at 2:07 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > The original git design was very much to write each object file > without any syncing, because they don't matter since a new object file > - by definition - isn't really reachable. Then sync before writing the > index file or a new ref. .. actually, I think it originally sync'ed only before starting to actually remove objects due to repacking. The theory was that you can lose the last commit series or whatever, and have to git-fsck, and have to re-do it, but you could never lose long-term work. If your machine crashes, you still remember what you did just before the crash. That theory may have been more correct back in the day than it is now. People who use git might be less willing to treat it like a filesystem that you can fsck than I was back 10+ years ago. It's worth noting that the commit that Junio pointed to (from 2008) didn't actually change any behavior. It just allowed people who cared to change behavior. The original "let's not waste time on fsync every object write, because we can just re-create the state anyway" behavior goes back to 2005. Linus
Re: [PATCH] enable core.fsyncObjectFiles by default
On Wed, Jan 17, 2018 at 1:44 PM, Ævar Arnfjörð Bjarmasonwrote: > > I ran a small test myself on CentOS 7 (3.10) with ext4 data=ordered > on the tests I thought might do a lot of loose object writes: > > $ GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux > GIT_PERF_MAKE_OPTS="NO_OPENSSL=Y CFLAGS=-O3 -j56" ./run origin/master > fsync-on~ fsync-on p3400-rebase.sh p0007-write-cache.sh > [...] > Test > fsync-on~ fsync-on > > --- > 3400.2: rebase on top of a lot of unrelated changes > 1.45(1.30+0.17) 1.45(1.28+0.20) +0.0% > 3400.4: rebase a lot of unrelated changes without split-index > 4.34(3.71+0.66) 4.33(3.69+0.66) -0.2% > 3400.6: rebase a lot of unrelated changes with split-index > 3.38(2.94+0.47) 3.38(2.93+0.47) +0.0% > 0007.2: write_locked_index 3 times (3214 files) > 0.01(0.00+0.00) 0.01(0.00+0.00) +0.0% > >No impact. However I did my own test of running the test suite 10% >times with/without this patch, and it runs 9% slower: > > fsync-off: avg:21.59 21.50 21.50 21.52 21.53 21.54 21.57 21.59 21.61 > 21.63 21.95 > fsync-on: avg:23.43 23.21 23.25 23.26 23.26 23.27 23.32 23.49 23.51 > 23.83 23.88 That's not the thing you should check. Now re-do the test while another process writes to a totally unrelated a huge file (say, do a ISO file copy or something). That was the thing that several filesystems get completely and horribly wrong. Generally _particularly_ the logging filesystems that don't even need the fsync, because they use a single log for everything (so fsync serializes all the writes, not just the writes to the one file it's fsync'ing). The original git design was very much to write each object file without any syncing, because they don't matter since a new object file - by definition - isn't really reachable. Then sync before writing the index file or a new ref. But things have changed, I'm not arguing that the code shouldn't be made safe by default. I personally refuse to use rotating media on my machines anyway, largely exactly because of the fsync() issue (things like "firefox" started doing fsync on the mysql database for stupid things, and you'd get huge pauses). But I do think your benchmark is wrong. The case where only git does something is not interesting or relevant. It really is "git does something _and_ somebody else writes something entirely unrelated at the same time" that matters. Linus
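A crude way to reproduce the contention described above (my construction, not from the thread; the dd target and size are arbitrary): keep a big unrelated streaming write going while timing loose-object writes with and without fsync.

    dd if=/dev/zero of=/tmp/unrelated-big-file bs=1M count=4096 &
    time git -c core.fsyncObjectFiles=false commit --allow-empty -m "no fsync"
    time git -c core.fsyncObjectFiles=true commit --allow-empty -m "with fsync"
    wait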
Re: git pull
On Sun, Nov 19, 2017 at 7:37 PM, Junio C Hamanowrote: >> ... >> Which is simple. Just create a .git/hooks/prepare-commit-msg file that >> contains >> >> #!/bin/sh >> sed -i 's|ssh://gitolite.kernel.org/|git://git.kernel.org/|g' "$1" >> >> and make it executable, and git will do that commit message editing for you. > > This should work with any recent versions of Git (1.7.4.2 and > upwards), but it still is a workaround. Should we mark it as a > feature request in the Git land to record the URL you typed as-is in > the builtin/fetch.c::store_updated_refs() function, instead of the > one that was rewritten by the insteadOf mechanism? The main problem with the prepare-commit-msg thing is actually that is such a nasty hack, and it can change other things than just the remote name. Maybe "gitolite" might be mentioned in the shortlog of the merge, and then the sed script comes and edits that part too. It is really not a huge issue simply because those things don't really happen in real life, but it's the main thing about that prepare-commit-msg hook that makes me go "eww, what an ugly hack". But it's an ugly hack that just happens to work pretty well in practice. And yes, I looked at alternatives. In fact, I looked at a couple of other approaches: - the one you mention, namely to remember the original url, and use that instead - the one I'd actually prefer, which is to generalize the whole "insteadOf" to work in more situations. Why would I prefer that second one? It turns out that the "original" isn't actually necessarily what I'd want either. Several people send me pointers to "https://git.kernel.org/; and I prefer rewriting them to git:// just to be consistent. And now that people have started doing the "insteadOf", their pull requests end up containing that "git@gitolite" version of the URL, so again, I'd actually like to rewrite the url _anyway_ in the commit message. For example, for the kernel, the git.kernel.org address is very common, but it also has a very common path prefix, so almost all pull messages for the kernel have that "git://git.kernel.org/pub/scm/linux/kernel/git/" part in common, and I have occasioally felt that it's not adding a lot of value particularly as it shows up in shortlogs and gitk. I could change my own rules for the first line (instead of the "Merge tag xyz from git://..." maybe I should just have my human-legible version), but I have also considered just rewriting the url to something that shortens that very common thing. So maybe Merge tag 'sound-4.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound could be instead Merge tag 'sound-4.10-rc4' of git://kernel.org/../tiwai/sound which would keep the _important_ part, and shorten the boilerplate part away. But that kind of "insteadOf" would only be for the message, since the end result isn't actually a "real" URL at all, it's literally a rewritten shorthand. Again, I can do all of this with the sed script. But again, it's kind of wrong to do it on the whole message, when it's really only the url that it should affect. So it would potentially be nice to just have a generic "rewrite the url" model, where you can do it for remote fetches, but you could also do it for just the commit message, or you could do it for doing pushes (we do have that "pushinsteadof" already - exactly because you might want to pull and push from different versions, with the push having to use ssh). 
But, as you say: > It would probably need an update to "struct remote" to have new > fields, to teach remote.c::alias_all_urls() not to overwrite the > url[] (and pushurl[] merely for symmetry) fields, to add a field to > "struct transport" and teach transport.c::transport_get() to record > the original URL in it so that builtin/fetch.c::fetch_refs() can > give it to store_updated_refs() instead of the rewritten one. Yes, the existing "insteadOf" is actually hidden surprisingly deep in the remote code, and it's very non-obvious. That works ok for the pull and push case, but really not for just the message rewriting case (which doesn't happen as part of the pull, but as part of the "git merge" that then takes the FETCH_HEAD or MERGE_HEAD that contains the url, and parses that). Anyway, it's not a big deal. The sed script works. It's a bit hacky, but it does the job. I might have wished for a different model, but it's almost certainly not worth the effort unless somebody really gets fired up about this and decides they really want to do it. Linus
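For reference, the fetch/push-time rewriting Linus refers to lives under the standard url.<base>.insteadOf and url.<base>.pushInsteadOf keys; the kernel.org/gitolite hosts below only illustrate the setup described above:

    # fetch anonymously, but rewrite pushes to go over ssh:
    git config --global url."ssh://gitolite.kernel.org/".pushInsteadOf git://git.kernel.org/
    # or rewrite one fetch prefix to another outright:
    git config --global url."git://git.kernel.org/".insteadOf https://git.kernel.org/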
Re: RFC v3: Another proposed hash function transition plan
On Mon, Oct 2, 2017 at 7:00 AM, Jason Cooperwrote: > > Ahhh, so if I understand you correctly, you'd prefer SHA-256 over > SHA3-256 because it's more performant for your usecase? Well, that's a > completely different animal that cryptographic suitability. In almost all loads I've seen, zlib inflate() cost is a bigger deal than the crypto load. The crypto people talk about cycles per byte, but the deflate code is what usually takes the page faults and cache misses etc, and has bad branch prediction. That ends up easily being tens or thousands of cycles, even for small data. But it does obviously depend on exactly what you do. The Windows people saw SHA1 as costly mainly due to the index file (which is just a "fancy crc", and not even cryptographically important, and where the cache misses actually happen when doing crypto, not decompressing the data). And fsck and big initial checkins can have a very different profile than most "regular use" profiles. Again, there the crypto happens first, and takes the cache misses. And the crypto is almost certainly _much_ cheaper than just the act of loading the index file contents in the first place. It may show up on profiles fairly clearly, but that's mostly because crypto is *intensive*, not because crypto takes up most of the cycles. End result: honestly, the real cost on almost any load is not crypto or necessarily even (de)compression, even if those are the things that show up. It's the cache misses and the "get data into user space" (whether using "read()" or page faulting). Worrying about cycles per byte of compression speed is almost certainly missing the real issue. The people who benchmark cryptography tend to intentionally avoid the actual real work, because they just want to know the crypto costs. So when you see numbers like "9 cycles per byte" vs "12 cycles per byte" and think that it's a big deal - 30% performance difference! - it's almost certainly complete garbage. It may be 30%, but it is likely 30% out of 10% total, meaning that it's almost in the noise for any but some very special case. Linus
Re: [PATCH] diff: correct newline in summary for renamed files
On Wed, Sep 27, 2017 at 2:58 PM, Stefan Bellerwrote: > > Linus, I assumed your sign off for the original patch. Thanks for spotting. > > Adding the mode change to t4016 seems like the easiest way to test it. Looks good to me, and you don't need to give me authorship credit. Just a "Reported-by:" is fine by me. But yes, you can also consider the patch signed off by me, although with your test update, _most_ of the patch is yours so it feels kind of stupid to mark me as author. Linus
Re: diffstat summary mode change bug
On Wed, Sep 27, 2017 at 1:40 PM, Stefan Bellerwrote: > > I disagree with this analysis, as the fix you propose adds the > new line unconditionally, i.e. this code path would be broken > regardless of "show filename or not". Right. Because it is what we want. The old code (before that commit) used to have two different cases: fprintf(file, "%s mode change %06o => %06o%c", line_prefix, p->one->mode, p->two->mode, show_name ? ' ' : '\n'); ie if "show_name" was set, it would *not* print a newline, and print a space instead. But then on the very next line, it used to do: if (show_name) { write_name_quoted(p->two->path, file, '\n'); ie now it prints the filename, and then prints the newline. End result: it used to *always* print the newline. Either it printed it at the end of the mode (for the non-show_name case), or it printed it at the end of the filename (for the show_name case). Your patch removed the '\n' entirely. My patch makes it unconditional, which it was before your patch (it was "conditional" only in where it was printed, not _whether_ it was printed). > I wonder why our tests failed to tell us about this. > > Specifically we have t4100/t-apply-4.expect > mode change 100644 => 100755 t/t-basic.sh > mode change 100644 => 100755 t/test-lib.sh > which would seem to exercise this code path. That only tests "git apply --stat --summary". It doesn't test "git diff" at all. And the "mode change" printout is entirely different code (see apply.c vs diff.c). Linus
diffstat summary mode change bug
Current git shows file-mode changes incorrectly in the diffstat summary, as I just noted from a pull request I did on the kernel. The pull request *should* have resulted in a summary like this: ... 21 files changed, 247 insertions(+), 67 deletions(-) mode change 100644 => 100755 tools/testing/selftests/memfd/run_tests.sh create mode 100644 tools/testing/selftests/net/reuseaddr_conflict.c but instead that "mode change" line didn't have a newline at the end, and I got ... 21 files changed, 247 insertions(+), 67 deletions(-) mode change 100644 => 100755 tools/testing/selftests/memfd/run_tests.sh create mode 100644 tools/testing/selftests/net/reuseaddr_conflict.c (ok, linewrapping in this email may make that look wrong - but the "mode change" and the "create mode" are both on the same line.) Bisecting it got me: 146fdb0dfe445464fa438f3835557c58a01d85d7 is the first bad commit commit 146fdb0dfe445464fa438f3835557c58a01d85d7 Author: Stefan Beller Date: Thu Jun 29 17:07:05 2017 -0700 diff.c: emit_diff_symbol learns about DIFF_SYMBOL_SUMMARY Signed-off-by: Stefan Beller Signed-off-by: Junio C Hamano and the reason seems to be that the '\n' at the end got dropped as the old code was very confusing (the old code had two different '\n' cases for the "show filename or not"). I think the right fix is this whitespace-damaged trivial one-liner: diff --git a/diff.c b/diff.c index 3c6a3e0fa..653bb2e72 100644 --- a/diff.c +++ b/diff.c @@ -5272,6 +5272,7 @@ static void show_mode_change(struct diff_options *opt, struct diff_filepair *p, strbuf_addch(&sb, ' '); quote_c_style(p->two->path, &sb, NULL, 0); } + strbuf_addch(&sb, '\n'); emit_diff_symbol(opt, DIFF_SYMBOL_SUMMARY, sb.buf, sb.len, 0); strbuf_release(&sb); but somebody should double-check that. Linus
Re: RFC v3: Another proposed hash function transition plan
On Wed, Sep 13, 2017 at 6:43 AM, demerphqwrote: > > SHA3 however uses a completely different design where it mixes a 1088 > bit block into a 1600 bit state, for a leverage of 2:3, and the excess > is *preserved between each block*. Yes. And considering that the SHA1 attack was actually predicated on the fact that each block was independent (no extra state between), I do think SHA3 is a better model. So I'd rather see SHA3-256 than SHA256. Linus
Re: BUG: attempt to trim too many characters
Just reminding people that this issue would seem to still exist in current master.. It's trivial to show: [torvalds@i7 git]$ git bisect start [torvalds@i7 git]$ git bisect bad master [torvalds@i7 git]$ git bisect good master~5 Bisecting: 23 revisions left to test after this (roughly 5 steps) [f35a1d75b5ecefaef7b1a8ec55543c82883df82f] Merge branch 'rs/t3700-clean-leftover' into maint [torvalds@i7 git]$ git rev-parse --bisect fatal: BUG: attempt to trim too many characters (Note: I use "git rev-parse --bisect" to show it as an error on the command line, but normal people would obviously do "gitk --bisect" that then does that "git rev-parse" internally and shows a UI error box instead). Linus On Tue, Sep 5, 2017 at 3:03 PM, Jeff King <p...@peff.net> wrote: > On Tue, Sep 05, 2017 at 02:55:08PM -0700, Linus Torvalds wrote: > >> On Tue, Sep 5, 2017 at 2:50 PM, Jeff King <p...@peff.net> wrote: >> > >> > What version of git are you running? This should be fixed by 03df567fbf >> > (for_each_bisect_ref(): don't trim refnames, 2017-06-18) which is in >> > v2.14. >> >> I'm way more recent than 2.14. >> >> I'm at commit 238e487ea ("The fifth batch post 2.14") > > Ugh. Bitten again by the fact that rev-parse and revision.c implement > the same things in subtly different ways. > > This probably fixes it: > > diff --git a/builtin/rev-parse.c b/builtin/rev-parse.c > index 2bd28d3c08..9f24004c0a 100644 > --- a/builtin/rev-parse.c > +++ b/builtin/rev-parse.c > @@ -757,8 +757,8 @@ int cmd_rev_parse(int argc, const char **argv, const char > *prefix) > continue; > } > if (!strcmp(arg, "--bisect")) { > - for_each_ref_in("refs/bisect/bad", > show_reference, NULL); > - for_each_ref_in("refs/bisect/good", > anti_reference, NULL); > + for_each_fullref_in("refs/bisect/bad", > show_reference, NULL, 0); > + for_each_fullref_in("refs/bisect/good", > anti_reference, NULL, 0); > continue; > } > if (opt_with_value(arg, "--branches", )) {
Re: BUG: attempt to trim too many characters
On Tue, Sep 5, 2017 at 3:03 PM, Jeff Kingwrote: > > This probably fixes it: Yup. Thanks. Linus
Re: BUG: attempt to trim too many characters
On Tue, Sep 5, 2017 at 2:50 PM, Jeff Kingwrote: > > What version of git are you running? This should be fixed by 03df567fbf > (for_each_bisect_ref(): don't trim refnames, 2017-06-18) which is in > v2.14. I'm way more recent than 2.14. I'm at commit 238e487ea ("The fifth batch post 2.14") Linus
BUG: attempt to trim too many characters
I just got this with gitk --bisect while doing some bisection on my current kernel. It happens with "git rev-parse --bisect" too, but interestingly, "git log --bisect" works fine. I have not tried to figure out anything further, except that it was introduced by commit b9c8e7f2f ("prefix_ref_iterator: don't trim too much") and it happens with iter->iter0->refname = "refs/bisect/bad" iter->trim = 15 (where 15 is also the same as the length of that refname). Not having time to chase this down any more, since I'm busy with the merge window (and that annoying bisect) Linus
Re: [PATCH] strbuf: let strbuf_addftime handle %z and %Z itself
On Mon, Jun 12, 2017 at 2:10 PM, Jeff King wrote: > > I suspect nobody has complained because we generally encourage real > "-0800" names when specifying zones. That's what any sane person uses, and it's what SMTP requires. The timezone names are a (bad) joke. If a human can't understand them (and people don't), there's no point in thinking that a computer should either. So yes, timezones should strive very much to be numeric. The names are an incomprehensible mess. Don't rely on them, and try to avoid using them. Linus
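As a quick illustration of sticking to numeric offsets in git's own date handling (the format string is only an example; %z is the conversion the patch under discussion teaches strbuf_addftime to expand itself):

    git log -1 --date=format:'%Y-%m-%d %H:%M:%S %z' --pretty=format:'%ad%n'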
Re: Unaligned accesses in sha1dc
On Fri, Jun 2, 2017 at 1:17 PM, demerphqwrote: > Most hash function implementations have code like the following > (extracted and reduced from hv_macro.h in perl.git [which only > supports little-endian hash functions]): Yes. Please do *not* try to make things overly portable by adding random memcpy() functions. Yes, many compilers will do ok with it, and generate the right code (single load from an unaligned address). But many won't. Gcc completely screws it up in older versions on ARM, for example. Dereferencing an unaligned pointer may be "undefined" in some technical meaning, but it sure as hell isn't undefined in reality, and compilers that willfully do stupid things should not be catered to overly. Reality is a lot more important. And I think gcc can be tweaked to use "movbe" on x86 with the right magic incantation (ie not just __builtin_bswap32(), but also the switch to enable the new instructions). So having code to detect gcc and using __builtin_bswap32() would indeed be a good thing. Linus
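A minimal sketch of the kind of helper being argued for here - detect GCC/Clang and use __builtin_bswap32() on a plain load rather than hiding the access behind memcpy(). The function name and structure are my illustration, not git's actual sha1dc code:

    #include <stdint.h>

    static inline uint32_t load_be32(const void *p)
    {
    #if defined(__GNUC__) && defined(__BYTE_ORDER__) && \
        __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    	/* Unaligned dereference; the point above is that real compilers
    	 * and real hardware on the targets that matter handle this fine. */
    	return __builtin_bswap32(*(const uint32_t *)p);
    #else
    	/* Portable fallback: assemble the big-endian word by hand. */
    	const unsigned char *c = p;
    	return ((uint32_t)c[0] << 24) | ((uint32_t)c[1] << 16) |
    	       ((uint32_t)c[2] << 8) | (uint32_t)c[3];
    #endif
    }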
Re: git merges of tags
On Thu, May 18, 2017 at 4:23 PM, Stephen Rothwell wrote: > > Just a reminder that if you are merging Linus' tree (or any tree > really) via a tag, git was changed some time ago so that merging a tag > will not do a fast forward (there is a good reason for this - I just > can't recall it ATM). The reason is that when you merge a signed tag, git squirrels away the signature into the merge commit, so that you can see and verify the signage later (use "git log --show-signature" to see the signatures on the commits). If you fast-forward, there isn't any new commit to add the signing data to. > To do the fast forward, try "git merge <tag>^{}" ... A slightly simpler syntax might be just "tag^0", but yes, the "^{}" thing peels off any tags. > (unfortunately > doing "git merge --ff <tag>" also does not do a fast forward - it also > doesn't fail, it unexpectedly just creates a merge commit :-(). "--ff" is the default behavior, and means "allow fast forward", but note that it is about "allowing", not "forcing". You can use "--ff-only" to say that you will _only_ accept a fast-forward, and git will error out if it needs to create a merge. Linus
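Concretely (the tag name is just an example):

    git merge v4.17^{}          # peel the tag to the commit it points at, so a fast-forward is possible
    git merge v4.17^0           # the slightly simpler spelling mentioned above
    git merge --ff-only v4.17   # error out rather than silently create a merge commit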
Re: [TANGENT] run-command: use vfork instead of fork
On Tue, May 16, 2017 at 12:35 PM, Eric Wongwrote: > > Fwiw, most of the vfork preparation was already done by Brandon > and myself a few weeks ago, and cooking in pu. Oh, interesting. Was that done for vfork(), or is it for something else? Some of the changes seem almost overly careful. Is this for prep-work porting to some odd environment that doesn't really have a MMU at all? There's nothing fundamentally wrong with allocating memory after fork(). But yes, it looks like it helps the vfork case. Linus
Re: git rebase regression: cannot pass a shell expression directly to --exec
On Tue, May 16, 2017 at 1:12 PM, Johannes Schindelinwrote: >> >> I think it would be better to just >> >> (a) get rid of the magic strcspn() entirely >> >> (b) make the 'can we optimize this' test be simply just looking up >> 'argv[0]' in $PATH > > What about > > ABC=1 my-executable my-arg What about it? Do you have a binary like '/usr/bin/ABC=1 my-executable my-arg' or something? If so, it gets executed. If not, it would get passed off to "/bin/sh". That sounds a lot more obvious than "some random set of characters are magical". Linus
Re: git rebase regression: cannot pass a shell expression directly to --exec
On Tue, May 16, 2017 at 10:23 AM, Jeff Kingwrote: > > I think the logic here would be more like: > > 1. During prepare_shell_cmd(), even if we optimize out the shell call, > still prepare a fallback argv (since we can't allocate memory > post-fork). > > 2. In the forked child, if we get ENOENT from exec and cmd->use_shell > is set, then exec the fallback shell argv instead. Propagate its > results, even if it's 127. > > That still means we'd prefer a $PATH copy of a command to its shell > builtin variant, but that can't be helped (and I kind of doubt anybody > would care too much). I think it would be better to just (a) get rid of the magic strcspn() entirely (b) make the 'can we optimize this' test be simply just looking up 'argv[0]' in $PATH Hmm? If you have executables with special characters in them in your PATH, you have bigger issues. Also, if people really want to optimize the code that executes an external program (whether in shell or directly), I think it might be worth it to look at replacing the "fork()" with a "vfork()". Something like this - cmd->pid = fork(); + cmd->pid = (cmd->git_cmd || cmd->env) ? fork() : vfork(); might work (the native git_cmd case needs a real fork, and if we change the environment variables we need it too, but the other cases look like they might work with vfork()). Using vfork() can be hugely more efficient, because you don't have the extra page table copies and teardown, but also avoid a lot of possible copy-on-write faults. Linus
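A rough sketch of idea (b) above - deciding "can we skip the shell?" by looking argv[0] up in $PATH rather than scanning for shell metacharacters. This is my own illustration under that assumption, not git's run-command.c, and the names are made up:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Return 1 if "cmd" names an executable we could exec directly. */
    static int command_exists_in_path(const char *cmd)
    {
    	const char *path = getenv("PATH");
    	char candidate[4096];

    	if (strchr(cmd, '/'))
    		return access(cmd, X_OK) == 0;	/* explicit path: test it directly */

    	while (path && *path) {
    		const char *colon = strchr(path, ':');
    		int len = colon ? (int)(colon - path) : (int)strlen(path);

    		if (len > 0 && len + strlen(cmd) + 2 <= sizeof(candidate)) {
    			snprintf(candidate, sizeof(candidate), "%.*s/%s", len, path, cmd);
    			if (access(candidate, X_OK) == 0)
    				return 1;
    		}
    		path = colon ? colon + 1 : NULL;
    	}
    	return 0;
    }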
[PATCH] Fix 'git am' in-body header continuations
From: Linus Torvalds <torva...@linux-foundation.org> Date: Sat, 1 Apr 2017 12:14:39 -0700 Subject: [PATCH] Fix 'git am' in-body header continuations An empty line should stop any pending in-body headers, and start the actual body parsing. This also modifies the original test for the in-body headers to actually have a real commit body that starts with spaces, and changes the test to check that the long line matches _exactly_, and doesn't get extra data from the body. Fixes:6b4b013f1884 ("mailinfo: handle in-body header continuations") Cc: Jonathan Tan <jonathanta...@google.com> Cc: Jeff King <p...@peff.net> Signed-off-by: Linus Torvalds <torva...@linux-foundation.org> --- On Sun, 2 Apr 2017, Junio C Hamano wrote: > > And that is exactly your patch does. The change "feels" correct to > me. Ok, resent with the test-case for the original behavior changed to be stricter (so it fails without this fix), and with Signed-off lines etc. I didn't really test the test-case very much, but it seemed to fail without this patch (because the "Body test" thing from the body becomes part of the long first line), and passes with it. But somebody who is more used to the test-suite should double-check my stupid test edit. mailinfo.c| 7 ++- t/t4150-am.sh | 6 -- 2 files changed, 10 insertions(+), 3 deletions(-) diff --git a/mailinfo.c b/mailinfo.c index a489d9d0f..68037758f 100644 --- a/mailinfo.c +++ b/mailinfo.c @@ -757,8 +757,13 @@ static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line) assert(!mi->filter_stage); if (mi->header_stage) { - if (!line->len || (line->len == 1 && line->buf[0] == '\n')) + if (!line->len || (line->len == 1 && line->buf[0] == '\n')) { + if (mi->inbody_header_accum.len) { + flush_inbody_header_accum(mi); + mi->header_stage = 0; + } return 0; + } } if (mi->use_inbody_headers && mi->header_stage) { diff --git a/t/t4150-am.sh b/t/t4150-am.sh index 89a5bacac..44807e218 100755 --- a/t/t4150-am.sh +++ b/t/t4150-am.sh @@ -983,7 +983,9 @@ test_expect_success 'am works with multi-line in-body headers' ' rm -fr .git/rebase-apply && git checkout -f first && echo one >> file && - git commit -am "$LONG" --author="$LONG <l...@example.com>" && + git commit -am "$LONG + +Body test" --author="$LONG <l...@example.com>" && git format-patch --stdout -1 >patch && # bump from, date, and subject down to in-body header perl -lpe " @@ -997,7 +999,7 @@ test_expect_success 'am works with multi-line in-body headers' ' git am msg && # Ensure that the author and full message are present git cat-file commit HEAD | grep "^author.*l...@example.com" && - git cat-file commit HEAD | grep "^$LONG" + git cat-file commit HEAD | grep "^$LONG$" ' test_done -- 2.12.2.578.g5c4e54f4e
Re: Bug in "git am" when the body starts with spaces
On Fri, Mar 31, 2017 at 5:52 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > The continuation logic is oddly complex, and I can't follow the logic. > But it is completely broken in how it thinks empty lines are somehow > "continuations". The attached patch seems to work for me. Comments? The logic is fairly simple: if we encounter an empty line, and we have pending in-body headers, we flush the pending headers, and mark us as no longer in header mode. The code used to just ignore empty lines when in header mode, which is garbage, exactly because it would happily continue accumulating more headers. There may be other cases this misses, but this at least seems to fix the case I encountered. Linus mailinfo.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mailinfo.c b/mailinfo.c index a489d9d0f..68037758f 100644 --- a/mailinfo.c +++ b/mailinfo.c @@ -757,8 +757,13 @@ static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line) assert(!mi->filter_stage); if (mi->header_stage) { - if (!line->len || (line->len == 1 && line->buf[0] == '\n')) + if (!line->len || (line->len == 1 && line->buf[0] == '\n')) { + if (mi->inbody_header_accum.len) { + flush_inbody_header_accum(mi); + mi->header_stage = 0; + } return 0; + } } if (mi->use_inbody_headers && mi->header_stage) {
Re: Bug in "git am" when the body starts with spaces
Ok, did a bisect, and this bisects to commit 6b4b013f1884 ("mailinfo: handle in-body header continuations"). The continuation logic is oddly complex, and I can't follow the logic. But it is completely broken in how it thinks empty lines are somehow "continuations". Jonathan? Linus On Fri, Mar 31, 2017 at 5:24 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > I think the reason is that the "header continuation line" logic kicks > in because the lines in the body start with spaces, but that's > entirely incorrect, since > > (a) we're not in an email header > > (b) there's an empty line in between anyway, so no way are those body > lines continuation lines. > > I didn't check how far back this goes, I guess I'll do that next. But > I thought I'd report it here first in case somebody else goes "ahhh".
Bug in "git am" when the body starts with spaces
Try applying the attached patch with git am 0001-Test-patch.patch in the git repository. At least for me, it results in a very odd commit that has one single line in the commit message: Test patch This should go in the body not in the subject line which is obviously bogus. I think the reason is that the "header continuation line" logic kicks in because the lines in the body start with spaces, but that's entirely incorrect, since (a) we're not in an email header (b) there's an empty line in between anyway, so no way are those body lines continuation lines. I didn't check how far back this goes, I guess I'll do that next. But I thought I'd report it here first in case somebody else goes "ahhh". Linus From ad65cf7ba97ac071da1f845ec854165e7bf1efdf Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torva...@linux-foundation.org> Date: Fri, 31 Mar 2017 17:18:16 -0700 Subject: [PATCH] Test patch example Subject: [PATCH] Test patch This should go in the body not in the subject line --- Makefile | 1 + 1 file changed, 1 insertion(+) diff --git a/Makefile b/Makefile index 9b36068ac..9f36c149b 100644 --- a/Makefile +++ b/Makefile @@ -1,3 +1,4 @@ + # The default target of this Makefile is... all:: -- 2.12.2.401.g5d4234a49
Re: USE_SHA1DC is broken in pu
On Thu, Mar 23, 2017 at 9:43 AM, Johannes Schindelin wrote: > > What I am saying is that this should be a more fine-grained, runtime knob. No it really shouldn't. > If I write out an index, I should not suffer the slowdown from detecting > collisions. The index case is a complete red herring. As already noted, the proper fix for the index case is to simply do it asynchronously on read. On write, it's harder to do asynchronously, but for a 300MB index file you're likely going to be doing IO in the middle, so it's probably not even noticeable. But even *if* you want to worry about the index file, it shouldn't be a runtime knob. The index file is completely different from the other uses of SHA1DC. But the fact is, if you don't want SHA1DC, and you have crazy special cases, you just continue to build with openssl support instead. Nobody else should ever have to worry about *your* crazy cases. Linus
Re: [PATCH 0/2] Re-integrate sha1dc
On Thu, Mar 16, 2017 at 3:04 PM, Jeff King wrote: > > There are a few things I think are worth changing. The die() message > should mention the sha1 we computed. That will be a big help if an old > version of git tries to unknowingly push a colliding object to a newer > version. The user will see "collision on sha1 1234.." which gives them a > starting point to figure out where they got the bad object from. > > And to make that work, we have to disable the safe_hash feature (which > intentionally corrupts a colliding sha1). We _could_ rip it out > entirely, but since it only kicks in when we see a collision, I doubt > it's impacting anything. > > I also updated the timings in my commit message, and added a basic test. No complaints about your version. Linus
[PATCH 2/2] Integrate the sha1dc code with the git build
From: Linus Torvalds <torva...@linux-foundation.org> Date: Thu, 16 Mar 2017 13:08:38 -0700 Subject: [PATCH 2/2] Integrate the sha1dc code with the git build This adds the proper magic to actually build the sha1dc code as part of git when USE_SHA1DC is enabled. This includes - adjusting the sha1dc include directives for git use - adding the proper USE_SHA1DC logic to the Makefile - adding the SHA1DC case to the "hash.h" header - adding the proper "git platform" wrappers for the SHA1 interface Much of this comes from Jeff King's previous integration effort, with modifications for the new world order of hash.h. Signed-off-by: Linus Torvalds <torva...@linux-foundation.org> --- Makefile | 10 ++ hash.h | 2 ++ sha1dc/sha1.c | 33 + sha1dc/sha1.h | 18 -- sha1dc/ubc_check.c | 4 ++-- sha1dc/ubc_check.h | 2 -- 6 files changed, 55 insertions(+), 14 deletions(-) diff --git a/Makefile b/Makefile index a5a11e721..186ce17f2 100644 --- a/Makefile +++ b/Makefile @@ -140,6 +140,10 @@ all:: # Define PPC_SHA1 environment variable when running make to make use of # a bundled SHA1 routine optimized for PowerPC. # +# Define USE_SHA1DC to unconditionally enable the collision-detecting sha1 +# algorithm. This is slower, but may detect attempted collision attacks. +# Takes priority over other *_SHA1 knobs. +# # Define SHA1_MAX_BLOCK_SIZE to limit the amount of data that will be hashed # in one call to the platform's SHA1_Update(). e.g. APPLE_COMMON_CRYPTO # wants 'SHA1_MAX_BLOCK_SIZE=1024L*1024L*1024L' defined. @@ -1383,6 +1387,11 @@ ifdef APPLE_COMMON_CRYPTO SHA1_MAX_BLOCK_SIZE = 1024L*1024L*1024L endif +ifdef USE_SHA1DC + LIB_OBJS += sha1dc/sha1.o + LIB_OBJS += sha1dc/ubc_check.o + BASIC_CFLAGS += -DSHA1DC +else ifdef BLK_SHA1 LIB_OBJS += block-sha1/sha1.o BASIC_CFLAGS += -DSHA1_BLK @@ -1400,6 +1409,7 @@ else endif endif endif +endif ifdef SHA1_MAX_BLOCK_SIZE LIB_OBJS += compat/sha1-chunked.o diff --git a/hash.h b/hash.h index f0d9ddd0c..b9e7e34fc 100644 --- a/hash.h +++ b/hash.h @@ -3,6 +3,8 @@ #if defined(SHA1_PPC) #include "ppc/sha1.h" +#elif defined(SHA1DC) +#include "sha1dc/sha1.h" #elif defined(SHA1_APPLE) #include #elif defined(SHA1_OPENSSL) diff --git a/sha1dc/sha1.c b/sha1dc/sha1.c index 8d12b832b..76a8d1a85 100644 --- a/sha1dc/sha1.c +++ b/sha1dc/sha1.c @@ -5,14 +5,9 @@ * https://opensource.org/licenses/MIT ***/ -#include -#include -#include -#include - -#include "sha1.h" -#include "ubc_check.h" - +#include "git-compat-util.h" +#include "sha1dc/sha1.h" +#include "sha1dc/ubc_check.h" /* Because Little-Endian architectures are most common, @@ -1790,3 +1785,25 @@ int SHA1DCFinal(unsigned char output[20], SHA1_CTX *ctx) output[19] = (unsigned char)(ctx->ihv[4]); return ctx->found_collision; } + +static const char collision_message[] = +"The SHA1 computation detected evidence of a collision attack;\n" +"refusing to process the contents."; + +void git_SHA1DCFinal(unsigned char hash[20], SHA1_CTX *ctx) +{ + if (SHA1DCFinal(hash, ctx)) + die(collision_message); +} + +void git_SHA1DCUpdate(SHA1_CTX *ctx, const void *vdata, unsigned long len) +{ + const char *data = vdata; + /* We expect an unsigned long, but sha1dc only takes an int */ + while (len > INT_MAX) { + SHA1DCUpdate(ctx, data, INT_MAX); + data += INT_MAX; + len -= INT_MAX; + } + SHA1DCUpdate(ctx, data, len); +} diff --git a/sha1dc/sha1.h b/sha1dc/sha1.h index e867724c0..37ee415da 100644 --- a/sha1dc/sha1.h +++ b/sha1dc/sha1.h @@ -9,8 +9,6 @@ extern "C" { #endif -#include - /* uses SHA-1 message expansion to expand the first 16 words of W[] to 80 words */ 
/* void sha1_message_expansion(uint32_t W[80]); */ @@ -100,6 +98,22 @@ void SHA1DCUpdate(SHA1_CTX*, const char*, size_t); /* returns: 0 = no collision detected, otherwise = collision found => warn user for active attack */ int SHA1DCFinal(unsigned char[20], SHA1_CTX*); + +/* + * Same as SHA1DCFinal, but convert collision attack case into a verbose die(). + */ +void git_SHA1DCFinal(unsigned char [20], SHA1_CTX *); + +/* + * Same as SHA1DCUpdate, but adjust types to match git's usual interface. + */ +void git_SHA1DCUpdate(SHA1_CTX *ctx, const void *data, unsigned long len); + +#define platform_SHA_CTX SHA1_CTX +#define platform_SHA1_Init SHA1DCInit +#define platform_SHA1_Update git_SHA1DCUpdate +#define platform_SHA1_Final git_SHA1DCFinal + #if defined(__cplusplus) } #endif diff --git a/sha1dc/ubc_check.c b/sha1dc/ubc_check.c index 27d0976da..089dd4743 100644 --- a/sha1dc/ubc_check.c +++ b/sha1dc/ubc_check.c @@ -24,8 +24,8 @@ // ubc_check has been ve
Re: USE_SHA1DC is broken in pu
On Thu, Mar 16, 2017 at 12:51 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > I'll send a patch on top of 'next', which already has the header file changes. Patches sent. It all looked fairly straightforward to me, but maybe I missed something. Linus
[PATCH 0/2] Re-integrate sha1dc
I suspect the first patch will not make it to the list since it's over 100kB in size, but oh well.. Junio and Jeff will see it. This is sent as two patches, just to have the original upstream code as a first step, and then the second patch does the small modifications to integrate it with git. It "WorksForMe(tm)" and the integration patches are now fairly trivial, since upstream already did the dieting and some of the semantic changes to git's more traditional C code. I did leave the C++ wrapper lines that the sha1dc header files have grown in the meantime; I debated removing them but felt that "closer to upstream" was worth it. Linus
Re: USE_SHA1DC is broken in pu
On Thu, Mar 16, 2017 at 12:46 PM, Junio C Hamano wrote: > > That's easy to answer. What we have on 'pu' is a fair game for > wholesale replacement. That is the whole point of not merging > topics in flux to 'next' and declaring that 'pu' will constantly > rewind. Ok. I'll send a patch on top of 'next', which already has the header file changes. Linus
Re: USE_SHA1DC is broken in pu
On Thu, Mar 16, 2017 at 12:41 PM, Jeff King wrote: > > Potentially we should just eject sha1dc from "pu" for the moment. It > needs re-rolled with the most recent version of the collision library > (and I see Marc just posted that they hit a stable point, which is > perhaps why you're looking at it). I looked at it, and even created a patch, and then decided that you'd probably do it. But yes, re-integrating entirely (rather than creating a patch against the previous SHA1DC) might be simpler. Junio, which way do you want to go? Linus
USE_SHA1DC is broken in pu
I think there's a semantic merge error and it clashes with f18f816cb158 ("hash.h: move SHA-1 implementation selection into a header file"). Suggested possible merge resolution attached. Linus Makefile | 2 +- hash.h | 2 ++ 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/Makefile b/Makefile index 9d1d958bd..186ce17f2 100644 --- a/Makefile +++ b/Makefile @@ -1388,9 +1388,9 @@ ifdef APPLE_COMMON_CRYPTO endif ifdef USE_SHA1DC - SHA1_HEADER = "sha1dc/sha1.h" LIB_OBJS += sha1dc/sha1.o LIB_OBJS += sha1dc/ubc_check.o + BASIC_CFLAGS += -DSHA1DC else ifdef BLK_SHA1 LIB_OBJS += block-sha1/sha1.o diff --git a/hash.h b/hash.h index f0d9ddd0c..b7f4f1fd8 100644 --- a/hash.h +++ b/hash.h @@ -7,6 +7,8 @@ #include #elif defined(SHA1_OPENSSL) #include +#elif defined(SHA1DC) +#include "sha1dc/sha1.h" #else /* SHA1_BLK */ #include "block-sha1/sha1.h" #endif
Re: [PATCH] Put sha1dc on a diet
On Mon, Mar 13, 2017 at 1:12 PM, Marc Stevens wrote: > Indeed, I've committed a fix, and a small bug fix for the new code just now. Unrelated side note: there may be some missing dependencies in the build infrastructure or something, because when I tried Jeff's script that did that "test the merge and the two parents", and used the pack-file of the kernel for testing, I got: 5611971c610143e6d38bbdca463f4c9f79a056a0 /home/torvalds/v2.6/linux/.git/objects/pack/pack-153bb8cd11846cf9a27ef7b1069aa9cb9f5b724f.pack real 0m2.432s user 0m2.348s sys 0m0.084s 5611971c610143e6d38bbdca463f4c9f79a056a0 /home/torvalds/v2.6/linux/.git/objects/pack/pack-153bb8cd11846cf9a27ef7b1069aa9cb9f5b724f.pack real 0m3.747s user 0m3.672s sys 0m0.076s 5611971c610143e6d38bbdca463f4c9f79a056a0 *coll* /home/torvalds/v2.6/linux/.git/objects/pack/pack-153bb8cd11846cf9a27ef7b1069aa9cb9f5b724f.pack real 0m5.061s user 0m4.984s sys 0m0.077s never mind the performance, notice the *coll* in that last case. But doing a "git clean -dqfx; make -j8" and re-testing the same tree, the issue is gone. I suspect some dependency on a header file is broken, causing some object file to not be properly re-built, which in turn then incorrectly causes the 'ctx2.found_collision' test to test the wrong bit or something. Linus
Re: Stable GnuPG interface, git should use GPGME
On Fri, Mar 10, 2017 at 2:00 AM, Bernhard E. Reiter wrote: > > git uses an pipe-and-exec approach to running a GnuPG binary > as writen in the documentation [1]: > > gpg.program >Use this custom program instead of "gpg" found on $PATH when making >or verifying a PGP signature. The program must support the same >command-line interface as GPG > > please consider using libgpgme interfacing to GnuPG, because the gpg > command-line interface is not considered an official API to GnuPG by the > GnuPG-devs and thus potentially unstable. Quite frankly, I will NAK this just based on previous bad experiences with using "helpful" libraries. Maybe you can lay my worries to rest, but the problems with libraries in this context tend to be - hard to personalize. At least right now, we just allow people to set their gpg binary. I'm betting that the library would pick whatever random preferred version, and in the process possibly screw us. Example: what if somebody is actually using another pgp implementation entirely for some reason, and is just scripting around it? Or maybe he's using the regular gnupg, but using different keys for different projects (using "--homedir"). That's trivial with the current model. How trivial is that with a library? - existing configuration This is the main problem I've seen in the past. Using the "ssh" _program_ is easy. You add your keys, your config files, whatever, and it "just works" (or rather, you fight it once and it definitely doesn't "just" work, but then you copy your .ssh directory around for the rest of your life and forget how it ever worked, but it does). Using "libssh2" is an exercise in futility: you have to do a crazy amount of stupid "look up keys" work yourself, and the simple configurations in your .ssh/config (like per-host keys, hostname swizzling etc) that you already did for the program just don't get picked up. - UI For things like gpg, the UI is traditionally horrible. But there tends to be various things like password agents that help with caching passwords and/or add a graphical UI to get the password when necessary. - library versioning. I don't know why, but I've never *ever* met a library developer who realized that libraries were all about stable API's, and the library users don't want to fight different versions. And to make matters worse, the different versions (particularly if you end up having to use a development version due to bugs or required features etc) are always made horribly bad to even detect at built-time automatically with simple #ifdef etc, so now you have to do autoconf crap etc. Now, it may be that the pgpme library "just works" across architectures and handles all of the above situations as gracefully as the external program does. In that case - but _ONLY_ in that case - would a switch-over to the library possibly be a good thing. I'd be pleasantly surprised. But I *would* be surprised, because every time I've seen that "library vs program" model, I've seen the above issues. In fact, we have those exact issues very much in git itself too. Yes, I've used libgit2 (for subsurface). It's a pain in the arse to do *exactly* the above kinds of things, and the thing is, that isn't git-specific. So I'm very down on using external libraries unless they are stable and have no need for configuration etc. Things like zlib is fine - there just isn't much to configure outside of the "just how hard do you want me to try to compress". Nobody has a ".zlib/config" file that you need to worry about accessing etc.
Of course, maybe pgpme is a world first, and actually does read your .gnupg/config file trivially, and has all the gpg agent integration that it picks up automatically, and allows various per-user configurations, and all my worries are bogus. But that would literally be the first time I've ever seen that. Linus
Re: RFC: Another proposed hash function transition plan
On Tue, Mar 7, 2017 at 10:57 AM, Ian Jackson wrote: > > Also I think you need to specify how abbreviated object names are > interpreted. One option might be to not use hex for the new hash, but base64 encoding. That would make the full size ASCII hash encoding length roughly similar (43 base64 characters rather than 40), which would offset some of the new costs (longer filenames in the loose format, for example). Also, since 256 isn't evenly divisible by 6, and because you'd want some way to explicitly disambiguate the new hashes, the rule *could* be that the ASCII representation of a new hash is the base64 encoding of the 258-bit value that has "10" prepended to it as padding. That way the first character of the hash would be guaranteed to not be a hex digit, because it would be in the range [g-v] (indexes 32..47). Of course, the downside is that base64 encoded hashes can also end up looking very much like real words, and now case would matter too. The "use base64 with a "10" two-bit padding prepended" also means that the natural loose format radix format would remain the first 2 characters of the hash, but due to the first character containing the padding, it would be a fan-out of 2**10 rather than 2**12. Of course, having written that, I now realize how it would cause problems for the usual shit-for-brains case-insensitive filesystems. So I guess base64 encoding doesn't work well for that reason. Linus
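To make the arithmetic above concrete, here is a minimal, self-contained sketch of such an encoding (the helper name is invented and the alphabet is just the standard base64 one; this is an illustration, not anything proposed on the list): two padding bits plus 256 hash bits is 258 bits, which packs into exactly 43 six-bit groups, and the leading "10" pins the first group into the 32..47 range, i.e. 'g'..'v'.

#include <stdio.h>

static const char b64[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Hypothetical helper: emit a 256-bit hash as 43 base64 characters,
 * with the two bits "10" prepended so the total is 258 = 43 * 6 bits.
 * The first 6-bit group is 0b10xxxx (32..47), so the first character
 * can never be a hex digit. */
static void encode_new_hash(const unsigned char hash[32], char out[44])
{
	unsigned int acc = 2, bits = 2;	/* start with the "10" padding bits */
	int i, o = 0;

	for (i = 0; i < 32; i++) {
		acc = (acc << 8) | hash[i];
		bits += 8;
		while (bits >= 6) {
			bits -= 6;
			out[o++] = b64[(acc >> bits) & 0x3f];
		}
	}
	out[o] = '\0';	/* 258 divides evenly by 6, so o == 43 here */
}

int main(void)
{
	unsigned char hash[32] = { 0 };	/* stand-in for a real 256-bit hash */
	char buf[44];

	encode_new_hash(hash, buf);
	printf("%s\n", buf);	/* 43 chars; starts with 'g' for the all-zero hash */
	return 0;
}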
Re: RFC: Another proposed hash function transition plan
On Mon, Mar 6, 2017 at 10:39 AM, Jonathan Tanwrote: > > I think "nohash" can be explained in 2 points: I do think that that was my least favorite part of the suggestion. Not just "nohash", but all the special "hash" lines too. I would honestly hope that the design should not be about "other hashes". If you plan your expectations around the new hash being broken, something is wrong to begin with. I do wonder if things wouldn't be simpler if the new format just included the SHA1 object name in the new object. Put it in the "header" line of the object, so that every time you look up an object, you just _see_ the SHA1 of that object. You can even think of it as an additional protection. Btw, the multi-collision attack referenced earlier does _not_ work for an iterated hash that has a bigger internal state than the final hash. Which is actually a real argument against sha-256: the internal state of sha-256 is 256 bits, so if an attack can find collisions due to some weakness, you really can then generate exponential collisions by chaining a linear collision search together. But for sha3-256 or blake2, the internal hash state is larger than the final hash, so now you need to generate collisions not in the 256 bits, but in the much larger search space of the internal hash space if you want to generate those exponential collisions. So *if* the new object format uses a git header line like "blob \0" then it would inherently contain that mapping from 256-bit hash to the SHA1, but it would actually also protect against attacks on the new hash. In fact, in particular for objects with internal format that differs between the two hashing models (ie trees and commits which to some degree are higher-value targets), it would make attacks really quite complicated, I suspect. And you wouldn't need those "hash" or "nohash" things at all. The old SHA1 would simply always be there, and cheap to look up (ie you wouldn't have to unpack the whole object). Hmm? Linus
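Purely as an illustration of that idea (the mail never pins down a layout, so the format string and the example SHA-1 below are assumptions, not a proposal): the header of a converted object could simply keep the legacy name next to the type and size, and the new 256-bit hash would then be computed over this header plus the content, so resolving a new-style name automatically yields the old SHA1 as well.

#include <stdio.h>

int main(void)
{
	const char *type = "blob";
	unsigned long size = 1234;
	/* stand-in for the object's legacy SHA-1 name, hex-encoded */
	const char *legacy_sha1 = "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12";
	char hdr[128];
	int hdrlen;

	/* same shape as git's existing "type size\0" header, with the old
	 * name appended; hdrlen counts the terminating NUL */
	hdrlen = snprintf(hdr, sizeof(hdr), "%s %lu %s", type, size, legacy_sha1) + 1;
	printf("header (%d bytes including NUL): %s\n", hdrlen, hdr);
	return 0;
}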
Re: Delta compression not so effective
On Sat, Mar 4, 2017 at 12:27 AM, Marius Storm-Olsen wrote: > > I reran the repack with the options above (dropping the zlib=9, as you > suggested) > > $ time git -c pack.threads=4 repack -a -d -F \ >--window=350 --depth=250 --window-memory=30g > > and ended up with > $ du -sh . > 205G. > > In other words, going from 6G to 30G window didn't help a lick on finding > deltas for those binaries. Ok. > I did > fprintf(stderr, "%s %u %lu\n", > sha1_to_hex(delta_list[i]->idx.sha1), > delta_list[i]->hash, > delta_list[i]->size); > > I assume that's correct? Looks good. > I've removed all commit messages, and "sanitized" some filepaths etc, so > name hashes won't match what's reported, but that should be fine. (the > object_entry->hash seems to be just a trivial uint32 hash for sorting > anyways) Yes. I see your name list and your pack-file index. > BUT, if I look at the last 3 entries of the sorted git verify-pack output, > and look for them in the 'git log --oneline --raw -R --abbrev=40' output, I > get: ... > while I cannot find ANY of them in the delta_list output?? \ Yes. You have a lot of object names in that log file you sent in private that aren't in the delta list. Now, objects smaller than 50 bytes we don't ever try to even delta. I can't see the object sizes when they don't show up in the delta list, but looking at some of those filenames I'd expect them to not fall in that category. I guess you could do the printout a bit earlier (on the "to_pack.objects[]" array - to_pack.nr_objects is the count there). That should show all of them. But the small objects shouldn't matter. But if you have a file like extern/win/FlammableV3/x64/lib/FlameProxyLibD.lib I would have assumed that it has a size that is > 50. Unless those "extern" things are placeholders? > You might get an idea for how to easily create a repo which reproduces the > issue, and which would highlight it more easily for the ML. Looking at your sorted object list ready for packing, it doesn't look horrible. When sorting for size, it still shows a lot of those large files with the same name hash, so they sorted together in that form too. I do wonder if your dll data just simply is absolutely horrible for xdelta. We've also limited the delta finding a bit, simply because it had some O(m*n) behavior that gets very expensive on some patterns. Maybe your blobs trigger some of those cases. The diff-delta work all goes back to 2005 and 2006, so it's a long time ago. What I'd ask you to do is try to find if you could make a repository of just one of the bigger DLL's with its history, particularly if you can find some that you don't think is _that_ sensitive. Looking at it, for example, I see that you have that file extern/redhat-5/FlammableV3/x64/plugins/libFlameCUDA-3.0.703.so that seems to have changed several times, and is a largish blob. Could you try creating a repository with git fast-import that *only* contains that file (or pick another one), and see if that delta's well? And if you find some case that doesn't xdelta well, and that you feel you could make available outside, we could have a test-case... Linus
Re: RFC: Another proposed hash function transition plan
On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder wrote: > > This document is still in flux but I thought it best to send it out > early to start getting feedback. This actually looks very reasonable if you can implement it cleanly enough. In many ways the "convert entirely to a new 256-bit hash" is the cleanest model, and interoperability was at least my personal concern. Maybe your model solves it (devil in the details), in which case I really like it. I do think that if you end up essentially converting the objects without really having any true backwards compatibility at the object layer (just the translation code), you should seriously look at doing some other changes at the same time. Like not using zlib compression, it really is very slow. Btw, I do think the particular choice of hash should still be on the table. sha-256 may be the obvious first choice, but there are definitely a few reasons to consider alternatives, especially if it's a complete switch-over like this. One is large-file behavior - a parallel (or tree) mode could improve on that noticeably. BLAKE2 does have special support for that, for example. And SHA-256 does have known attacks compared to SHA-3-256 or BLAKE2 - whether that is due to age or due to more effort, I can't really judge. But if we're switching away from SHA1 due to known attacks, it does feel like we should be careful. Linus
Re: SHA1 collisions found
On Thu, Mar 2, 2017 at 5:50 PM, Mike Hommey wrote: > > What if the "object version" is a hash of the content (as opposed to > header + content like the normal git hash)? It doesn't actually matter for that attack. The concept of the attack is actually fairly simple: generate a lot of collisions in the first hash (and they outline how you only need to generate 't' serial collisions and turn them into 2**t collisions), and then just check those collisions against the second hash. If you have enough collisions in the first one, having a collision in the second one is inevitable just from the birthday rule. Now, *in practice* that attack is not very easy to do. Each collision is still hard to generate. And because of the git object rules (the size has to match), it limits you a bit in what collisions you generate. But the fact that git requires the right header can be considered just a variation on the initial state to SHA1, and then the additional requirement might be as easy as just saying that your collision generation function just always needs to generate a fixed-size block (which could be just 64 bytes - the SHA1 blocking size). So assuming you can arbitrarily generate collisions (brute force: 2**80 operations) you could make the rule be that you generate something that starts with one 64-byte block that matches the git rules: "blob 6454\0"..pad with repeating NUL bytes.. and then you generate 100 pairs of 64-byte SHA1 collisions (where the first starts with the initial value of that fixed blob prefix, the next with the state after the first block, etc etc). Now you can generate 2**100 different sequences that all are exactly 6464 bytes (101 64-byte blocks) and all have the same SHA1 - and all share that first fixed 64-byte block. You needed "only" on the order of 100 * 2**80 SHA1 operations to do that in theory. And since you have 2**100 objects, you know that you will have a likely birthday paradox even if your secondary hash is 200 bits long. So all in all, you generated a collision in on the order of 2**100 operations. So instead of getting the security of "160+200" bits, you only got 200 bits worth of real collision resistance. NOTE!! All of the above is very much assuming "brute force". If you can brute-force the hash, you can completely break any hash. The input block to SHA1 is 64 bytes, so by definition you have 512 bits of data to play with, and you're generating a 160-bit output: there is no question what-so-ever that you couldn't generate any hash you want if you brute-force things. The place where things like having a fixed object header can help is when the attack in question requires some repeated patterns. For example, if you're not brute-forcing things, your attack on the hash will likely involve using very particular patterns to change a number of bits in certain ways, and then combining those particular patterns to get the hash collision you wanted. And *that* is when you may need to add some particular data to the middle to make the end result be a particular match. But a brute-force attack definitely doesn't need size changes. You can make the size be pretty much anything you want (modulo really small inputs, of course - a one-byte input only has 256 different possible hashes ;) if you have the resources to just go and try every possible combination until you get the hash you wanted. I may have overly simplified the paper, but that's the basic idea. Linus
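The cost argument above, written out as a formula (this just restates the mail's own numbers, no new analysis):

\[
  \text{cost} \;\approx\; t \cdot 2^{80} \;+\; 2^{t}
\]

(build $t$ chained single-block SHA-1 collisions, then run the second hash over all $2^{t}$ resulting messages). With a 200-bit second hash the birthday bound needs $2^{t} \approx 2^{200/2}$, so $t = 100$ and

\[
  \text{cost} \;\approx\; 100 \cdot 2^{80} + 2^{100} \;\approx\; 2^{100} \;=\; 2^{200/2},
\]

i.e. the concatenated pair is only about as collision-resistant as the 200-bit hash alone, nowhere near $2^{(160+200)/2}$.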
Re: SHA1 collisions found
On Thu, Mar 2, 2017 at 12:43 PM, Junio C Hamano wrote: > > My reaction heavily depends on how that "object version" thing > works. > > Would "object version" be like a truncated SHA-1 over the same data > but with different IV or something, i.e. something that guarantees > anybody would get the same result given the data to be hashed? Yes, it does need to be that in practice. So what I was thinking the object version would be is: (a) we actually take the object type into account explicitly. (b) we explicitly add another truncated hash. The first part we can already do without any actual data structure changes, since basically all users already know the type of an object when they look it up. So we already have information that we could use to narrow down the hash collision case if we saw one. There are some (very few) cases where we don't already explicitly have the object type (a tag reference can be any object, for example, and existing scripts might ask for "give me the type of this SHA1 object" with "git cat-file -t"), but that just goes back to the whole "yeah, we'll handle legacy uses and we will look up objects even _without_ the extra version data", so it actually integrates well into the whole notion. Basically, once you accept that "hey, we'll just have a list of objects with that hash", it just makes sense to narrow it down by the object type we also already have. But yes, the object type is obviously only two bits of information (actually, considering the type distribution, probably just one bit), and it's already encoded in the first hash, so it doesn't actually help much as "collision avoidance" particularly once you have a particular attack against that hash in place. It's just that it *is* extra information that we already have, and that is very natural to use once you start thinking of the hash lookup as returning a list of objects. It also mitigates one of the worst _confusions_ in git, and so basically mitigates the worst-case downside of an attack basically for free, so it seems like a no-brainer. But the real new piece of object version would be a truncated second hash of the object. I don't think it matters too much what that second hash is, I would say that we'd just approximate having a total of 256 bits of hash. Since we already have basically 160 bits of fairly good hashing, and roughly 128 bits of that isn't known to be attackable, we'd just use another hash and truncate that to 128 bits. That would be *way* overkill in practice, but maybe overkill is what we want. And it wouldn't really expand the objects all that much more than just picking a new 256-bit hash would do. So you'd have to be able to attack both the full SHA1, _and_ whatever other different good hash to 128 bits. Linus PS. if people think that SHA1 is of a good _size_, and only worry about the known weaknesses of the hashing itself, we'd only need to get back the bits that the attacks take away from brute force. That's currently the 80 -> ~63 bits attack, so you'd really only want about 40 bits of second hash to claw us back up to 80 bits of brute force (again: brute force is basically sqrt() of the search space, so half the bits, so adding 40 bits of hash adds 20 bits to the brute force cost and you'd get back up to the 2**80 we started with). So 128 bits of secondary hash really is much more than we'd need. 64 bits would probably be fine.
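A toy, stand-alone illustration of that "list of objects with that hash" idea (none of this is git code; the struct, the tiny object table and the names are all invented): the lookup returns whatever matches the primary name, and the type the caller already knows picks the right entry even in the unlucky collision case.

#include <stdio.h>
#include <string.h>

enum obj_type { OBJ_BLOB, OBJ_TREE, OBJ_COMMIT, OBJ_TAG };

struct obj {
	const char *hash;	/* primary (possibly colliding) name */
	enum obj_type type;
	const char *payload;
};

static const struct obj objects[] = {
	{ "deadbeef", OBJ_BLOB,   "blob contents" },
	{ "deadbeef", OBJ_COMMIT, "commit contents" },	/* same name, different type */
	{ "cafebabe", OBJ_TREE,   "tree contents" },
};

/* narrow the "bucket" of same-named objects down by the expected type */
static const struct obj *lookup(const char *hash, enum obj_type type)
{
	size_t i;

	for (i = 0; i < sizeof(objects) / sizeof(objects[0]); i++)
		if (!strcmp(objects[i].hash, hash) && objects[i].type == type)
			return &objects[i];
	return NULL;
}

int main(void)
{
	const struct obj *o = lookup("deadbeef", OBJ_COMMIT);

	printf("%s\n", o ? o->payload : "not found");
	return 0;
}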
Re: SHA1 collisions found
On Thu, Mar 2, 2017 at 1:54 PM, Joey Hess wrote: > > There's a surprising result of combining iterated hash functions, that > the combination is no more difficult to attack than the strongest hash > function used. Duh. I should actually have known that. I started reading the paper and went "this seems very familiar". I'm pretty sure I've been pointed at that paper before (or maybe just a similar one), and I just didn't react enough for it to leave a lasting impact. Linus
Re: SHA1 collisions found
On Fri, Feb 24, 2017 at 4:39 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > Honestly, I think that a primary goal for a new hash implementation > absolutely needs to be to minimize mixing. > > Not for security issues, but because of combinatorics. You want to > have a model that basically reads old data, but that very aggressively > approaches "new data only" in order to avoid the situation where you > have basically the exact same tree state, just _represented_ > differently. > > For example, what I would suggest the rules be is something like this: Hmm. Having looked at this a fair amount, and in particularly having looked at the code as part of the hash typesafety patch I did, I am actually starting to think that it would not be too painful at all to have a totally different approach, which might be a lot easier to do. So bear with me, let me try to explain my thinking: (a) if we want to be backwards compatible and not force people to convert their trees on some flag day, we're going to be stuck with having to have the SHA1 code around and all the existing object parsing for basically forever Now, this basically means that the _easy_ solution would be that we just do the flag day, switch to sha-256, extend everything to 32-byte hashes, and just have a "git2 fast-import" that makes it easy to convert stuff. But it really would be a completely different version of git, with a new pack-file format and no real compatibility. Such a flag-day approach would certainly have advantages: it would allow for just re-architecting some bad choices: - make the hashing be something that can be threaded (ie maybe we can just block it up in 4MB chunks that you can hash in parallel, and make the git object hash be the hash of hashes) - replace zlib with something like zstd - get rid of old pack formats etc. but on the whole, I still think that the compatibility would be worth much more than the possible technical advantages of a clean slate restart. (b) the SHA1 hash is actually still quite strong, and the collision detection code effectively means that we don't really have to worry about collisions in the immediate future. In other words, the mitigation of the current attack is actually really easy technically (modulo perhaps the performance concerns), and there's still nothing fundamentally wrong with using SHA1 as a content hash. It's still a great hash. Now, my initial reaction (since it's been discussed for so long anyway) was obviously "pick a different hash". That was everybody's initial reaction, I think. But I'm starting to think that maybe that initial obvious reaction was wrong. The thing that makes collision attacks so nasty is that our reaction to a collision is so deadly. But that's not necessarily fundamental: we certainly uses hashes with collisions every day, and they work fine. And they work fine because the code that uses those hashes is designed to simply deal gracefully - although very possibly with a performance degradation - with two different things hashing to the same bucket. So what if the solution to "SHA1 has potential collisions" is "any hash can have collisions in theory, let's just make sure we handle them gracefully"? Because I'm looking at our data structures that have hashes in them, and many of them are actually of the type where I go "Hmm.. Replacing the hash entirely is really really painful - but it wouldn't necessarily be all that nasty to extend the format to have additional version information". 
and the advantage of that approach is that it actually makes the compatibility part trivial. No more "have to pick a different hash and two different formats", and more of a "just the same format with extended information that might not be there for old objects". So we have a few different types of formats: - the purely local data structures: the pack index file, the file index, our refs etc These we could change completely, and it wouldn't even be all that painful. The pack index has already gone through versions, and it doesn't affect anything else. - the actual objects. These are fairly painful to change, particularly things like the "tree" object which is probably the worst designed of the lot. Making it contain a fixed-size binary thing was really a nasty mistake. My bad. - the pack format and the protocol to exchange "I have this" information This is *really* painful to change, because it contains not just the raw object data, but it obviously ends up being the wire format for remote accesses. and it turns out that *all* of these formats look like they would be fairly easy to extend to having extra object version information. Some of that extra object version information we already have and don't use, in fact. Even the tree format, with the annoying fixed-size binary blob. Yes,
Re: [PATCH] Put sha1dc on a diet
On Thu, Mar 2, 2017 at 10:37 AM, Jeff Hostetler wrote: >> >> Now, if your _file_ index is 300-400MB (and I do think we check the >> SHA fingerprint on that even on just reading it - verify_hdr() in >> do_read_index()), then that's going to be a somewhat noticeable hit on >> every normal "git diff" etc. > > Yes, the .git/index is 450MB with ~3.1M entries. verify_hdr() is called > each time we read it into memory. Ok. So that's really just a purely historical artifact. The file index is actually the first part of git to have ever been written. You can't even see it in the history, because the initial revision from Apr 7, 2005, obviously depended on the actual object hashing. But the file index actually came first. You can _kind_ of see that in the layout of the original git tree, and how the main header file is still called "cache.h", and how the original ".git" directory was actually called ".dircache". And the two biggest files (by a fairly big margin) are "read-cache.c" and "update-cache.c". So that file index cache was in many ways _the_ central part of the original git model. The sha1 file indexing and object database was just the backing store for the file index. But part of that history is then how much I worried about corruption of that index (and, let's face it, general corruption resistance _was_ one of the primary design goals - performance was high up there too, but safety in the face of filesystem corruption was and is a primary issue). But realistically, I don't think we've *ever* hit anything serious on the index file, and it's obviously not a security issue. It also isn't even a compatibility issue, so it would be trivial to just bump the version header and say that the signature changes the meaning of the checksum. That said: > We have been testing a patch in GfW to run the verification in a separate > thread > while the main thread parses (and mallocs) the cache_entries. This does help > offset the time. Yeah, that seems an even better solution, honestly. The patch would be cleaner without the NO_PTHREADS things. I wonder how meaningful that thing even is today. Looking at what seems to select NO_PTHREADS, I suspect that's all entirely historical. For example, you'll see it for QNX etc, which seems wrong - QNX definitely has pthreads according to their docs, for example. Linus
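The shape of that "verify in a separate thread" approach, as a self-contained sketch (a trivial additive checksum stands in for the SHA-1 trailer, and every name and size here is made up; the real GfW patch is obviously different):

#include <pthread.h>
#include <stdio.h>
#include <string.h>

struct verify_job {
	const unsigned char *buf;
	size_t len;
	unsigned long expected;
	int ok;
};

/* worker: read-only checksum pass over the loaded index data */
static void *verify_worker(void *arg)
{
	struct verify_job *job = arg;
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < job->len; i++)
		sum += job->buf[i];
	job->ok = (sum == job->expected);
	return NULL;
}

int main(void)
{
	unsigned char index_data[4096];
	struct verify_job job;
	pthread_t tid;

	memset(index_data, 7, sizeof(index_data));
	job.buf = index_data;
	job.len = sizeof(index_data);
	job.expected = 7UL * sizeof(index_data);
	job.ok = 0;

	pthread_create(&tid, NULL, verify_worker, &job);

	/* ... the main thread would parse and malloc the cache entries here ... */

	pthread_join(tid, NULL);	/* wait before trusting the parsed result */
	if (!job.ok) {
		fprintf(stderr, "index checksum mismatch\n");
		return 1;
	}
	return 0;
}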
Re: [PATCH] Put sha1dc on a diet
On Thu, Mar 2, 2017 at 6:45 AM, Johannes Schindelinwrote: > > It would probably make sense to switch the index integrity check away from > SHA-1 because we really only care about detecting bit flips there, and we > have no need for the computational overhead of using a full-blown > cryptographic hash for that purpose. Which index do you actually see as being a problem, btw? The main file index (.git/index) or the pack-file indexes? We definitely don't need the checking version of sha1 for either of those, but as Jeff already did the math, at least the pack-file index is almost negligible, because the pack-file operations that update it end up doing SHA1 over the objects - and the object SHA1 calculations are much bigger. And I don't think we even check the pack-file index hashes except on fsck. Now, if your _file_ index is 300-400MB (and I do think we check the SHA fingerprint on that even on just reading it - verify_hdr() in do_read_index()), then that's going to be a somewhat noticeable hit on every normal "git diff" etc. But I'd have expected the stat() calls of all the files listed by that index to be the _much_ bigger problem in that case. Or do you just turn those off with assume-unchanged? Yeah, those stat calls are threaded when preloading, but even so.. Anyway, the file index SHA1 checking could probably just be disabled entirely (with a config flag). It's a corruption check that simply isn't that important. So if that's your main SHA1 issue, that would be easy to fix. Everything else - like pack-file generation etc for a big clone() may end up using a ton of SHA1 too, but the SHA1 costs all scale with the other costs that drown them out (ie zlib, network, etc). I'd love to see a profile if you have one. Linus
Re: [PATCH] Put sha1dc on a diet
On Wed, Mar 1, 2017 at 11:07 AM, Jeff King wrote: > > So obviously the smaller object size is nice, and the diffstat is > certainly satisfying. My only qualm would be whether this conflicts with > the optimizations that Dan is working on (probably not conceptually, but > textually). Yeah. But I'll happily just re-apply the patch on any new version that Dan posts. The patch is obviously trivial, even if size-wise it's a fair number of lines. So I wouldn't suggest using the patched version as some kind of starting point. It's much easier to just take a new version of upstream and repeat the diet patch on it. .. and obviously later versions of the upstream sha1dc code may not even need this at all since Dan and Marc are now aware of the issue. Linus
Re: Delta compression not so effective
On Wed, Mar 1, 2017 at 4:12 PM, Marius Storm-Olsen wrote: > > No, the list of git verify-objects in the previous post was from the bottom > of the sorted list, so those are the largest blobs, ~249MB.. .. so with a 6GB window, you should easily still have 20+ objects. Not a huge window, but it should find some deltas. But a smaller window - _together_ with a suboptimal sorting choice - could then result in lack of successful delta matches. > So, this repo must be knocking several parts of Git's insides. I was curious > about why it was so slow on the writing objects part, since the whole repo > is on a 4x RAID 5, 7k spindels. Now, they are not SSDs sure, but the thing > has ~400MB/s continuous throughput available. > > iostat -m 5 showed trickle read/write to the process, and 80-100% CPU single > thread (since the "write objects" stage is single threaded, obviously). So the writing phase isn't multi-threaded because it's not expected to matter. But if you can't even generate deltas, you aren't just *writing* much more data, you're compressing all that data with zlib too. So even with a fast disk subsystem, you won't even be able to saturate the disk, simply because the compression will be slower (and single-threaded). > Filenames are fairly static, and the bulk of the 6000 biggest non-delta'ed > blobs are the same DLLs (multiple of them) I think the first thing you should test is to repack with fewer threads, and a bigger pack window. Do something like -c pack.threads=4 --window-memory=30g instead. Just to see if that starts finding deltas. > Right, now on this machine, I really didn't notice much difference between > standard zlib level and doing -9. The 203GB version was actually with > zlib=9. Don't. zlib has *horrible* scaling with higher compressions. It doesn't actually improve the end result very much, and it makes things *much* slower. zlib was a reasonable choice when git started - well-known, stable, easy to use. But realistically it's a relatively horrible choice today, just because there are better alternatives now. >> Hos sensitive is your material? Could you make a smaller repo with >> some of the blobs that still show the symptoms? I don't think I want >> to download 206GB of data even if my internet access is good. > > Pretty sensitive, and not sure how I can reproduce this reasonable well. > However, I can easily recompile git with any recommended > instrumentation/printfs, if you have any suggestions of good places to > start? If anyone have good file/line numbers, I'll give that a go, and > report back? So the first thing you might want to do is to just print out the objects after sorting them, and before it starts trying to find deltas. See prepare_pack() in builtin/pack-objects.c, where it does something like this: if (nr_deltas && n > 1) { unsigned nr_done = 0; if (progress) progress_state = start_progress(_("Compressing objects"), nr_deltas); QSORT(delta_list, n, type_size_sort); ll_find_deltas(delta_list, n, window+1, depth, &nr_done); stop_progress(&progress_state); and notice that QSORT() line: that's what sorts the objects. You can do something like for (i = 0; i < n; i++) show_object_entry_details(delta_list[i]); right after that QSORT(), and make that print out the object hash, filename hash, and size (we don't have the filename that the object was associated with any more at that stage - they take too much space).
Save off that array for off-line processing: when you have the object hash, you can see what the contents are, and match it up with the file in the git history using something like git log --oneline --raw -R --abbrev=40 which shows you the log, but also the "diff" in the form of "this filename changed from SHA1 to SHA1", so you can match up the object hashes with where they are in the tree (and where they are in history). So then you could try to figure out if that type_size_sort() heuristic is just particularly horrible for you. In fact, if your data is not *so* sensitive, and you're ok with making the one-line commit logs and the filenames public, you could make just those things available, and maybe I'll have time to look at it. I'm in the middle of the kernel merge window, but I'm in the last stretch, and because of the SHA1 thing I've been looking at git lately. No promises, though. Linus
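For what it's worth, a possible body for that hypothetical show_object_entry_details() helper, going only by the field names already visible in this thread (the exact struct object_entry layout in builtin/pack-objects.c can differ between versions, so treat this as a fragment to drop in next to the QSORT() call, not a tested patch):

static void show_object_entry_details(struct object_entry *e)
{
	fprintf(stderr, "%s %u %lu\n",
		sha1_to_hex(e->idx.sha1),	/* object name */
		e->hash,			/* filename hash used for sorting */
		e->size);			/* uncompressed object size */
}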
Re: [PATCH] Put sha1dc on a diet
On Wed, Mar 1, 2017 at 12:34 PM, Jeff King wrote: > > I don't think that helps. The sha1 over the pack-file takes about 1.3s > with openssl, and 5s with sha1dc. So we already know the increase there > is only a few seconds, not a few minutes. Yeah, I did a few statistics by adding just logging of "SHA1_Init()" calls. For that network clone situation, the call distribution is
      1 SHA1: Init at builtin/index-pack.c:326
 841228 SHA1: Init at builtin/index-pack.c:450
      2 SHA1: Init at csum-file.c:152
4415756 SHA1: Init at sha1_file.c:3218
(the line numbers are a bit off from 'pu', because I obviously have the logging code). The big number (one for every object) is from write_sha1_file_prepare(), which we'd want to be the strong collision checking version because those are things we're about to create git objects out of. It's called from - hash_sha1_file() - doesn't actually write the object, but is used to calculate the sha for incoming data after applying the delta, for example. - write_sha1_file() - many uses, actually writes the object - hash_sha1_file_literally() - git hash-object and that index-pack.c:450 is from unpack_entry_data() for the base non-delta objects (which should also be the strong kind). So all of them should check against collision attacks, so none of them seem to be things you'd want to optimize away.. So I was wrong in thinking that there were a lot of unnecessary SHA1 calculations in that load. They all look like they should be done with the slower checking code. Oh well. Linus
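The instrumentation for such counts can be as dumb as a wrapper macro that logs file and line at every call site; this is a generic stand-alone illustration of the pattern (not the actual patch), with the per-call-site totals then coming from piping stderr through sort | uniq -c:

#include <stdio.h>

static void real_hash_init(void)
{
	/* stand-in for the real SHA1_Init() */
}

#define hash_init() do { \
	fprintf(stderr, "SHA1: Init at %s:%d\n", __FILE__, __LINE__); \
	real_hash_init(); \
} while (0)

int main(void)
{
	hash_init();
	hash_init();
	return 0;
}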
Re: [PATCH] Put sha1dc on a diet
On Wed, Mar 1, 2017 at 2:51 PM, Johannes Schindelin wrote: > > But I think bigger than just developers on Windows OS. There are many > developers out there working on large repositories (yes, much larger than > Linux). Also using Macs and Linux. I am not at all sure that we want to > give them an updated Git they cannot fail to notice to be much slower than > before. Johannes, have you *tried* the patches? I really don't think you have. It is completely unnoticeable in any normal situation. The two cases where it's noticeable are: - a full fsck is noticeably slower - a full non-local clone is slower (but not really noticeably so since the network traffic dominates). In other words, I think you're making shit up. I don't think you understand how little the SHA1 performance actually matters. It's noticeable in benchmarks. It's not noticeable in any normal operation. .. and yes, I've actually been running the patches locally since I posted my first version (which apparently didn't go out to the list because of list size limits) and now running the version in 'pu'. Linus
Re: [PATCH] Put sha1dc on a diet
On Wed, Mar 1, 2017 at 1:56 PM, Johannes Schindelin wrote: > Footnote *1*: I know, it is easy to forget that some developers cannot > choose their tools, or even their hardware. In the past, we seemed to take > appropriate care, though. I don't think you need to worry about the Windows side. That can continue to do something else. When I advocated perhaps using USE_SHA1DC by default, I definitely did not mean it in an "everywhere, regardless of issues" manner. For example, the config.mak.uname script already explicitly asks for "BLK_SHA1 = YesPlease" for Windows. Don't bother changing that, it's an explicit choice. But the Linux rules don't actually specify which SHA1 version to use, so the main Makefile currently defaults to just using openssl. So that's the "default" choice I think we might want to change. Not the "we're windows, and explicitly want BLK_SHA1 because of environment and build infrastructure". Linus
Re: Delta compression not so effective
On Wed, Mar 1, 2017 at 9:57 AM, Marius Storm-Olsen wrote: > > Indeed, I did do a > -c pack.threads=20 --window-memory=6g > to 'git repack', since the machine is a 20-core (40 threads) machine with > 126GB of RAM. > > So I guess with these sized objects, even at 6GB per thread, it's not enough > to get a big enough Window for proper delta-packing? Hmm. The 6GB window should be plenty good enough, unless your blobs are in the gigabyte range too. > This repo took >14hr to repack on 20 threads though ("compression" step was > very fast, but stuck 95% of the time in "writing objects"), so I can only > imagine how long a pack.threads=1 will take :) Actually, it's usually the compression phase that should be slow - but if something is limiting finding deltas (so that we abort early), then that would certainly tend to speed up compression. The "writing objects" phase should be mainly about the actual IO. Which should be much faster *if* you actually find deltas. > But arent't the blobs sorted by some metric for reasonable delta-pack > locality, so even with a 6GB window it should have seen ~25 similar objects > to deltify against? Yes they are. The sorting for delta packing tries to make sure that the window is effective. However, the sorting is also just a heuristic, and it may well be that your repository layout ends up screwing up the sorting, so that the windows just work very badly. For example, the sorting code thinks that objects with the same name across the history are good sources of deltas. But it may be that for your case, the binary blobs that you have don't tend to actually change in the history, so that heuristic doesn't end up doing anything. The sorting does use the size and the type too, but the "filename hash" (which isn't really a hash, it's something nasty to give reasonable results for the case where files get renamed) is the main sort key. So you might well want to look at the sorting code too. If filenames (particularly the end of filenames) for the blobs aren't good hints for the sorting code, that sort might end up spreading all the blobs out rather than sort them by size. And again, if that happens, the "can I delta these two objects" code will notice that the size of the objects are wildly different and won't even bother trying. Which speeds up the "compressing" phase, of course, but then because you don't get any good deltas, the "writing out" phase sucks donkey balls because it does zlib compression on big objects and writes them out to disk. So there are certainly multiple possible reasons for the deltification to not work well for you. How sensitive is your material? Could you make a smaller repo with some of the blobs that still show the symptoms? I don't think I want to download 206GB of data even if my internet access is good. Linus
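A toy version of that sort heuristic, just to show the shape (this is not git's actual type_size_sort(); the struct and the sample data are invented): group candidates by the filename-derived hash first, then by size with the bigger object first, so things that changed under the same path land next to each other in the delta window.

#include <stdio.h>
#include <stdlib.h>

struct cand {
	unsigned name_hash;	/* derived from the trailing path components */
	unsigned long size;
};

static int by_name_then_size(const void *a_, const void *b_)
{
	const struct cand *a = a_, *b = b_;

	if (a->name_hash != b->name_hash)
		return a->name_hash < b->name_hash ? -1 : 1;
	if (a->size != b->size)
		return a->size > b->size ? -1 : 1;	/* bigger object first */
	return 0;
}

int main(void)
{
	struct cand list[] = {
		{ 0xb00f, 1000 }, { 0xa111, 500 }, { 0xb00f, 90000 }, { 0xa111, 400 },
	};
	size_t i, n = sizeof(list) / sizeof(list[0]);

	qsort(list, n, sizeof(list[0]), by_name_then_size);
	for (i = 0; i < n; i++)
		printf("%#x %lu\n", list[i].name_hash, list[i].size);
	return 0;
}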
Re: [PATCH] Put sha1dc on a diet
On Wed, Mar 1, 2017 at 10:42 AM, Junio C Hamano wrote: > I see //c99 comments sha1dc is already full of // style comments. I just followed the existing practice. > and also T array[] = { [58] = val } both of > which I think we stay away from (and the former is from the initial > import), so some people on other platforms MAY have trouble with > this topic. Hmm. The "{ [58] = val; }" kind of initialization would be easy to work around by just filling in everything else with NULL, but it would make for a pretty nasty readability issue. That said, if you mis-count the NULL's, the end result will pretty immediately SIGSEGV, so I guess it wouldn't be much of a maintenance problem. But if you're just willing to take the "let's see" approach, I think the explicitly numbered initializer is much better. The main people who I assume would really want to use the sha1dc library are hosting places. And they won't be using crazy compilers from the last century. That said, I think that it would be lovely to just default to USE_SHA1DC and just put the whole attack behind us. Yes, it's slower. No, it doesn't really seem to matter that much in practice. Linus
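For reference, the C99 construct in question, in isolation: a designated initializer pins one entry at an explicit index and leaves everything else zero/NULL, instead of hand-counting a pile of NULLs and risking an off-by-one.

#include <stdio.h>

static const char *with_designator[60] = {
	[58] = "the only interesting entry",	/* entries 0..57 and 59 stay NULL */
};

int main(void)
{
	printf("%s\n", with_designator[58]);
	printf("%p\n", (void *)with_designator[57]);	/* implicitly NULL */
	return 0;
}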
Re: Delta compression not so effective
On Wed, Mar 1, 2017 at 5:51 AM, Marius Storm-Olsen wrote: > > When first importing, I disabled gc to avoid any repacking until completed. > When done importing, there was 209GB of all loose objects (~670k files). > With the hopes of quick consolidation, I did a > git -c gc.autoDetach=0 -c gc.reflogExpire=0 \ > -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 \ > -c gc.rerereunresolved=0 -c gc.pruneExpire=now \ > gc --prune > which brought it down to 206GB in a single pack. I then ran > git repack -a -d -F --window=350 --depth=250 > which took it down to 203GB, where I'm at right now. Considering that it was 209GB in loose objects, I don't think it delta-packed the big objects at all. I wonder if the big objects end up hitting some size limit that causes the delta creation to fail. For example, we have that HASH_LIMIT that limits how many hashes we'll create for the same hash bucket, because there's some quadratic behavior in the delta algorithm. It triggered with things like big files that have lots of repeated content. We also have various memory limits, in particular 'window_memory_limit'. That one should default to 0, but maybe you limited it at some point in a config file and forgot about it? Linus
[PATCH] Put sha1dc on a diet
From: Linus Torvalds <torva...@linux-foundation.org> Date: Tue, 28 Feb 2017 16:12:32 -0800 Subject: [PATCH] Put sha1dc on a diet This removes the unnecessary parts of the sha1dc code, shrinking things from [torvalds@i7 git]$ size sha1dc/*.o textdata bss dec hex filename 277559 640 0 278199 43eb7 sha1dc/sha1.o 4438 11352 0 157903dae sha1dc/ubc_check.o to [torvalds@i7 git]$ size sha1dc/*.o textdata bss dec hex filename 13287 0 0 1328733e7 sha1dc/sha1.o 4438 11352 0 157903dae sha1dc/ubc_check.o so the sha1.o text size shrinks from about 271kB to about 13kB. The shrinking comes mainly from only generating the recompressio functions for the two rounds that are actually used (58 and 65), but also from removing a couple of other unused functions. The sha1dc library lost its "safe_hash" parameter to do that, since we check - and refuse to touch - the colliding cases manually. The git binary itself is about 2MB of text on my system. For other helper binaries the size reduction is even more noticeable. A quarter MB here and a quarter MB there, and suddenly you have a big binary ;) This has been tested with the bad pdf image: [torvalds@i7 git]$ ./t/helper/test-sha1 < ~/Downloads/bad.pdf fatal: The SHA1 computation detected evidence of a collision attack; refusing to process the contents. although we obviously still don't have an actual git object to test with. The normal git test-suite obviously also passes. Signed-off-by: Linus Torvalds <torva...@linux-foundation.org> --- I notice that sha1dc is in the 'pu' branch now, so let's put my money where my mouth is, and send in the sha1dc diet patch. sha1dc/sha1.c | 356 +++--- sha1dc/sha1.h | 24 2 files changed, 18 insertions(+), 362 deletions(-) diff --git a/sha1dc/sha1.c b/sha1dc/sha1.c index 6569b403e..4910f0c35 100644 --- a/sha1dc/sha1.c +++ b/sha1dc/sha1.c @@ -9,6 +9,15 @@ #include "sha1dc/sha1.h" #include "sha1dc/ubc_check.h" +// function type for sha1_recompression_step_T (uint32_t ihvin[5], uint32_t ihvout[5], const uint32_t me2[80], const uint32_t state[5]) +// where 0 <= T < 80 +// me2 is an expanded message (the expansion of an original message block XOR'ed with a disturbance vector's message block difference) +// state is the internal state (a,b,c,d,e) before step T of the SHA-1 compression function while processing the original message block +// the function will return: +// ihvin: the reconstructed input chaining value +// ihvout: the reconstructed output chaining value +typedef void(*sha1_recompression_type)(uint32_t*, uint32_t*, const uint32_t*, const uint32_t*); + #define rotate_right(x,n) (((x)>>(n))|((x)<<(32-(n #define rotate_left(x,n) (((x)<<(n))|((x)>>(32-(n @@ -39,212 +48,14 @@ -void sha1_message_expansion(uint32_t W[80]) +static void sha1_message_expansion(uint32_t W[80]) { unsigned i; for (i = 16; i < 80; ++i) W[i] = rotate_left(W[i - 3] ^ W[i - 8] ^ W[i - 14] ^ W[i - 16], 1); } -void sha1_compression(uint32_t ihv[5], const uint32_t m[16]) -{ - uint32_t W[80]; - uint32_t a, b, c, d, e; - unsigned i; - - memcpy(W, m, 16 * 4); - for (i = 16; i < 80; ++i) - W[i] = rotate_left(W[i - 3] ^ W[i - 8] ^ W[i - 14] ^ W[i - 16], 1); - - a = ihv[0]; - b = ihv[1]; - c = ihv[2]; - d = ihv[3]; - e = ihv[4]; - - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(a, b, c, d, e, W, 0); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(e, a, b, c, d, W, 1); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(d, e, a, b, c, W, 2); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(c, d, e, a, b, W, 3); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(b, c, d, e, a, W, 4); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(a, b, c, d, 
e, W, 5); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(e, a, b, c, d, W, 6); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(d, e, a, b, c, W, 7); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(c, d, e, a, b, W, 8); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(b, c, d, e, a, W, 9); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(a, b, c, d, e, W, 10); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(e, a, b, c, d, W, 11); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(d, e, a, b, c, W, 12); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(c, d, e, a, b, W, 13); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(b, c, d, e, a, W, 14); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(a, b, c, d, e, W, 15); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(e, a, b, c, d, W, 16); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(d, e, a, b, c, W, 17); - HASHCLASH_SHA1COMPRESS_ROUND1_STEP(c, d,
Re: Typesafer git hash patch
On Tue, Feb 28, 2017 at 12:19 PM, brian m. carlson wrote:
>
> The bigger issue is the assumptions in the code base that assume a given
> hash size.

Absolutely. And I think those are going to be the "real" patches.

I actually saw your status report about "After another 27 commits, I've
got it down from 1244 to 1119." when I had just started, and I applauded
the approach, but the numbers made me go "ugh, that sounds painful".

What I actually wanted to do was to write a coccinelle script to automate
it, but it wasn't entirely obvious, and I started out just doing it with
"sed" and hand-fixing, and it got tedious, but by then I had done X% and
just kept going.

I think an automated script that also guarantees that no code generation
could possibly change would have been lovely. We do that over the kernel
occasionally, where those kinds of patches that are four thousand
insertions/deletions in size are called "last Tuesday".

So my approach was really just brute-force and stupid, I admit it. I'm
not proud of the patch either. It might be one of those "bandaid removal
moments" - get it over and done with and just rip that thing off ;)

So the patch is

 216 files changed, 4174 insertions(+), 4080 deletions(-)

and about 27k lines and 950kB total size. The list limits are apparently
much lower than that, but I'll happily send it to people in private.

                Linus
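For illustration, a minimal sketch of the kind of mechanical change being
discussed above: wrapping the raw hash bytes in a struct so the size is
spelled in exactly one place. All names here (struct object_hash,
HASH_RAWSZ, hasheq_old/new) are hypothetical, not git's actual
identifiers.

  /* Hypothetical names throughout; this only shows the shape of the
   * conversion, not git's real types. */
  #include <string.h>

  #define HASH_RAWSZ 20	/* SHA-1 today; the one knob to turn later */

  struct object_hash {
  	unsigned char hash[HASH_RAWSZ];
  };

  /* Before: bare arrays, with the literal 20 repeated at every site. */
  static int hasheq_old(const unsigned char *a, const unsigned char *b)
  {
  	return !memcmp(a, b, 20);
  }

  /* After: the type carries the size, so call sites stop caring. */
  static int hasheq_new(const struct object_hash *a, const struct object_hash *b)
  {
  	return !memcmp(a->hash, b->hash, sizeof(a->hash));
  }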
Re: SHA1 collisions found
On Tue, Feb 28, 2017 at 2:50 PM, Marc Stevens wrote:
>
> Because we only have 32 disturbance vectors to check, we have DVMASKSIZE
> equal to 1 and maski always 0.
> In the more general case when we add disturbance vectors this will not
> remain the case.

Ok, I didn't get why that happened, but it makes sense to me now.

> Of course for dedicated code this can be simplified, and some parts
> could be further optimized.

So I'd be worried about changing your tested code too much, since the
only test-cases we have are the two pdf files. If we screw up too much,
those will no longer show as collisions, but we could get tons of false
positives that we wouldn't see, so..

I'm wondering whether, since the disturbance vector cases themselves are
fairly limited in number, the code that generates this could be changed
to just generate the static calls rather than the loop over the
sha1_dvs[] array.

IOW, instead of generating this:

	for (i = 0; sha1_dvs[i].dvType != 0; ++i) {
		.. use sha1_dvs[i] values as indexes into function arrays etc..
	}

maybe you could just make the generator generate that loop statically,
and have 32 function calls with the masks as constant arguments.

.. together with only generating the SHA1_RECOMPRESS() functions for the
cases that are actually used.

So it would still be entirely generated, but it would generate a little
bit more explicit code.

Of course, we could just edit out all the SHA1_RECOMPRESS(x) cases by
hand, and only leave the two that are actually used. As it is, the
lib/sha1.c code generates about 250kB of code that is never used if I
read the code correctly (that's just the sha1.c code - entirely ignoring
all the avx2 etc versions that I haven't looked at, and that I don't
think git would use).

> Regarding the recompression functions, the ones needed are given in the
> sha1_dvs table,
> but also via preprocessor defines that are used to actually only store
> needed internal states:
>   #define DOSTORESTATE58
>   #define DOSTORESTATE65

Yeah, I guess we could use those #define's to cull the code
"automatically". But I think you could do it at the generation phase
easily too, so that we don't then introduce unnecessary differences when
we try to get rid of the extra fat ;)

> Note that as each disturbance vector has its own unique message differences
> (leading to different values for ctx->m2), our code does not loop over
> just 2 items.
> It loops over 32 distinct computations which have either of the 2
> starting points.

Yes, I already had to clarify my badly expressed writing on the git
list. My concerns were about the (very much not obvious) limited values
in the dvs array.

So the code superficially *looks* like it uses all those functions you
generate (and the maski value _looked_ like it was interesting), but when
looking closer it turns out that there are just two different function
calls that it loops over (but it loops over them multiple times, I
agree).

                Linus
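To make the suggestion concrete, here is a rough sketch of the two shapes
of generated code being compared: a data-driven loop over a disturbance
vector table versus generator-emitted explicit calls with constant masks.
The names (struct dv_info, recompress_58/65, check_loop/check_unrolled)
and the mask values are invented for illustration and do not match
sha1dc's real identifiers.

  /* Invented names and placeholder mask values; only the *shape* of the
   * generated code matters here. */
  #include <stdint.h>
  #include <stddef.h>

  typedef void (*recompress_fn)(uint32_t dvmask,
  			      const uint32_t me2[80], const uint32_t state[5]);

  static void recompress_58(uint32_t dvmask, const uint32_t me2[80], const uint32_t state[5])
  { (void)dvmask; (void)me2; (void)state; /* recompress from step 58 */ }

  static void recompress_65(uint32_t dvmask, const uint32_t me2[80], const uint32_t state[5])
  { (void)dvmask; (void)me2; (void)state; /* recompress from step 65 */ }

  struct dv_info {
  	int dv_type;			/* 0 terminates the table */
  	uint32_t dvmask;		/* unavoidable-bit mask for this DV */
  	recompress_fn recompress;	/* which step to recompress from */
  };

  /* Today: data-driven.  Walk the table and call through function
   * pointers, so every generated recompression function must exist. */
  static void check_loop(const struct dv_info *dvs,
  		       const uint32_t me2[80], const uint32_t state[5])
  {
  	size_t i;
  	for (i = 0; dvs[i].dv_type != 0; ++i)
  		dvs[i].recompress(dvs[i].dvmask, me2, state);
  }

  /* Suggested: the generator emits one explicit call per disturbance
   * vector, masks as literal constants, so only the two recompression
   * functions actually referenced (steps 58 and 65) ever get built. */
  static void check_unrolled(const uint32_t me2[80], const uint32_t state[5])
  {
  	recompress_58(0x00000001u, me2, state);	/* DV #1 */
  	recompress_58(0x00000002u, me2, state);	/* DV #2 */
  	/* ... 32 calls in total ... */
  	recompress_65(0x80000000u, me2, state);	/* DV #32 */
  }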