Re: What's cooking in git.git (Nov 2018, #06; Wed, 21)

2018-11-22 Thread Linus Torvalds
On Wed, Nov 21, 2018 at 6:08 PM Stephen P. Smith  wrote:
>
> Picking up someone's stalled patch is one thing, picking up Linus' patch is in
> a different league.

No, I think it works the other way - my random patches are likely the
_least_ important ones, simply because I can carry them around in my
own tree anyway, and my workflows may be odd by most git standards.

If I'm the only user of the "human" date format, then Junio is
absolutely right to not move it to the main branch.

I have some other private patches in my tree that have been mentioned
on the list but that nobody else got excited about:

 - Add 'human' date mode

 - Make 'git gc --prune=now' mean 'prune after start of command'

 - ls-remote: add "--diff" option to show only refs that differ

none of which are in the least important, but that I find personally useful.

  Linus


Re: Hash algorithm analysis

2018-09-18 Thread Linus Torvalds
On Tue, Sep 18, 2018 at 8:18 AM Joan Daemen  wrote:
>
> 3) The relatively large state in the sponge construction increases the 
> generic strength against attacks when the input contains redundancy or
> has a certain form. For instance, if the input is restricted to be text in 
> ASCII (such as source code), then the collision-resistance grows
> higher than the nominal 2^{c/2}. Such an effect does not exist with 
> narrow-pipe Merkle-Damgård. (This may be what Linus had intuitively in mind.)

Answering to just this part:

No, what I had in mind was literally just exactly the kind of attack
that SHA1 was broken by - attacking the internal state vector directly,
and not paying any penalty for it, because the state size is the same
as the final hash size.

The length extension attack is just the simplest and most trivial
version of that kind of attack - because the internal state vector
*is* the result, and you just continue using it.

But that trivial length extension thing is not the real problem, it's
just the absolutely simplest symptom of the real problem.

I think that the model where the internal state of the hash is the
same width as the final result is simply broken. It was what broke
SHA1, and that problem is shared with SHA2.

"Length extension" is just the simplest way to say "broken by design", imho.

Because the length extension attack is just the most trivial attack,
but it isn't the fundamental problem. It was just the first and the
cheapest attack found, but it was also the most special-cased and
least interesting. You need to have a very special case (with that
secret at the beginning etc) to make the pure length extension attack
interesting. And git has no secrets, so in that sense "length
extension" by itself is totally immaterial. But the basic problem of
internal hash size obviously wasn't.

So I would say that length extension is a direct result of the _real_
problem, which is that the hash exposes _all_ of the internal data.

That is what makes length extension possible - because you can just
continue from a known state, and there is absolutely nothing hidden -
and yes, that's a really easy special case where you don't even need
to actually break the hash at all.
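
To make that concrete, here is a toy illustration of the property being
described (a made-up compression function, not SHA-1 or SHA-2, and the
padding/length block that real Merkle-Damgård hashes append is left
out): because the digest *is* the full internal state, anyone holding
just the digest can keep feeding data into it and produce the digest of
"message || extension" without ever seeing the original message.

/*
 * Toy narrow-pipe hash: internal state == output size, fully exposed.
 * Purely illustrative; the compression step is invented for the sketch.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint64_t toy_compress(uint64_t state, uint8_t byte)
{
	/* stand-in for a real compression function */
	state ^= byte;
	state *= 0x100000001b3ULL;
	return state;
}

static uint64_t toy_hash(uint64_t iv, const uint8_t *msg, size_t len)
{
	uint64_t state = iv;
	size_t i;

	for (i = 0; i < len; i++)
		state = toy_compress(state, msg[i]);
	return state;	/* the digest is the whole internal state */
}

int main(void)
{
	const uint8_t msg[] = "data the extender never sees";
	const uint8_t ext[] = " plus appended data";
	uint8_t full[sizeof(msg) - 1 + sizeof(ext) - 1];
	uint64_t digest, extended, direct;

	digest = toy_hash(0xcbf29ce484222325ULL, msg, sizeof(msg) - 1);

	/* attacker: continue hashing from the published digest alone */
	extended = toy_hash(digest, ext, sizeof(ext) - 1);

	/* verifier: hash the concatenation the normal way - it matches */
	memcpy(full, msg, sizeof(msg) - 1);
	memcpy(full + sizeof(msg) - 1, ext, sizeof(ext) - 1);
	direct = toy_hash(0xcbf29ce484222325ULL, full, sizeof(full));

	printf("extended %016llx direct %016llx\n",
	       (unsigned long long)extended, (unsigned long long)direct);
	return extended != direct;
}

A wide-pipe or sponge design breaks this shortcut by keeping more
internal state than it ever publishes, which is the point being made
here about SHA2 versus SHA3/K12.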

But I argue that it's _also_ one big part of what made SHAttered
practical, and I think the underlying problem is exactly the same.
When the internal state is the same size as the hash, you can attack
the internal state itself for basically the same cost as attacking the
whole hash.

So you can pick-and-choose the weakest point.

Which is basically exactly what SHAttered did. No, it wasn't the
trivial "just add to the end", but it used the exact same underlying
weakness as one part of the attack.

*This* is why I dislike SHA2. It has basically the exact same basic
weakness that we already know SHA1 fell for. The hashing details are
different, and hopefully that means that there aren't the same kind of
patterns that can be generated to do the "attack the internal hash
state" part, but I don't understand why people seem to ignore that
other fundamental issue.

Something like SHA-512/256 would have been better, but I think almost
nobody does that in hardware, which was one of the big advantages of
plain SHA2.

The main reason I think SHA2 is acceptable is simply that 256 bits is
a lot. So even if somebody comes up with a shortcut that weakens it by
tens of bits, nobody really cares. Plus I'm obviously not a
cryptographer, so I didn't feel like I was going to fight it a lot.

But yes, I'd have probably gone with any of the other alternatives,
because I think it's a bit silly that we're switching hashes to
another hash that has (at least in part) the *exact* same issue as the
one people call broken.

(And yes, the hashing details are different, so it's "exactly the
same" only wrt that internal state part - not the bitpattern finding
part that made the attack on the internal state much cheaper. Real
cryptographers obviously found that "figure out the weakness of the
hashing" to be the more interesting and novel part over the trivial
internal hash size part).

That said..

The real reason I think SHA2 is the right choice was simply that there
needs to be a decision, and none of the choices were *wrong*.
Sometimes just the _act_ of making a decision is more important than
_what_ the decision is.

And hey, it is also likely that the reason _I_ get hung up on just the
size of the internal state is exactly that, because I am _not_ a
cryptographer, that kind of high-level stuff is the part I understand.
When you start talking about why the exact rules of Merkle–Damgård
constructions work, my eyes just glaze over.

So I'm probably - no, certainly - myopic and looking at only one part
of the issue to begin with.

The end result is that I argued for more bits in the internal state
(and apparently wide vs narrow is the technical term), and I would
have seen parallel algorithms as a bonus for the large-file case. None
of which argued for SHA2.

But 

Re: "less -F" is broken

2018-08-16 Thread Linus Torvalds
On Thu, Aug 16, 2018 at 9:50 AM Mark Nudelman
 wrote:
>
> So I'm not sure what the best solution is.  Linus's proposal to disable
> the line counting stuff if -X is set seems reasonable.  I will look into
> that and see if there are any issues with it.

One option that I didn't try to go for - because I just don't know the
less code base well enough - is to basically make the behavior of '-F'
be something like this:

 - as long as all the lines are short and well-behaved, and we haven't
seen enough lines to fill the screen, act like 'cat' and just feed
them through

 - when you fill the screen (or when you hit some other condition that
makes you go "now I won't exit" - that could be a long line, but maybe
it could also be the user giving keyboard input for a less command?)
you send the init sequence and just redraw the whole screen.

That sounds like the best of both worlds.
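
For illustration only (this is not less source code, and the screen
height is just hard-wired), a minimal sketch of that "act like cat
until the screen fills, then take over" behavior might look like:

#include <stdio.h>

#define SCREEN_ROWS 24	/* assume a fixed terminal height for the sketch */

int main(void)
{
	char line[4096];
	int rows = 0, paging = 0;

	while (fgets(line, sizeof(line), stdin)) {
		fputs(line, stdout);
		if (paging)
			continue;
		fflush(stdout);	/* pass-through mode: show output immediately */
		if (++rows >= SCREEN_ROWS - 1) {
			/*
			 * The screen is full: this is where a real pager
			 * would emit the terminal init sequence, redraw the
			 * screen, and start handling keyboard commands.
			 */
			fputs("-- switching to pager mode --\n", stderr);
			paging = 1;
		}
	}
	return 0;
}

The interesting part is only the switch-over point; everything before
it behaves exactly like cat, so short output never touches the
terminal state at all.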

In fact, right now "less -F" is in my opinion a bit broken in other
ways that my patch doesn't fix.

Do this:

(echo 1; sleep 10; echo 2) | LESS=FX less

and with my patch it will show "1" immediately. So far so good.

But let's say that that was all the user was interested in, and the
user presses 'q' to quit less. That doesn't work at all - it will wait
for that full ten seconds.

That actually happens even without -F too.

Wouldn't it be good to react to things like searches to highlight
something (and to 'quit' for the 'never mind, already got it' case)
even if there isn't enough data to fill the whole screen yet?

That said, ^C works, and this is not new behavior, so I'm just
throwing this out as a "maybe a different approach would fix _both_
the -F behavior _and_ the above traditional issue"?

Linus


Re: "less -F" is broken

2018-08-15 Thread Linus Torvalds
On Wed, Aug 15, 2018 at 2:29 PM Ævar Arnfjörð Bjarmason
 wrote:
>
> Downloading & trying versions of it locally reveals that it's as of
> version 520, not 530. The last version before 520 is 487. Presumably
> it's covered by this item in the changelog:
>
> Don't output terminal init sequence if using -F and file fits on one
> screen[1]

Side note: that's sad, because we already use X in the default exactly
for that reason.

So apparently "less" was broken for us to fix something that we
already had long solved. The code basically tried to do "automatic X
when F is set".

And all that line_count stuff (which is what breaks) is pointless when
-X is already given.

That does give a possible fix: just stop doing the line_count thing if
no_init is set.

So "-F" would continue to be broken, but "-FX" would work.

Something like the attached patch, perhaps?

Linus
 main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/main.c b/main.c
index 179bd78..961a9db 100644
--- a/main.c
+++ b/main.c
@@ -59,6 +59,7 @@ extern int	missing_cap;
 extern int	know_dumb;
 extern int	pr_type;
 extern int	quit_if_one_screen;
+extern int	no_init;
 
 
 /*
@@ -274,7 +275,7 @@ main(argc, argv)
 	{
 		if (edit_stdin())  /* Edit standard input */
 			quit(QUIT_ERROR);
-		if (quit_if_one_screen)
+		if (quit_if_one_screen && !no_init)
 			line_count = get_line_count();
 	} else 
 	{


Re: "less -F" is broken

2018-08-15 Thread Linus Torvalds
On Wed, Aug 15, 2018 at 2:29 PM Ævar Arnfjörð Bjarmason
 wrote:
>
> FWIW they're not linked from
> http://www.greenwoodsoftware.com/less/download.html but you can just URL
> hack and see releases http://www.greenwoodsoftware.com/less/ and change
> links like http://www.greenwoodsoftware.com/less/less-530.tar.gz to
> http://www.greenwoodsoftware.com/less/less-520.tar.gz

I should have just tried that. I just downloaded the ones linked to,
made a git archive of the history, and started bisecting. Which was
all pointless extra work, since it was in the last release, but
whatever.

> > I've reported it as a bug in less, but I'm not sure what the reaction
> > will be, the less releases seem to be very random.
>
> Via bug-l...@gnu.org?

Heh. Another thing I didn't actually find. No, I just emailed Mark
Nudelman directly, because that's what the FAQ says to do:

 "There is a list of known bugs here. If you find a bug that is not in
the list, please send email to the author. Describe the bug in as much
detail as possible, and I'll do what I can to help resolve the
problem."

and it doesn't mention any mailing list.

> Is this report available online somewhere?

It was not all that different from the email to the git list - just
giving the trivial test-case and my (limited) bisection result.

The data you dug up is much more useful.

Linus


"less -F" is broken

2018-08-15 Thread Linus Torvalds
Sadly, as of less-530, the behavior of "less -F" is broken enough that
I think git needs to potentially think about changing the defaults for
the pager, or people should at least be aware of it.

Older versions of less (at least up to less-487 - March 2017) do not
have this bug.  There were apparently 520, 527 and 529 releases in
2017 too, but I couldn't find their sources to verify if they were
already broken - but 530 (February 2018) has the problem.

The breakage is easy to see without git:

(echo "hello"; sleep 5; echo "bye bye") | less -F

which will result in no output at all for five seconds, and then you
get both lines at once as "less" exits.

It's not always obvious when using git, because when the terminal
fills up, less also starts outputting, but the default options with -F
are really horrible if you are looking for something uncommon, and
"git log" doesn't respond at all.

On the kernel tree, this is easy to see with something like

   git log --oneline --grep="The most important one is the mpt3sas fix"

which takes a bit over 7 seconds before it shows the commit I was looking for.

In contrast, if you do

   LESS=-RX git log --oneline --grep="The most important one is the mpt3sas fix"

that (recent) commit is found and shown immediately. It still takes 7s
for git to go through all history and decide "that was it", but at
least you don't need to wait for the intermediate results.

I've reported it as a bug in less, but I'm not sure what the reaction
will be, the less releases seem to be very random.

 Linus


Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash

2018-08-03 Thread Linus Torvalds
On Fri, Aug 3, 2018 at 9:40 AM Junio C Hamano  wrote:
>
> > [...]
> >> -  - 20-byte NewHash checksum of all of the above.
> >> +  - 20-byte SHA-256 checksum of all of the above.
> >
> > Likewise.
>
> Hmph, I've always assumed since NewHash plan was written that this
> part was not about tamper resistance but was about bit-flipping
> detection and it was deliberate to stick to 20-byte sum, truncating
> as necessary.

Yeah, in fact, since this was one area where people actually had
performance issues with the hash, it might be worth considering
_weakening_ the hash.

Things like the pack index files (and just the regular file index) are
entirely local, and the SHA1 in those is really just a fancy CRC. It's
really just "good protection against disk corruption" (it happens),
and in fact you cannot use it as protection against active tampering,
since there's no secret there and any active attacker that has access
to your local filesystem could just recompute the hash anyway.

And because they are local anyway and aren't really transported
(modulo "shared repositories" where you use them across users or
legacy rsync-like synchronization), they can be handled separately
from any hashing changes. The pack and index file formats have in fact
been changed before.

It might make sense to either keep it as SHA1 (just to minimize any
changes) or if there are still issues with index file performance it
could even be made to use something fast-but-not-cryptographic like
crc32().
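
As a purely illustrative sketch of that "fancy CRC" role (this is not
git's pack or index code - just a trailing-checksum layout in the same
spirit, using zlib's crc32()):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Append a 4-byte big-endian CRC32 trailer covering everything before it. */
static size_t add_trailer(unsigned char *buf, size_t len)
{
	uLong crc = crc32(crc32(0L, Z_NULL, 0), buf, len);

	buf[len]     = (crc >> 24) & 0xff;
	buf[len + 1] = (crc >> 16) & 0xff;
	buf[len + 2] = (crc >> 8) & 0xff;
	buf[len + 3] = crc & 0xff;
	return len + 4;
}

/* Returns 1 if the data still matches its trailer, 0 if it was corrupted. */
static int check_trailer(const unsigned char *buf, size_t total)
{
	uLong crc, stored;

	if (total < 4)
		return 0;
	crc = crc32(crc32(0L, Z_NULL, 0), buf, total - 4);
	stored = ((uLong)buf[total - 4] << 24) | ((uLong)buf[total - 3] << 16) |
		 ((uLong)buf[total - 2] << 8) | (uLong)buf[total - 1];
	return crc == stored;
}

int main(void)
{
	unsigned char file[64] = "pretend this is local pack/index data";
	size_t total = add_trailer(file, strlen((char *)file));

	printf("intact: %d\n", check_trailer(file, total));
	file[3] ^= 1;	/* simulate a flipped bit on disk */
	printf("after bit flip: %d\n", check_trailer(file, total));
	return 0;
}

Note that this detects corruption but offers no tamper resistance:
whoever can rewrite the data can recompute the trailer, which is
exactly the situation with the local pack and index checksums.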

Remember: one of the original core git design requirements was
"corruption detection".

For normal hashed objects, that came naturally, and the normal object
store additionally has active tamper protection thanks to the
interconnected nature of the hashes and the distribution of the
objects.

But for things like packfiles and the file index, it is just a
separate checksum. There is no tamper protection anyway, since anybody
who can modify them directly can just recompute the hash checksum.

The fact that both of these things used SHA1 was more of a convenience
issue than anything else, and the pack/index file checksum is
fundamentally not cryptographic (but a cryptographic hash is obviously
by definition also a very good corruption detector).

   Linus


Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash

2018-08-03 Thread Linus Torvalds
On Fri, Aug 3, 2018 at 12:20 AM Jonathan Nieder  wrote:
>
>
> Here I'd want to put a pile of acks, e.g.:
>
>  Acked-by: Linus Torvalds 
> ..

Sure, feel free to add my Ack for this.

   Linus


Re: [RFC PATCH v2] Add 'human' date format

2018-07-24 Thread Linus Torvalds
On Tue, Jul 24, 2018 at 2:49 PM Junio C Hamano  wrote:
>
> But lack of TZ does not give me enough hint about which content it
> happened.  The fact that this was done late at night on weekend is
> indeed interesting, and I may not care what time it locally was for
> me, so perhaps this is an intended behaviour.

I'm not sure you could call it "intended". The TZ hiding did change as
I was playing with this.

The first version of the patch only hid the timezone if it matched the
current one.

Because, as you say, the time zone can be interesting not so much
because you care about *when* the commit happened, but you care about
*where* the commit happened. And that can be true even if the commit
is very old.

So the timezone data in some sense isn't necessarily about the date at all.

When I used it a bit more (I still have the "--date=auto" as my
default for the kernel), I decided I really don't much care about the
timezone. In any _individual_ case, the timezone looks fine, but when
you look at many different dates, it looks really odd how it sometimes
shows, and sometimes does not, particularly for old dates when it
really doesn't matter for the *time* reason.

So I decided that it's better to not show the timezone at all when you
show a real date.

But honestly, I don't claim to have a really strong argument. It's
just a choice. Nothing says it's the absolute right choice.

I pointed out that you can use "--date=human-local" to get an even
denser representation that gives you the human date without ever
having a TZ. But we don't have the reverse of "-local", which would
explicitly show the timezones.

Again, I think this is really because the timezone is about something
other than just the time. I think the "do we care *where* it was done
or not?" question in many ways is entirely independent of the time
question.

So right now the patch says

hide.tz |= !hide.date;

which ends up being good for the "times are roughly the same size"
(which I decided was a good thing - again, I don't really have a
hugely strong argument for it, it was a matter of me playing with
options).

But it would make equally much sense to say

hide.tz |= hide.time;

and just say that the timezone is hidden if it matches the current
one, or if the commit is just so old that we don't show the time at
all.

OR you could just say "timezone is always interesting, because you
want to know _where_ it was done regardless of _when_ it was done",
and just not hide the timezone at all.

I think all are "technically valid" choices to make. The one I made
was just a random personal preference, not necessarily the right one.

Could we extend on the "decorations" (like the "-local" thing)?
Absolutely.  I'm not sure it's worth doing, but it would certainly
solve the "different people have different preferences" issue.

I think that *if* we want to extend on the decorations, that would
probably still be a separate patch from the basic patch.

   Linus


Re: Hash algorithm analysis

2018-07-24 Thread Linus Torvalds
On Tue, Jul 24, 2018 at 12:01 PM Edward Thomson
 wrote:
>
> Switching gears, if I look at this from the perspective of the libgit2
> project, I would also prefer SHA-256 or SHA3 over blake2b.  To support
> blake2b, we'd have to include - and support - that code ourselves.  But
> to support SHA-256, we would simply use the system's crypto libraries
> that we already take a dependency on (OpenSSL, mbedTLS, CryptoNG, or
> SecureTransport).

I think this is probably the single strongest argument for sha256.
"It's just there".

The hardware acceleration hasn't become nearly as ubiquitous as I
would have hoped, and honestly, sha256 _needs_ hw acceleration more
than some of the alternatives in the first place.

But sha256 does have the big advantage of just having been around and
existing in pretty much every single crypto library.

So I'm not a huge fan of sha256, partly because of my disappointment
in the lack of hw acceleration in relevant markets (sure, it's fairly
common in ARM, but nobody sane uses ARM for development because of
_other_ reasons). And partly because I don't like how the internal
data size is the same as the final hash. But that second issue is an
annoyance with it, not a real issue - in the absence of weaknesses
it's a non-issue, and any future weaknesses might affect any other
choice too.

So hey, if people are actually at the point where the lack of choice
holds up development, we should just pick one. And despite what I've
said in this discussion, sha256 would have been my first choice, just
because it's the "obvious" choice. The exact same way that SHA1 was
the obvious choice (for pretty much the same infrastructure reasons)
back in 2005.

And maybe the hw acceleration landscape will actually improve. I think
AMD actually does do the SHA extensions in Zen/TR.

So I think Junio should just pick one. And I'll stand up and say

 "Let's just pick one.

  And sha256 is certainly the safe choice in that it won't strike
  anybody as being the _wrong_ choice per se, even if not everybody will
  necessarily agree it's the _best_ choice".

but in the end I think Junio should be the final arbiter. I think all
of the discussed choices are perfectly fine in practice.

Btw, the one thing I *would* suggest is that the git community just
also says that the current hash is not SHA1, but SHA1DC.

Support for "plain" SHA1 should be removed entirely. If we add a lot
of new infrastructure to support a new more secure hash, we should not
have the old fallback for the known-weak one. Just make SHA1DC the
only one git can be built with.

   Linus


Re: Hash algorithm analysis

2018-07-23 Thread Linus Torvalds
On Mon, Jul 23, 2018 at 5:48 AM Sitaram Chamarty  wrote:
>
> I would suggest (a) hash size of 256 bits and (b) choice of any hash
> function that can produce such a hash.  If people feel strongly that 256
> bits may also turn out to be too small (really?) then a choice of 256 or
> 512, but not arbitrary sizes.

Honestly, what's the expected point of 512-bit hashes?

The _only_ point of a 512-bit hash is that it's going to grow objects
in incompressible ways, and use more memory. Just don't do it.

If somebody can break a 256-bit hash, you have two choices:

 (a) the hash function itself was broken, and 512 bits isn't the
solution to it anyway, even if it can certainly hide the problem

 (b) you had some "new math" kind of unexpected breakthrough, which
means that 512 bits might not be much better either.

Honestly, the number of particles in the observable universe is on the
order of 2**256. It's a really really big number.

Don't make the code base more complex than it needs to be. Make an
informed technical decision, and say "256 bits is a *lot*".

The difference between engineering and theory is that engineering
makes trade-offs. Good software is well *engineered*, not theorized.

Also, I would suggest that git default to "abbrev-commit=40", so that
nobody actually *sees* the new bits by default. So the perl scripts
etc that use "[0-9a-f]{40}" as a hash pattern would just silently
continue to work.

Because backwards compatibility is important (*)

 Linus

(*) And 2**160 is still a big big number, and hasn't really been a
practical problem, and SHA1DC is likely a good hash for the next
decade or longer.


Re: Hash algorithm analysis

2018-07-21 Thread Linus Torvalds
On Sat, Jul 21, 2018 at 3:39 PM Johannes Schindelin
 wrote:
>
> Do you really want to value contributors' opinion more than
> cryptographers'? I mean, that's exactly what got us into this hard-coded
> SHA-1 mess in the first place.

Don't be silly.

Other real cryptographers consider SHA256 to be a problem.

Really. It's amenable to the same hack on the internal hash that made
for the SHAttered break.

So your argument that "cryptographers prefer SHA256" is simply not true.

Your real argument is that you know at least one cryptographer you
work with who prefers it.

Don't try to make that into some generic "cryptographers prefer it".
It's not like cryptographers have issues with blake2b either, afaik.

And blake2b really _does_ have very real advantages.

If you can actually point to some "a large percentage of
cryptographers prefer it", you'd have a point.

But as it is, you don't have data, you have an anecdote, and you try
to use that anecdote to put down other people's opinions.

Intellectually dishonest, that is.

   Linus


Re: [RFC PATCH v2] Add 'human' date format

2018-07-11 Thread Linus Torvalds
On Wed, Jul 11, 2018 at 2:24 PM Ævar Arnfjörð Bjarmason
 wrote:
>
> I think that's true for the likes of linux.git & git.git, but a lot of
> users of git say work in some corporate setting entirely or mostly in
> the same timezone.
>
> In that case, knowing if some commit whose sole message was "fix"[1] was
> made at 3am or in the afternoon, even if it's really old, is really
> useful information, even years later.

Heh.

Maybe. But if you care about that kind of information, would you
actually want to use the "human" date? Wouldn't you want to use the
strftime thing instead, which gets you whatever field you care about,
and gets it consistently regardless of how old the data is?

That said, I do acknowledge that the "human" format may be a bit
inflexible and ad-hoc. Of course some more generic way that allowed
arbitrary rules might be better for some uses.

I'll just explain the cases that made me zero in on what that last patch did:

 (a) I do like the "relative" date for recent stuff.

Quite often, I look at how recent the commits are, for example, and
then I really like seeing "2 hours ago" rather than a time with a
timezone (which is a lot harder for me to mentally parse)

This was the primary impetus for my original "auto" patch many years
ago, that was (rightly) not merged. It really boiled down to just
"default or relative, depending on how recent it was".

 (b) I noticed that I was distracted by dates that were *too* terse.

My first patch had _just_ the time when it was today and the same
timezone (but older than two hours, so the original relative logic
didn't trigger).

That initially sounded great to me, which is why it was that first time.

But after _using_ it for a while, I actually found that it didn't have
enough context for me (visually) to really trigger my date parsing at
all.

So "five hours ago" actually parsed better than just "9:48" to me. I
didn't think it would do that, but it did. Which was why I changed the
"relative" time to trigger every time if it was the exact same date
(and in the past) - just to avoid the really terse model.

 (c) when I played around with other commands than just "git log", I
also noticed that a consistent length mattered.

Again, my first version was more along the lines of "if it's long ago,
just use the full format, exactly like the default date". It wasn't
*quite* that, because it would always skip the seconds, but it was
close.

And with "git log", that worked fine, because dates were fairly
uniformly increasing, so the date format would slowly get longer, and
that was fine.

But then when I played with "git blame -C --date=human", I noticed
that not only did the human date actually make sense there too, it
actually made it easier for me to read - and that in particular, the
"extra" info was just annoying.

So now I find that shortened "only show the date" format to be really
good _particularly_ for "git blame". You can see very clearly whether
it's something recent or something old.

Maybe my use of "git blame" is unusual, but I don't think so. I tend
to do "git blame -C" when I'm looking for a bug, and then seeing
something like this:

...
  Apr 16 2005   437)
  Apr 16 2005   438)
  Jan 14 2016   439)
  Apr 16 2005   440)
  Apr 16 2005   441)
  Apr 16 2005   442)
  Thu Jun 14 15:26  443)
  Thu Jun 14 15:26  444)
  Thu Jun 14 15:26  445)
  Thu Jun 14 15:26  446)
  Thu Jun 14 15:26  447)
  Thu Jun 14 15:26  448)
  Thu Jun 14 15:26  449)
  Thu Jun 14 15:26  450)
  Apr 16 2005   451)
  Jul 30 2012   452)
  Jul 30 2012   453)
  Feb 13 2012   454)
  Apr 16 2005   455)
  Apr 16 2005   456)


that date field (yeah, that happens to be "kernel/fork.c" in the
current kernel - I just edited out everything other than time and
line number) makes it visually very easy to see which parts are old
and which ones are recent, because the format changes pretty
clearly and unambiguously, without changing the size of that field
_dramatically_.

(Sure, the size changes, but it's not a radical difference, it's a
fairly small variation, and the variation only highlights the
different time range, without making it completely unbalanced).

Anyway, enough excuses. I'm just trying to explain some of the things
that I noticed simply _while_ making some of the decisions I made.

Are they the "right" decisions? I don't know. But I've been running with that

git config --add log.date auto

in my kernel repo since I posted the patches, and so far I'm still liking it.

 Linus


Re: [RFC PATCH v2] Add 'human' date format

2018-07-11 Thread Linus Torvalds
[ Trying to come up with crazy special cases ]

On Wed, Jul 11, 2018 at 1:49 PM Linus Torvalds
 wrote:
>
> But it could be anything else invalid, of course. It could be MAX_INT
> or something like that.

That might be better. A timezone of -1 isn't actually a valid
timezone, but I guess you could create a commit by hand that had
"-0001" as the timezone.

You can't do that with something like MAX_INT, without fsck
complaining - since it has to be exactly four digits.

> The clearing of "human_tm" is done for a similar reason: the code does
>
> hide.year = tm->tm_year == human_tm->tm_year;
>
> (and then later just checks "if (human_tm->tm_year)") knowing that a
> non-zero tm_year will only ever happen for human_tz (and that 1900 is
> not a valid git date, even though I guess in theory you could do it).

Actually, the 1900 should be safe, because 'timestamp_t' is unsigned.
So a valid timestamp really can't be before 1970.

Of course, you can probably try to mess with it by giving values that
don't actually fit, because sometimes we do convert mindlessly from
'timestamp_t' to 'time_t'. In particular, if you use the
"default-local" time, it will use that

  static struct tm *time_to_tm_local(timestamp_t time)
  {
	time_t t = time;
	return localtime(&t);
  }

and not check the range of the timestamp.

But other proper time stamp functions will actually do range checking
with "date_overflow()", so in general that whole assumption of "a real
git date cannot be in the year 1900" is valid.

  Linus


Re: [RFC PATCH v2] Add 'human' date format

2018-07-11 Thread Linus Torvalds
On Wed, Jul 11, 2018 at 1:34 PM Andrei Rybak  wrote:
>
> > + int human_tz = -1;
>
> Is -1 an OK initial value for timezone if local_time_tzoffset returns
> negative values as well? It looks like it doesn't matter for from functional

The value was intentionally picked to *not* be a valid timezone value,
so that the comparison of "human_tz == tz" would always fail if
DATE_HUMAN is not selected.

But it could be anything else invalid, of course. It could be MAX_INT
or something like that.

By picking something that isn't possibly a real timezone value, later
code can do things like

hide.tz = local || tz == human_tz;

without worrying about whether it's really DATE_HUMAN or not.

The clearing of "human_tm" is done for a similar reason: the code does

hide.year = tm->tm_year == human_tm->tm_year;

(and then later just checks "if (human_tm->tm_year)") knowing that a
non-zero tm_year will only ever happen for human_tz (and that 1900 is
not a valid git date, even though I guess in theory you could do it).

   Linus


[RFC PATCH v2] Add 'human' date format

2018-07-07 Thread Linus Torvalds


From: Linus Torvalds 

This adds --date=human, which skips the timezone if it matches the
current time-zone, and doesn't print the whole date if that matches (ie
skip printing year for dates that are "this year", but also skip the
whole date itself if it's in the last few days and we can just say what
weekday it was).

For really recent dates (same day), use the relative date stamp, while
for old dates (year doesn't match), don't bother with time and timezone.

Also add 'auto' date mode, which defaults to human if we're using the
pager.  So you can do

git config --add log.date auto

and your "git log" commands will show the human-legible format unless
you're scripting things.

Note that this time format still shows the timezone for recent enough
events (but not so recent that they show up as relative dates).  You can
combine it with the "-local" suffix to never show timezones for an even
more simplified view.

Signed-off-by: Linus Torvalds 
---

Slightly updated version after playing with this more. 

This tries to make the length somewhat more consistent (and shorter), 
which came about when looking at this in "git blame" output. 

Once you're talking "last year" patches, you don't tend to care about time 
of day or timezone. So the longest date is basically "Thu Oct 19 16:00", 
because if you show the year (four characters), you don't show the time 
(five characters). And the timezone (five characters) is only shown if not 
showing the date (5-6 characters).

Also, because the relative time is now handled entirely inside the 
show_date_normal() function, I could undo some of the changes to 
show_date() that were updating date->mode and date->local. So the patch 
has actually shrunk a bit, I think.

 builtin/blame.c |   4 ++
 cache.h |   1 +
 date.c  | 130 
 3 files changed, 115 insertions(+), 20 deletions(-)

diff --git a/builtin/blame.c b/builtin/blame.c
index 5a0388aae..7b6235321 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -917,6 +917,10 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
 */
blame_date_width = utf8_strwidth(_("4 years, 11 months ago")) + 1; /* add the null */
break;
+   case DATE_HUMAN:
+   /* If the year is shown, no time is shown */
+   blame_date_width = sizeof("Thu Oct 19 16:00");
+   break;
case DATE_NORMAL:
blame_date_width = sizeof("Thu Oct 19 16:00:04 2006 -0700");
break;
diff --git a/cache.h b/cache.h
index d49092d94..8a6810ee6 100644
--- a/cache.h
+++ b/cache.h
@@ -1428,6 +1428,7 @@ extern struct object *peel_to_type(const char *name, int namelen,
 struct date_mode {
enum date_mode_type {
DATE_NORMAL = 0,
+   DATE_HUMAN,
DATE_RELATIVE,
DATE_SHORT,
DATE_ISO8601,
diff --git a/date.c b/date.c
index 49f943e25..4486c028a 100644
--- a/date.c
+++ b/date.c
@@ -77,22 +77,16 @@ static struct tm *time_to_tm_local(timestamp_t time)
 }
 
 /*
- * What value of "tz" was in effect back then at "time" in the
- * local timezone?
+ * Fill in the localtime 'struct tm' for the supplied time,
+ * and return the local tz.
  */
-static int local_tzoffset(timestamp_t time)
+static int local_time_tzoffset(time_t t, struct tm *tm)
 {
-   time_t t, t_local;
-   struct tm tm;
+   time_t t_local;
int offset, eastwest;
 
-   if (date_overflows(time))
-   die("Timestamp too large for this system: %"PRItime, time);
-
-   t = (time_t)time;
-   localtime_r(&t, &tm);
-   t_local = tm_to_time_t(&tm);
-
+   localtime_r(&t, tm);
+   t_local = tm_to_time_t(tm);
if (t_local == -1)
return 0; /* error; just use +0000 */
if (t_local < t) {
@@ -107,6 +101,20 @@ static int local_tzoffset(timestamp_t time)
return offset * eastwest;
 }
 
+/*
+ * What value of "tz" was in effect back then at "time" in the
+ * local timezone?
+ */
+static int local_tzoffset(timestamp_t time)
+{
+   struct tm tm;
+
+   if (date_overflows(time))
+   die("Timestamp too large for this system: %"PRItime, time);
+
+   return local_time_tzoffset((time_t)time, &tm);
+}
+
 void show_date_relative(timestamp_t time, int tz,
   const struct timeval *now,
   struct strbuf *timebuf)
@@ -191,9 +199,80 @@ struct date_mode *date_mode_from_type(enum date_mode_type type)
return &mode;
 }
 
+static void show_date_normal(struct strbuf *buf, timestamp_t time, struct tm *tm, int tz, struct tm *human_tm, int human_tz, int local)
+{
+   struct {
+   unsigned int year:1,
+   date:1,

Re: [RFC PATCH] Add 'human' date format

2018-07-07 Thread Linus Torvalds
On Sat, Jul 7, 2018 at 12:58 PM Linus Torvalds
 wrote:
>
> I'm playing with making all "today" dates just use the relative
> format.

Here's the incremental patch for that if people want to compare the output.

With this, you never get the "just time" case, because that will turn
into "2 hours ago" or similar. But you will get "Fri 19:45" for
something that happened yesterday.

So examples from my kernel logs look something like this:

  2 hours ago
  Fri 19:45
  Fri 10:44 +1000
  Fri Jun 22 15:46
  Tue Jun 19 15:41 -0600
  Thu Jun 15 12:57 2017 +0300

depending on how long ago they were and whether they were in the same
timezone etc.

  Linus
 date.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/date.c b/date.c
index 9809ac334..de0b03cf4 100644
--- a/date.c
+++ b/date.c
@@ -199,7 +199,7 @@ struct date_mode *date_mode_from_type(enum date_mode_type type)
 	return &mode;
 }
 
-static void show_date_normal(struct strbuf *buf, struct tm *tm, int tz, struct tm *human_tm, int human_tz, int local)
+static void show_date_normal(struct strbuf *buf, timestamp_t time, struct tm *tm, int tz, struct tm *human_tm, int human_tz, int local)
 {
 	struct {
 		unsigned int	year:1,
@@ -225,6 +225,14 @@ static void show_date_normal(struct strbuf *buf, struct tm *tm, int tz, struct t
 		}
 	}
 
+	/* Show "today" times as just relative times */
+	if (hide.wday) {
+		struct timeval now;
+		gettimeofday(&now, NULL);
+		show_date_relative(time, tz, &now, buf);
+		return;
+	}
+
 	/* Always hide seconds for human-readable */
 	hide.seconds = human_tm->tm_year > 0;
 
@@ -268,10 +276,6 @@ const char *show_date(timestamp_t time, int tz, const struct date_mode *mode)
 		/* Fill in the data for "current time" in human_tz and human_tm */
 		human_tz = local_time_tzoffset(now.tv_sec, &human_tm);
 
-		/* Special case: if it's less than an hour ago, use relative time */
-		if (time - now.tv_sec < 60 * 60)
-			type = DATE_RELATIVE;
-
 		/* Don't print timezone if it matches */
 		if (tz == human_tz)
 			local = 1;
@@ -333,7 +337,7 @@ const char *show_date(timestamp_t time, int tz, const struct date_mode *mode)
 		strbuf_addftime(&timebuf, mode->strftime_fmt, tm, tz,
 !local);
 	else
-		show_date_normal(&timebuf, tm, tz, &human_tm, human_tz, local);
+		show_date_normal(&timebuf, time, tm, tz, &human_tm, human_tz, local);
 	return timebuf.buf;
 }
 


Re: [RFC PATCH] Add 'human' date format

2018-07-07 Thread Linus Torvalds
On Sat, Jul 7, 2018 at 12:39 PM Linus Torvalds
 wrote:
> to me, but with "--date=human", right now it just says
>
> Date:   12:21

Side note: this is probably my least favorite of the formats.

I'm playing with making all "today" dates just use the relative
format, and then the "a couple of days ago" dates would then have the

> Date:   Fri 19:45

format.

But since it's _explicitly_ about a "human legible" format, I think
the format could be a bit fluid, and things like that might be tweaked
later. Anybody who would script this would be crazy.

I'm more looking for "no, that's just stupid" comments, or "no, you're
not the only one who has wanted this" kinds of replies.

  Linus


[RFC PATCH] Add 'human' date format

2018-07-07 Thread Linus Torvalds


From: Linus Torvalds 

This adds --date=human, which skips the timezone if it matches the
current time-zone, and doesn't print the whole date if that matches (ie
skip printing year for dates that are "this year", but also skip the
whole date itself if it's in the last few days and we can just say what
weekday it was).

For really recent dates (within the last hour), use the relative date
stamp.

Also add 'auto' date mode, which defaults to human if we're using the
pager.  So you can do

git config --add log.date auto

and your "git log" commands will show the human-legible format unless
you're scripting things.

Signed-off-by: Linus Torvalds 
---

Ok, I tried something like this long long ago, but that was a fairly nasty 
patch that just defaulted to "--date=relative" for recent dates, and then 
did the usual default for anything else.

But the issue kept bugging me, and this is a slightly more sane model (?). 
It keeps the notion of "let's use --date=relative for very recent dates", 
but for anything that is more than an hour old, it just uses a simplified 
date.

For example, for time stamps that are "today", just show the time. And if 
the year or the time zone matches the current year or timezone, skip them.  
And don't bother showing seconds, because humans won't care.

The end result is a slightly simplified view.

So for example, the date for this commit right now looks like

Date:   Sat Jul 7 12:21:26 2018 -0700

to me, but with "--date=human", right now it just says

Date:   12:21

(and maybe for a US locale, I should make it do the AM/PM thing).

This is marked RFC because

 (a) maybe I'm the only one who has ever wanted the simplified dates

 (b) the simplification rules themselves might be worthy of discussion.

For example, I also simplified the "a couple of days ago" case, so that it 
shows

Date:   Fri 19:45

for one of my kernel commits that happened yesterday. The date didn't 
match _exactly_, but it's the same month, and just a few days ago, so it 
just shows the weekday. Is that easier to read? Maybe. I kind of like it.


 builtin/blame.c |   1 +
 cache.h |   1 +
 date.c  | 139 +---
 3 files changed, 110 insertions(+), 31 deletions(-)

diff --git a/builtin/blame.c b/builtin/blame.c
index 5a0388aae..27c64b1c8 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -917,6 +917,7 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
 */
blame_date_width = utf8_strwidth(_("4 years, 11 months ago")) + 1; /* add the null */
break;
+   case DATE_HUMAN:
case DATE_NORMAL:
blame_date_width = sizeof("Thu Oct 19 16:00:04 2006 -0700");
break;
diff --git a/cache.h b/cache.h
index d49092d94..8a6810ee6 100644
--- a/cache.h
+++ b/cache.h
@@ -1428,6 +1428,7 @@ extern struct object *peel_to_type(const char *name, int namelen,
 struct date_mode {
enum date_mode_type {
DATE_NORMAL = 0,
+   DATE_HUMAN,
DATE_RELATIVE,
DATE_SHORT,
DATE_ISO8601,
diff --git a/date.c b/date.c
index 49f943e25..9809ac334 100644
--- a/date.c
+++ b/date.c
@@ -77,22 +77,16 @@ static struct tm *time_to_tm_local(timestamp_t time)
 }
 
 /*
- * What value of "tz" was in effect back then at "time" in the
- * local timezone?
+ * Fill in the localtime 'struct tm' for the supplied time,
+ * and return the local tz.
  */
-static int local_tzoffset(timestamp_t time)
+static int local_time_tzoffset(time_t t, struct tm *tm)
 {
-   time_t t, t_local;
-   struct tm tm;
+   time_t t_local;
int offset, eastwest;
 
-   if (date_overflows(time))
-   die("Timestamp too large for this system: %"PRItime, time);
-
-   t = (time_t)time;
-   localtime_r(&t, &tm);
-   t_local = tm_to_time_t(&tm);
-
+   localtime_r(&t, tm);
+   t_local = tm_to_time_t(tm);
if (t_local == -1)
return 0; /* error; just use +0000 */
if (t_local < t) {
@@ -107,6 +101,20 @@ static int local_tzoffset(timestamp_t time)
return offset * eastwest;
 }
 
+/*
+ * What value of "tz" was in effect back then at "time" in the
+ * local timezone?
+ */
+static int local_tzoffset(timestamp_t time)
+{
+   struct tm tm;
+
+   if (date_overflows(time))
+   die("Timestamp too large for this system: %"PRItime, time);
+
+   return local_time_tzoffset((time_t)time, &tm);
+}
+
 void show_date_relative(timestamp_t time, int tz,
   const struct timeval *now,
   struct strbuf *timebuf)
@@ -191,27 +199,94 @@ struct date_mode *date_mode_from_type(enum date_mode_type type)
return &mode;
 }
 
+static void show_date_no

Re: Hash algorithm analysis

2018-06-11 Thread Linus Torvalds
On Mon, Jun 11, 2018 at 4:27 PM Ævar Arnfjörð Bjarmason
 wrote:
> >
> > And no, I'm not a cryptographer. But honestly, length extension
> > attacks were how both md5 and sha1 were broken in practice, so I'm
> > just going "why would we go with a crypto choice that has that known
> > weakness? That's just crazy".
>
> What do you think about Johannes's summary of this being a non-issue for
> Git in
> https://public-inbox.org/git/alpine.DEB.2.21.1.1706151122180.4200@virtualbox/
> ?

I agree that the fact that git internal data is structured and all
meaningful (and doesn't really have ignored state) makes it *much*
harder to attack the basic git objects, since you not only have to
generate a good hash, the end result has to also *parse* and there is
not really any hidden non-parsed data that you can use to hide the
attack.

And *if* you are using git for source code, the same is pretty much
true even for the blob objects - an attacking object will stand out
like a sore thumb in "diff" etc.

So I don't disagree with Johannes in that sense: I think git does
fundamentally tend to have some extra validation in place, and there's
a reason why the examples for both the md5 and the sha1 attack were
pdf files.

That said, even if git internal ("metadata") objects like trees and
commits tend to not have opaque parts to them and are thus pretty hard
to attack, the blob objects are still an attack vector for projects
that use git for non-source-code (and even source projects do embed
binary files - including pdf files - even though they might not be "as
interesting" to attack). So you do want to protect those too.

And hey, protecting the metadata objects is good just to protect
against annoyances. Sure, you should always sanity check the object at
receive time anyway, but even so, if somebody is able to generate a
blob object that hashes to the same hash as a metadata object (ie tree
or commit), that really could be pretty damn annoying.

And the whole "intermediate hashed state is same size as final hash
state" just _fundamentally_ means that if you find a weakness in the
hash, you can now attack that weakness without having to worry about
the attack being fundamentally more expensive.

That's essentially what SHAttered relied on. It didn't rely on a
secret and a hash and length extension, but it *did* rely on the same
mechanism that a length extension attack relies on, where you can
basically attack the state in the middle with no extra cost.

Maybe some people don't consider it a length extension attack for that
reason, but it boils down to much the same basic situation where you
can attack the internal hash state and cause a state collision. And
you can try to find the patterns that then cause that state collision
when you've found a weakness in the hash.

With SHA3 or k12, you can obviously _also_ try to attack the hash
state and cause a collision, but because the intermediate state is
much bigger than the final hash, you're just making things *way*
harder for yourself if you try that.

  Linus


Re: Hash algorithm analysis

2018-06-11 Thread Linus Torvalds
On Mon, Jun 11, 2018 at 12:29 PM Jonathan Nieder  wrote:
>
> Yves Orton and Linus Torvalds prefer[5] SHA3 over SHA2 because of how
> it is constructed.

Yeah, I really think that it's a mistake to switch to something that
has the same problem SHA1 had.

That doesn't necessarily mean SHA3, but it does mean "bigger
intermediate hash state" (so no length extension attack), which could
be SHA3, but also SHA-512/256 or K12.

Honestly, git has effectively already moved from SHA1 to SHA1DC.

So the actual known attack and weakness of SHA1 should simply not be
part of the discussion for the next hash. You can basically say "we're
_already_ on the second hash, we just picked one that was so
compatible with SHA1 that nobody even really noticed".

The reason to switch is

 (a) 160 bits may not be enough

 (b) maybe there are other weaknesses in SHA1 that SHA1DC doesn't catch.

 (c) others?

Obviously all of the choices address (a).

But at least for me, (b) makes me go "well, SHA2 has the exact same
weak inter-block state attack, so if there are unknown weaknesses in
SHA1, then what about unknown weaknesses in SHA2"?

And no, I'm not a cryptographer. But honestly, length extension
attacks were how both md5 and sha1 were broken in practice, so I'm
just going "why would we go with a crypto choice that has that known
weakness? That's just crazy".

From a performance standpoint, I have to say (once more) that crypto
performance actually mattered a lot less than I originally thought it
would. Yes, there are phases that do care, but they are rare.

For example, I think SHA1 performance has probably mattered most for
the index and pack-file, where it's really only used as a fancy CRC.
For most individual object cases, it is almost never an issue.

From a performance angle, I think the whole "256-bit hashes are
bigger" is going to be the more noticeable performance issue, just
because things like delta compression and zlib - both of which are
very *real* and present performance issues - will have more data that
they need to work on. The performance difference between different
hashing functions is likely not all that noticeable in most common
cases as long as we're not talking orders of magnitude.

And yes, I guess we're in the "approaching an order of magnitude"
performance difference, but we should actually compare not to OpenSSL
SHA1, but to SHA1DC. See above.

Personally, the fact that the Keccak people would suggest K12 makes me
think that should be a front-runner, but whatever. I don't think the
128-bit preimage case is an issue, since 128 bits is the brute-force
cost for any 256-bit hash.

But hey, I picked sha1 to begin with, so take any input from me with
that historical pinch of salt in mind ;)

Linus


Re: [RFC] git gc "--prune=now" semantics considered harmful

2018-06-01 Thread Linus Torvalds
On Fri, Jun 1, 2018 at 2:04 AM Jeff King  wrote:
>
> We'd also accept relative times like "5.minutes.ago" (in fact, the
> default is a relative 2.weeks.ago, though it's long enough that the
> difference between "2 weeks" and "2 weeks plus 5 minutes" may not matter
> much). So we probably ought to just normalize _everything_ without even
> bothering to match "now". It's a noop for non-relative times, but that's
> OK.

Heh. That's what I tried to do at first.

It's surprisingly hard.

You can't normalize it as a date, because we have a few times that
aren't expressible as dates because they are just the maximum value
(ie "all").

And then I tried to just express it as a standard numerical
value, which we accept on input. But we *only* accept that if it's
more than eight digits.

And regardless, you need to special-case "now", since
parse_expiry_date() special cases it.

Or you'd need to just make another version of parse_expiry_date() entirely.

End result: I only special-cased "now".

> There are still possibilities for a race, even with the grace period.

Agreed. But you *really* have to work at it.

  Linus


Re: [RFC] git gc "--prune=now" semantics considered harmful

2018-05-26 Thread Linus Torvalds
On Sat, May 26, 2018 at 4:31 PM Junio C Hamano  wrote:

> *That* is something I don't do.  After all, I am fully aware that I
> have started end-of-day ritual by that time, so I won't even look at
> a new patch (or a pull request for that matter).

Sounds like you're more organized about the end-of-day ritual than I am.
For me the gc is not quite so structured.

> I however have to wonder if there are opposite "oops" end-user
> operation we also need to worry about, i.e. we are doing a large-ish
> fetch, and get bored and run a gc from another terminal.  Perhaps
> *that* is a bit too stupid to worry about?  Auto-gc deliberately
> does not use 'now' because it wants to leave a grace period to avoid
> exactly that kind of race.

For me, a "pull" never takes that long.  Sure, any manual merging and the
writing of the commit message might take a while, but it's "foreground"
activity for me, I'd not start a gc in the middle of it.

So at least to me, doing "git fsck --full" and "git gc --prune=now" are
somewhat special because they take a while and tend to be background things
that I "start and forget" about (the same way I sometimes start and forget
a kernel build).

Which is why that current "git gc --prune=now" behavior seems a bit
dangerous.

Linus


[RFC] git gc "--prune=now" semantics considered harmful

2018-05-26 Thread Linus Torvalds

So this is a RFC patch, I'm not sure how much people really care, but I 
find the current behavior of "git gc --prune=now" to be unnecessarily 
dangerous.

There's two issues with it:

 (a) parse_expiry_date() considers "now" to be special, and it actually
 doesn't mean "now" at all, it means "everything".

 (b) the date parsing isn't actually done "now", it's done *after* gc has 
 already run, and we run "git prune --expire". So even if (a) wasn't 
 true, "--prune=now" wouldn't actually mean "now" when the user 
expects it to happen, but "after doing repacking".

I actually think that the "parse_expiry_date()" behavior makes sense 
within the context of "git prune --expire", so I'm not really complaining 
about (a) per se. I just think that what makes sense within the context of 
"git prune" does *not* necessarily make sense within the context of "git 
gc".

Why do I care? I end up doing lots of random things in my local 
repository, and I also end up wanting to keep my repository fairly clean, 
so I tend to do "git gc --prune=now" to just make sure everything is 
packed and I've gotten rid of all the temporary stuff that so often 
happens when doing lots of git merges (which is what I do). 

You won't see those temporary objects for the usual trivial merges, but 
whenever you have a real recursive merge with automated conflict 
resolution, there will be things like those temporary merge-only objects 
for the 3-way base merge state. 

Does my use pattern of "git gc --prune=now" make sense? Maybe not. But 
it's what I've gotten used to, and it's at least not entirely insane.

But at least once now, I've done that "git gc" at the end of the day, and 
a new pull request comes in, so I do the "git pull" without even thinking 
about the fact that "git gc" is still running.

And then the "--prune=now" behavior is actually really pretty dangerous. 
Because it will prune *all* unreachable objects, even if they are only 
*currently* unreachable because they are in the process of being unpacked 
by the concurrent "git fetch" (and I didn't check - I might just have been 
unlucky, but I think "git prune" ignores FETCH_HEAD).

So I actually would much prefer that for git gc, "--prune=now" means

 (a) "now"

 (b) now at the _start_ of the "git gc" operation, not the time at
 the _end_ of the operation when we've already spent a minute or
 two doing repacking and are now doing the final pruning.

anyway, with that explanation in mind, I'm appending a patch that is 
pretty small and does that. It's a bit hacky, but I think it still makes 
sense.

Comments?

Note that this really isn't likely very noticeable on most projects. When 
I do "git gc" on a fairly well-packed repo of git itself, it takes under 
4s for me. So the window for that whole "do git pull at the same time" is 
simply not much of an issue.

For the kernel, "git gc" takes a minute and a half on the same machine 
(again, this is already a packed repo, it can be worse). So there's a much 
bigger window there to do something stupid,

 Linus

---
 builtin/gc.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index c4777b244..98368c8b5 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -535,8 +535,12 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
if (argc > 0)
usage_with_options(builtin_gc_usage, builtin_gc_options);
 
-   if (prune_expire && parse_expiry_date(prune_expire, &dummy))
-   die(_("failed to parse prune expiry value %s"), prune_expire);
+   if (prune_expire) {
+   if (!strcmp(prune_expire, "now"))
+   prune_expire = show_date(time(NULL), 0, DATE_MODE(ISO8601));
+   if (parse_expiry_date(prune_expire, &dummy))
+   die(_("failed to parse prune expiry value %s"), prune_expire);
+   }
 
if (aggressive) {
argv_array_push(&repack, "-f");


Re: Silly "git gc" UI issue.

2018-04-18 Thread Linus Torvalds
On Wed, Apr 18, 2018 at 7:16 PM, Junio C Hamano  wrote:
> A few commands that parse --expire= command line option
> behaves silly when given nonsense input.  For example

So this patch definitely improves on the error message.

But look at what happens for the kernel:

[torvalds@i7 linux]$ time git gc --prune=npw
Counting objects: 6006319, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (912166/912166), done.
Writing objects: 100% (6006319/6006319), done.
Total 6006319 (delta 5050577), reused 6006319 (delta 5050577)
fatal: malformed expiration date 'npw'
error: failed to run prune

real	1m4.376s
user	0m59.963s
sys	0m5.182s



Yes, I get that nice "malformed expiration date 'npw'" error, but I
get it after 64 seconds has passed.

So I think builtin/gc.c should use this same parse_expiry_date() /
parse_opt_expiry_date_cb() thing for its timestamp parsing.

It does actually seem to do that for the gc_log_expire value that it
loads from the config file.

Maybe something like the attached patch? Then I get:

[torvalds@i7 linux]$ time git gc --prune=npw
fatal: Failed to parse prune expiry value npw

real	0m0.004s
user	0m0.002s
sys	0m0.002s

and you could smush it into your commit (if you want my sign-off, take it)

  Linus
 builtin/gc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/builtin/gc.c b/builtin/gc.c
index 3e67124ea..a4b20aaaf 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -354,6 +354,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	const char *name;
 	pid_t pid;
 	int daemonized = 0;
+	timestamp_t dummy;
 
 	struct option builtin_gc_options[] = {
 		OPT__QUIET(&quiet, N_("suppress progress reporting")),
@@ -392,6 +393,9 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	if (argc > 0)
 		usage_with_options(builtin_gc_usage, builtin_gc_options);
 
+	if (parse_expiry_date(prune_expire, &dummy))
+		die(_("Failed to parse prune expiry value %s"), prune_expire);
+
 	if (aggressive) {
 		argv_array_push(, "-f");
 		if (aggressive_depth > 0)


Re: Silly "git gc" UI issue.

2018-04-18 Thread Linus Torvalds
On Wed, Apr 18, 2018 at 6:52 PM, Junio C Hamano  wrote:
>
> Regardless of your originai "git gc" issue, we should make "prune"
> say something on this error.  And when we do so, I would think that
> error message will come before the final "error: failed to run
> prune".

So to me, the real failure is the fact that it spent a lot of time
packing my repository before it then failed the prune at the end.

I don't actually mind the quality of the error message too much -
although it could be improved.

I mind the "oh, goddamnit, you just spent over a minute of CPU time on
my fairly high-end desktop, and _then_ you decided to tell me that I'm
a moron and couldn't type 'now' correctly".

So to me, the big deal would be that builtin/gc.c should validate the
date *before* it starts, instead of doing all that work, and then
executing "git prune" with invalid arguments..

Linus


Silly "git gc" UI issue.

2018-04-18 Thread Linus Torvalds
Ok, this is ridiculous, but I've done it several times, so I thought
I'd finally mention it to somebody on the git list that may care:

  "My name is Linus, and I'm a klutz".

What does that have to do with anything?

Now, imagine you're a klutz. Imagine you want to clean up your .git
directory. Combine those things, and what do you get?

You get this:

   git gc --prune=npw

Yeah, that "npw" should be "now", which is where the klutz thing comes in.

It turns out that git reacts ridiculously badly to this.

I'm just assuming that everybody else is scarily competent if I'm the
first to have reported this.

  Linus


Re: Optimizing writes to unchanged files during merges?

2018-04-15 Thread Linus Torvalds
On Sun, Apr 15, 2018 at 6:44 PM, Junio C Hamano  wrote:
>
> I think Elijah's corrected was_tracked() also does not care "has
> this been renamed".

I'm perfectly happy with the slightly smarter patches. My patch was
really just an RFC, and something I had tried out.

> One thing that makes me curious is what happens (and what we want to
> happen) when such a "we already have the changes the side branch
> tries to bring in" path has local (i.e. not yet in the index)
> changes.  For a dirty file that trivially merges (e.g. a path we
> modified since our histories forked, while the other side didn't do
> anything, has local changes in the working tree), we try hard to
> make the merge succeed while keeping the local changes, and we
> should be able to do the same in this case, too.

I think it might be nice, but probably not really worth it.

I find the "you can merge even if some files are dirty" to be really
convenient, because I often keep stupid test patches in my tree that I
may not even intend to commit, and I then use the same tree for
merging.

For example, I sometimes end up editing the Makefile for the release
version early, but I won't *commit* that until I actually cut the
release. But if I pull some branch that has also changed the Makefile,
it's not worth any complexity to try to be nice about the dirty state.

If it's a file that actually *has* been changed in the branch I'm
merging, I'm more than happy to just stage the patch (or throw it
away - I think it's about 50:50 for me).

So I don't think it's a big deal, and I'd rather have the merge fail
very early with "that file has seen changes in the branch you are
merging" than add any real complexity to the merge logic.

Linus


Re: Optimizing writes to unchanged files during merges?

2018-04-13 Thread Linus Torvalds
On Fri, Apr 13, 2018 at 10:39 AM, Stefan Beller  wrote:
>
> Would s/read/xread/ make sense in working_tree_matches ?

Makes sense, yes.

That patch was really more of an RFD than anything that should be applied.

I would like to see the "might be same" flag pushed down so that we
don't do this comparison when it makes no sense, even if the stat info
makes that less of an issue than it might otherwise be.

And maybe that whole function should be taught to do the "verify CE
against file", and do the proper full cache state check (ie time etc)
instead of doing the read.

So maybe the best option is a combination of the two patches: my
"verify the working tree state" _together_ with the was_tracked()
logic that looks up a cache entry.

Again, the problem with "look up the index state" is about files
possibly being renamed as part of the merge, but if we then check the
index state against the actual contents of the working directory, that
isn't an issue.

  Linus


Re: Optimizing writes to unchanged files during merges?

2018-04-13 Thread Linus Torvalds
On Fri, Apr 13, 2018 at 12:02 AM, Elijah Newren  wrote:
>
> I hope you don't mind me barging into your conversation

I was getting tired of my own rambling anyway..

> However, it turns out we have this awesome function called
> "was_tracked(const char *path)" that was intended for answering this
> exact question.  So, assuming was_tracked() isn't buggy, the correct
> patch for this problem would look like:

Apparently that causes problems, for some odd reason.

I like the notion of checking the index, but it's not clear that the
index is reliable in the presence of renames either.

>   A big series
> including that patch was merged to master two days ago, but
> unfortunately that exact patch was the one that caused some
> impressively awful fireworks[1].

Yeah, so this code is fragile.

How about we take a completely different approach? Instead of relying
on fragile (but clever) tests, why not rely on stupid brute force?

Yeah, yeah, it's bad to be stupid, but sometimes simple and stupid
really does work.

See the attached patch. It gets rid of all the subtle "has this been
renamed" tests entirely, and just _always_ does that final
update_file().

But instead, it makes update_file() a bit smarter, and says "before we
write this file out, let's see if it's already there and already has
the expected contents"?

Now, it really shouldn't be _quite_ as stupid as that: we should
probably set a flag in the "hey, the oid matches, maybe it's worth
checking", so that it doesn't do the check in the cases where we know
the merge has done things.

But it's actually *fairly* low cost, because before it reads the file
it at least checks that file length matches the expected length (and
that the user permission bits match the expected mode).

So if the file doesn't match, most of the time the real cost will just
be an extra 'open/fstat/close()' sequence. That's pretty cheap.

So even the completely stupid approach is probably not too bad, and it
could be made smarter.

NOTE! I have *NOT* tested this on anything interesting. I tested the
patch on my stupid test-case, but that's it. I didn't even bother
re-doing the kernel merge that started this.

Comments? Because considering the problems this code has had, maybe
"stupid" really is the right approach...

[ Ok, I lied. I just tested this on the kernel merge. It worked fine,
and avoided modifying include/linux/mm.h ]

 Linus
 merge-recursive.c | 42 +-
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index 0c0d48624..ed2200065 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -815,6 +815,32 @@ static int make_room_for_path(struct merge_options *o, const char *path)
 	return err(o, msg, path, _(": perhaps a D/F conflict?"));
 }
 
+static int working_tree_matches(const char *path, const char *buf, unsigned long size, unsigned mode)
+{
+	int fd, matches;
+	struct stat st;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return 0;
+	matches = 0;
+	if (!fstat(fd, &st) && st.st_size == size && S_ISREG(st.st_mode) && !(0700 & (st.st_mode ^ mode))) {
+		char tmpbuf[1024];
+		while (size) {
+			int n = read(fd, tmpbuf, sizeof(tmpbuf));
+			if (n <= 0 || n > size)
+				break;
+			if (memcmp(tmpbuf, buf, n))
+				break;
+			buf += n;
+			size -= n;
+		}
+		matches = !size;
+	}
+	close(fd);
+	return matches;
+}
+
 static int update_file_flags(struct merge_options *o,
 			 const struct object_id *oid,
 			 unsigned mode,
@@ -856,6 +882,8 @@ static int update_file_flags(struct merge_options *o,
 size = strbuf.len;
 buf = strbuf_detach(, NULL);
 			}
+			if (working_tree_matches(path, buf, size, mode))
+				goto free_buf;
 		}
 
 		if (make_room_for_path(o, path) < 0) {
@@ -1782,20 +1810,8 @@ static int merge_content(struct merge_options *o,
 
 	if (mfi.clean && !df_conflict_remains &&
 	oid_eq(, a_oid) && mfi.mode == a_mode) {
-		int path_renamed_outside_HEAD;
 		output(o, 3, _("Skipped %s (merged same as existing)"), path);
-		/*
-		 * The content merge resulted in the same file contents we
-		 * already had.  We can return early if those file contents
-		 * are recorded at the correct path (which may not be true
-		 * if the merge involves a rename).
-		 */
-		path_renamed_outside_HEAD = !path2 || !strcmp(path, path2);
-		if (!path_renamed_outside_HEAD) {
-			add_cacheinfo(o, mfi.mode, , path,
-  0, (!o->call_depth), 0);
-			return mfi.clean;
-		}
+		/* We could set a flag here and pass it to "update_file()" */
 	} else
 		output(o, 2, _("Auto-merging %s"), path);
 


Re: Optimizing writes to unchanged files during merges?

2018-04-12 Thread Linus Torvalds
[ Still talking to myself. Very soothing. ]

On Thu, Apr 12, 2018 at 4:55 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
> [ Talking to myself ]
>
> Did it perhaps mean to say
>
> path_renamed_outside_HEAD = path2 && !strcmp(path, path2);
>
> instead?

Probably not correct, but making that change makes my test-case work.

It probably breaks something important, though. I didn't look at why
that code happened in the first place.

But it's nice to know I'm at least looking at the right code while I'm
talking to myself.

   Linus


Re: Optimizing writes to unchanged files during merges?

2018-04-12 Thread Linus Torvalds
[ Talking to myself ]

On Thu, Apr 12, 2018 at 4:41 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> Oddly, that *already* has the check:
>
> if (mfi.clean && !df_conflict_remains &&
> oid_eq(, a_oid) && mfi.mode == a_mode) {
> int path_renamed_outside_HEAD;
> output(o, 3, _("Skipped %s (merged same as existing)"), path);
>
> but that doesn't seem to actually trigger for some reason.

Actually, that check triggers just fine.

> But the code really seems to have the _intention_ of skipping the case
> where the result ended up the same as the source.
>
> Maybe I'm missing something.

The later check that does

/*
 * The content merge resulted in the same file contents we
 * already had.  We can return early if those file contents
 * are recorded at the correct path (which may not be true
 * if the merge involves a rename).
 */
path_renamed_outside_HEAD = !path2 || !strcmp(path, path2);
if (!path_renamed_outside_HEAD) {

will see that 'path2' is NULL, and not trigger this early out case,
and then this all falls back to the normal cases after all.

So I think that 'path_renamed_outside_HEAD' logic is somehow broken.

Did it perhaps mean to say

path_renamed_outside_HEAD = path2 && !strcmp(path, path2);

instead?

See commit 5b448b853 ("merge-recursive: When we detect we can skip an
update, actually skip it") which really implies we want to actually
skip it (but then we don't anyway).

Also see commit b2c8c0a76 ("merge-recursive: When we detect we can
skip an update, actually skip it") which was an earlier version, and
which *actually* skipped it, but it was reverted because of that
rename issue.

Adding Elijah Newren to the cc, because he was working on this back
then, and is still around, and still working on merging ;)

   Linus


Re: Optimizing writes to unchanged files during merges?

2018-04-12 Thread Linus Torvalds
On Thu, Apr 12, 2018 at 4:35 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> in process_entry(), and I think we could just there add a test for if
> o_oid,o_mode == a_oid,a_mode or something?

Actually, not process_entry, but merge_content().

Oddly, that *already* has the check:

if (mfi.clean && !df_conflict_remains &&
oid_eq(, a_oid) && mfi.mode == a_mode) {
int path_renamed_outside_HEAD;
output(o, 3, _("Skipped %s (merged same as existing)"), path);

but that doesn't seem to actually trigger for some reason.

But the code really seems to have the _intention_ of skipping the case
where the result ended up the same as the source.

Maybe I'm missing something.

 Linus


Re: Optimizing writes to unchanged files during merges?

2018-04-12 Thread Linus Torvalds
On Thu, Apr 12, 2018 at 4:17 PM, Junio C Hamano  wrote:
>
> A bit of detour.  "Change in side branch happened to be a subset of
> the change in trunk and gets subsumed, but we end up writing the
> same result" happens also with the simpler resolve strategy.
>
> Here is a fix.

Heh. Except git-merge-one-file.sh is *only* used for that case, so
this doesn't fix the normal case.

I verified that the normal case situation is that
"update_file_flags()" thing that checks out the end result.

It's called by this case:

} else if (a_oid && b_oid) {
/* Case C: Added in both (check for same permissions) and */
/* case D: Modified in both, but differently. */
clean_merge = merge_content(o, path,
o_oid, o_mode, a_oid,
a_mode, b_oid, b_mode,
NULL);

in process_entry(), and I think we could just there add a test for if
o_oid,o_mode == a_oid,a_mode or something?

  Linus


Re: Optimizing writes to unchanged files during merges?

2018-04-12 Thread Linus Torvalds
On Thu, Apr 12, 2018 at 2:46 PM, Junio C Hamano  wrote:
>
> Thanks for a clear description of the issue.  It does sound
> interesting.

I decided to show it with a simpler case that could be scripted and
doesn't need the kernel.

NOTE! This obviously doesn't happen for files with the trivial merge
cases (ie the ones that don't require any real merging at all).

There *is* a three-way merge going on, it's just that the end result
is the same as the original file.

Example script just appended here at the end.

NOTE: The script uses "ls -l --full-time", which afaik is a GNU ls
extension. So the script below is not portable.

 Linus

---

  # Create throw-away test repository
  mkdir merge-test
  cd merge-test
  git init

  # Create silly baseline file with 10 lines of numbers in it
  for i in $(seq 1 10); do echo $i; done > a
  git add a
  git commit -m"Original"

  # Make a branch that changes '9' to 'nine'
  git checkout -b branch
  sed -i 's/9/nine/' a
  git commit -m "Nine" a

  # On the master, change '2' to 'two' _and_ '9' to 'nine'
  git checkout master
  sed -i 's/9/nine/' a
  sed -i 's/2/two/' a
  git commit -m "Two and nine" a

  # sleep to show the time difference
  sleep 1

  # show the date on 'a' and do the merge
  ls -l --full-time a
  git merge -m "Merge contents" branch

  # The merge didn't change the contents, but did rewrite the file
  ls -l --full-time a


Optimizing writes to unchanged files during merges?

2018-04-12 Thread Linus Torvalds
So I just had an interesting experience that has happened before too,
but this time I decided to try to figure out *why* it happened.

I'm obviously in the latter part of the kernel merge window, and
things are slowly calming down. I do the second XFS merge during this
window, and it brings in updates just to the fs/xfs/ subdirectory, so
I expect that my test build for the full kernel configuration should
be about a minute.

Instead it recompiles pretty much *everything*, and the test build
takes 22 minutes.

This happens occasionally, and I blame gremlins. But this time I
decided to look at what the gremlins actually *are*.

The diff that git shows for the pull was very clear: only fs/xfs/
changes. But doing

  ls -tr $(git ls-files '*.[chS]') | tail -10

gave the real reason: in between all the fs/xfs/xyz files was this:

include/linux/mm.h

and yeah, that rather core header file causes pretty much everything
to be re-built.

Now, the reason it was marked as changed is that the xfs branch _had_
in fact changed it, but the changes were already upstream and got
merged away. But the file still got written out (with the same
contents it had before the merge), and 'make' obviously only looks at
modification time, so make rebuilt everything.

Now, because it's still the merge window, I don't have much time to
look into this, but I was hoping somebody in git land would like to
give it a quick look. I'm sure I'm not the only one to have ever been
hit by this, and I don't think the kernel is the only project to hit
it either.

Because it would be lovely if the merging logic would just notice "oh,
that file doesn't change", and not even write out the end result.

For testing, the merge that triggered this git introspection is kernel
commit 80aa76bcd364 ("Merge tag 'xfs-4.17-merge-4' of
git://git.kernel.org/pub/scm/fs/xfs/xfs-linux"), which can act as a
test-case. It's a clean merge, so no kernel knowledge necessary: just
re-merge the parents and see if the modification time of
include/linux/mm.h changes.
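
A rough sketch of that reproduction, assuming a kernel clone that has
the merge commit (and GNU ls for --full-time):

  git checkout --detach 80aa76bcd364^1
  ls -l --full-time include/linux/mm.h
  git merge --no-edit 80aa76bcd364^2
  # contents of mm.h are unchanged, but the mtime has been bumped
  ls -l --full-time include/linux/mm.h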

I'm guessing some hack in update_file_flags() to say "if the new
object matches the old one, don't do anything" might work. Although I
didn't look if we perhaps end up writing to the working tree copy
earlier already.

Looking at the blame output of that function, most of it is really
old, so this "problem" goes back a long long time.

Just to clarify: this is absolutely not a correctness issue. It's not
even a *git* performance issue. It's literally just a "not updating
the working tree when things don't change results in better behavior
for other tools".

So if somebody gets interested in this problem, that would be great.
And if not, I'll hopefully remember to come back to this next week
when the merge window is over, and take a second look.

 Linus


Re: [PATCH 3/3] Makefile: optionally symlink libexec/git-core binaries to bin/git

2018-03-19 Thread Linus Torvalds
On Mon, Mar 19, 2018, 04:34 Johannes Schindelin
 wrote:
>
> This is a real problem.

No it isn't.

We already handle those special cases specially, and install them in
the bin directory (as opposed to libexec). And it all works fine.

Look into the bin directory some day. You'll find things like

  git-cvsserver
  gitk
  git-receive-pack
  git-shell
  git-upload-archive
  git-upload-pack

there, and the fact that a couple of them happen to be built-ins is an
IMPLEMENTATION DETAIL, not a "Oh we should have used just 'git' for
them".

The design of having separate programs is the *good* conceptual
design. And we damn well should keep it for these things that are used
for special purposes.

The fact that two of them have become built-ins as part of the git
binary is incidental. It shouldn't be visible in the names, because it
really is just an internal implementation thing, not anything
fundamental.

> And it is our own darned fault because we let an
> implementation detail bleed into a protocol. We could have designed that a
> lot better.

And by "we" you clearly mean "not you", and by "we could have designed
that a lot better" you must mean "and it was very well designed by
competent people who didn't use bad operating systems".

> Of course we should fix this, though. There is literally no good reason

Go away.

We shouldn't fix it, it's all fine as-is, and there were tons of
f*cking good reasons for why git did what it did. The main one being
"it's a collection of scripts", which was what git _was_, for
chrissake. And using spaces and running some idiotic and
hard-to-verify script de-multiplexer is the WRONG THING for things
like "git-shell" and "git-receive-pack" and friends.

Right now you can actually verify exactly what "git-shell" does. Or
you could replace - or remove - it entirely if you don't like it. And
never have to worry about running "git" with some "shell" subcommand.

And you know that it's not an alias, for example.  Because "git-xyz"
simply does not look up aliases.

So really. Go away, Johannes. Your concerns are complete and utter BS.

The real problem is that Windows is badly designed, but since it's
easy to work around (by using hard-linking on Windows), nobody sane
cares.

The solution is simple, and was already suggested: use symlinks (like
we used to!) on non-windows systems. End of story.

And for the libexec thing, we might want to deprecate those names, if
somebody wants to, but it's not like it actually hurts, and it gives
backwards compatibility.

Btw, real Windows people know all about backwards compatibility. Ask
around competent people inside MS whether it's an important thing.

So stop this idiotic "bad design" crap. Somebody working on Windows
simply can't afford your attitude.

Somebody who didn't design it in the first place can't afford your attitude.

 Linus


Re: [PATCH 3/3] Makefile: optionally symlink libexec/git-core binaries to bin/git

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 10:05 AM, Johannes Schindelin
 wrote:
> The most sensible thing, of course, would be to *not* link the builtins at
> all. I mean, we deprecated the dashed form (which was a design mistake,
> whether you admit it or not) a long time ago.

That's probably not a bad idea for the builtin commands. At least as an option.

We do end up still using the dashed form for certain things, but they
are already special-cased (ie things like "git-receive-pack" and
"git-shell" that very much get executed directly, and for fundamental
reasons).
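
To make that concrete: when you fetch over ssh, the remote side ends up
running the stand-alone dashed binary directly, roughly like this
(a sketch, with a hypothetical host and repository path):

  ssh git@example.org "git-upload-pack 'linux.git'"

and git-shell exists exactly to restrict a login shell to that small set
of commands.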

As to it being a design mistake? No, not really. It made a lot of
sense at the time. The fact is, the problem is Windows, not git. I
know you have your hangups, but that's your problem.

   Linus


Re: [PATCH 3/3] Makefile: optionally symlink libexec/git-core binaries to bin/git

2018-03-14 Thread Linus Torvalds
On Wed, Mar 14, 2018 at 3:14 AM, Ævar Arnfjörð Bjarmason
 wrote:
> On Wed, Mar 14 2018, Johannes Sixt jotted:
>>
>> It is important to leave the default at hard-linking the binaries,
>> because on Windows symbolic links are second class citizens (they
>> require special privileges and there is a distinction between link
>> targets being files or directories). Hard links work well.
>
> Yeah makes sense. I just want to add this as an option, and think if
> it's proven to be un-buggy we could probably turn it on by default on
> the *nix's if people prefer that, but yeah, we'll definitely need the
> uname detection.

I definitely would prefer to make symlinks the default on unix.

It's what we used to do (long long ago), and as you pointed out, it's
a lot clearer what's going on too when you don't have to look at inode
numbers and link counts.

Forcing hardlinking everywhere by default just because Windows
filesystems suck donkey ass through a straw is not the right thing
either.

Linus


Re: [PATCH] revision: drop --show-all option

2018-02-21 Thread Linus Torvalds
On Wed, Feb 21, 2018 at 3:27 PM, Jeff King  wrote:
>
> We'll skip the usual deprecation period because this was
> explicitly a debugging aid that was never documented.

Ack. I don't think I've used it since, and probably nobody else ever used it.

  Linus


Re: linux-next: unnecessary merge in the v4l-dvb tree

2018-02-13 Thread Linus Torvalds
On Tue, Feb 13, 2018 at 9:18 AM, Junio C Hamano  wrote:
>
> That makes me wonder if another heuristic I floated earlier is more
> appropriate.  When merging a tag object T, if refs/tags/T exists and
> it is that tag object, then an updated "merge" would default to "--ff";
> otherwise, it would keep the current default of creating a merge even
> when we could fast-forward, in order to record that tag T in the
> resulting history.

Oooh. Yes, that sounds like the right thing to do.

So the "no fast-forward" logic would trigger only if the name we used
for merging is one of the temporary ones (ie .git/{FETCH,MERGE}_HEAD),
not if the mentioned tag is already a normal tag reference.

Then it's very explicitly about "don't lose the signing information".

I'd still have to teach people to use "--ff-only" if they don't do the
"fetch and merge" model but literally just do  "git pull upstream
vX.Y", but at least the case Mauro describes would automatically just
DTRT.

Me likey.

   Linus


Re: linux-next: unnecessary merge in the v4l-dvb tree

2018-02-12 Thread Linus Torvalds
On Mon, Feb 12, 2018 at 1:44 PM, Junio C Hamano  wrote:
>
> But I wonder why "update to upstream" is merging a signed tag in the
> first place.  Wouldn't downstream's "try to keep up with" pull be
> grabbing from branch tips, not tags?

I'm actually encouraging maintainers to *not* start their work on some
random "kernel of the day".

Particularly during the kernel merge window, the upstream master
branch can be pretty flaky. It's *not* a good point to start new
development on, so if you're a maintainer, you really don't want to
use that as the basis for your work for the next merge window.

So I encourage people to use a stable point for new development, and
particularly actual release kernels. The natural way to do that is
obviously just to create a new branch:

   git checkout -b topicbranch v4.15

and now you have a good new clean branch for your development.

But clearly we've got a few people who have gotten used to the whole
"git pull" convenience of both fetching and updating.

Some maintainers don't even use topic branches, because their main
work is just merging work by sub-developers (that goes for my own
tree: I use topic branches for merging patch queues and to
occasionally track my own experimental patches, but 99% of the time
I'm just on "master" and obviously pull other peoples branches).

And some maintainers end up using multiple repositories as branches
(the old _original_ git model). Again, you can just use "git fetch +
git reset", of course, but that's a bit unsafe. In contrast, doing
"git pull --ff-only" is a safe convenient operation that does both the
fetch and the update to whatever state.

But you do need that "--ff-only" to avoid the merge.
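
Spelled out as commands, the difference is roughly (a sketch):

   # the "unsafe" manual update - this happily throws away local commits
   git fetch origin
   git reset --hard origin/master

   # the safe convenience - fetch and update, but refuse anything that
   # is not a plain fast-forward
   git pull --ff-only origin master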

  Linus


Re: linux-next: unnecessary merge in the v4l-dvb tree

2018-02-12 Thread Linus Torvalds
On Mon, Feb 12, 2018 at 1:15 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> The reasoning is to avoid losing the signature from the tag (when
> merging a signed tag, the signature gets inserted into the merge
> commit itself - use "git log --show-signature" to see them).

I think the commit that actually introduced the behavior was

fab47d057: merge: force edit and no-ff mode when merging a tag object

back in 2011, so we've had this behavior for a long time. So it's
probably not worth tweaking the behavior any more, and maybe we
need to educate people to not update to other peoples state with "git
pull".

Maybe we could just tell people to have something like

   git config --global alias.update 'pull --ff-only'

and use that for "try to update to upstream".

   Linus


Re: linux-next: unnecessary merge in the v4l-dvb tree

2018-02-12 Thread Linus Torvalds
On Mon, Feb 12, 2018 at 1:00 PM, Stephen Rothwell  wrote:
>
> Linus, this happens a bit after the merge window, so I am wondering
> about the rationale of not doing a fast forward merge when merging a
> signed tag (I forget the reasoning).

The reasoning is to avoid losing the signature from the tag (when
merging a signed tag, the signature gets inserted into the merge
commit itself - use "git log --show-signature" to see them).

So when I merge a signed tag, I do *not* want to fast-forward to the
top commit, because then I'd lose the signature from the tag. Thus the
"merging signed tags are non-fast-forward by default" reasoning.

But, yes, that reasoning is really only valid for proper merges of new
features, not for back-merges.

The problem, of course, is that since git is distributed, git doesn't
know who is "upstream" and who is "downstream", so there's no
_technical_ difference between merging a development tree, and a
development tree doing a back-merge of the upstream tree.

Maybe it was a mistake to make signed tag merges non-fast-forward,
since they cause these kinds of issues with people who use "pull" to
update their otherwise unmodified trees.

I can always teach myself to just use --no-ff, since I end up doing
things like verifying the signatures anyway.

Junio, comments?

   Linus


Left-over COMMIT_EDITMSG file in gitdir

2018-02-08 Thread Linus Torvalds
This may be intentional, but if so, it's not obvious..

Back long long ago, the original "git commit" shell script got rewritten in C.

In that rewrite, removing some temporary files seems to have been left
out. At least one: .git/COMMIT_EDITMSG.

In the original commit.sh shell script, we can find this:

  rm -f "$GIT_DIR/COMMIT_MSG" "$GIT_DIR/COMMIT_EDITMSG" "$GIT_DIR/SQUASH_MSG"

after creating the commit.  In the C implementation, we do have a
number of "unlink(...)" calls:

unlink(git_path_cherry_pick_head());
unlink(git_path_revert_head());
unlink(git_path_merge_head());
unlink(git_path_merge_msg());
unlink(git_path_merge_mode());
unlink(git_path_squash_msg());

but no

unlink(git_path_commit_editmsg());

and that *seems* to be an oversight.
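
It's easy to see the leftover file (a quick sketch, in any scratch
repository you don't mind committing to):

   git commit --allow-empty -m "test commit"
   ls .git/COMMIT_EDITMSG    # still there after the commit finishes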

Similarly, builtin/tag.c leaves a stale TAG_EDITMSG file behind.
Again, that actually did exist in the original shell script, which
used to do

  trap 'rm -f "$GIT_DIR"/TAG_TMP* "$GIT_DIR"/TAG_FINALMSG
"$GIT_DIR"/TAG_EDITMSG' 0

which caused that file to be removed at exit.

I guess I really don't care much, but those two files struck me when I
was doing a "git gc --prune=now" and looked at what was still left in
the .git directory..

If this is all intentional, never mind.

   Linus


Re: [PATCH] enable core.fsyncObjectFiles by default

2018-01-17 Thread Linus Torvalds
On Wed, Jan 17, 2018 at 3:52 PM, Theodore Ts'o  wrote:
>
> Well, let's be fair; this is something *ext3* got wrong, and it was
> the default file system back them.

I'm pretty sure reiserfs and btrfs did too..

  Linus


Re: [PATCH] enable core.fsyncObjectFiles by default

2018-01-17 Thread Linus Torvalds
On Wed, Jan 17, 2018 at 3:16 PM, Ævar Arnfjörð Bjarmason
 wrote:
>
> Or does overall FS activity and raw throughput (e.g. with an ISO copy)
> matter more than general FS contention?

Traditionally, yes.

Also note that none of this is about "throughput". It's about waiting
for a second or two when you do a simple "git commit" or similar.

Also, hardware has changed. I haven't used a rotational disk in about
10 years now, and I don't worry about latencies so much anymore.

The fact that you get ~100k iops indicates that you probably aren't
using those stupid platters of rust either. So I doubt you can even
trigger the bad cases.

Linus


Re: [PATCH] enable core.fsyncObjectFiles by default

2018-01-17 Thread Linus Torvalds
On Wed, Jan 17, 2018 at 2:07 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> The original git design was very much to write each object file
> without any syncing, because they don't matter since a new object file
> - by definition - isn't really reachable. Then sync before writing the
> index file or a new ref.

.. actually, I think it originally sync'ed only before starting to
actually remove objects due to repacking.

The theory was that you can lose the last commit series or whatever,
and have to git-fsck, and have to re-do it, but you could never lose
long-term work. If your machine crashes, you still remember what you
did just before the crash.

That theory may have been more correct back in the days than it is
now. People who use git might be less willing to treat it like a
filesystem that you can fsck than I was back 10+ years ago.

It's worth noting that the commit that Junio pointed to (from 2008)
didn't actually change any behavior. It just allowed people who cared
to change behavior. The original "let's not waste time on fsync every
object write, because we can just re-create the state anyway" behavior
goes back to 2005.

  Linus


Re: [PATCH] enable core.fsyncObjectFiles by default

2018-01-17 Thread Linus Torvalds
On Wed, Jan 17, 2018 at 1:44 PM, Ævar Arnfjörð Bjarmason
 wrote:
>
> I ran a small test myself on CentOS 7 (3.10) with ext4 data=ordered
> on the tests I thought might do a lot of loose object writes:
>
>   $ GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS="NO_OPENSSL=Y CFLAGS=-O3 -j56" ./run origin/master fsync-on~ fsync-on p3400-rebase.sh p0007-write-cache.sh
>   [...]
>   Test                                                             fsync-on~         fsync-on
>   -----------------------------------------------------------------------------------------------------
>   3400.2: rebase on top of a lot of unrelated changes              1.45(1.30+0.17)   1.45(1.28+0.20) +0.0%
>   3400.4: rebase a lot of unrelated changes without split-index    4.34(3.71+0.66)   4.33(3.69+0.66) -0.2%
>   3400.6: rebase a lot of unrelated changes with split-index       3.38(2.94+0.47)   3.38(2.93+0.47) +0.0%
>   0007.2: write_locked_index 3 times (3214 files)                  0.01(0.00+0.00)   0.01(0.00+0.00) +0.0%
>
> No impact. However I did my own test of running the test suite 10
> times with/without this patch, and it runs 9% slower:
>
>   fsync-off: avg:21.59 21.50 21.50 21.52 21.53 21.54 21.57 21.59 21.61 21.63 21.95
>    fsync-on: avg:23.43 23.21 23.25 23.26 23.26 23.27 23.32 23.49 23.51 23.83 23.88

That's not the thing you should check.

Now re-do the test while another process writes to a totally unrelated
a huge file (say, do an ISO file copy or something).

That was the thing that several filesystems get completely and
horribly wrong. Generally _particularly_ the logging filesystems that
don't even need the fsync, because they use a single log for
everything (so fsync serializes all the writes, not just the writes to
the one file it's fsync'ing).

The original git design was very much to write each object file
without any syncing, because they don't matter since a new object file
- by definition - isn't really reachable. Then sync before writing the
index file or a new ref.

But things have changed, I'm not arguing that the code shouldn't be
made safe by default. I personally refuse to use rotating media on my
machines anyway, largely exactly because of the fsync() issue (things
like "firefox" started doing fsync on the mysql database for stupid
things, and you'd get huge pauses).

But I do think your benchmark is wrong. The case where only git does
something is not interesting or relevant. It really is "git does
something _and_ somebody else writes something entirely unrelated at
the same time" that matters.

  Linus


Re: git pull

2017-11-19 Thread Linus Torvalds
On Sun, Nov 19, 2017 at 7:37 PM, Junio C Hamano  wrote:
>> ...
>> Which is simple. Just create a .git/hooks/prepare-commit-msg file that 
>> contains
>>
>>   #!/bin/sh
>>   sed -i 's|ssh://gitolite.kernel.org/|git://git.kernel.org/|g' "$1"
>>
>> and make it executable, and git will do that commit message editing for you.
>
> This should work with any recent versions of Git (1.7.4.2 and
> upwards), but it still is a workaround.  Should we mark it as a
> feature request in the Git land to record the URL you typed as-is in
> the builtin/fetch.c::store_updated_refs() function, instead of the
> one that was rewritten by the insteadOf mechanism?

The main problem with the prepare-commit-msg thing is actually that it is
such a nasty hack, and it can change other things than just the remote
name. Maybe "gitolite" might be mentioned in the shortlog of the
merge, and then the sed script comes and edits that part too.

It is really not a huge issue simply because those things don't really
happen in real life, but it's the main thing about that
prepare-commit-msg hook that makes me go "eww, what an ugly hack".

But it's an ugly hack that just happens to work pretty well in practice.

And yes, I looked at alternatives. In fact, I  looked at a couple of
other approaches:

 - the one you mention, namely to remember the original url, and use
that instead

 - the one I'd actually prefer, which is to generalize the whole
"insteadOf" to work in more situations.

Why would I prefer that second one? It turns out that the "original"
isn't actually necessarily what I'd want either. Several people send
me pointers to "https://git.kernel.org/", and I prefer rewriting them
to git:// just to be consistent. And now that people have started
doing the "insteadOf", their pull requests end up containing that
"git@gitolite" version of the URL, so again, I'd actually like to
rewrite the url _anyway_ in the commit message.

For example, for the kernel, the git.kernel.org address is very
common, but it also has a very common path prefix, so almost all pull
messages for the kernel have that

   "git://git.kernel.org/pub/scm/linux/kernel/git/"

part in common, and I have occasionally felt that it's not adding a lot
of value particularly as it shows up in shortlogs and gitk. I could
change my own rules for the first line (instead of the "Merge tag xyz
from git://..." maybe I should just have my human-legible version),
but I have also considered just rewriting the url to something that
shortens that very common thing.

So maybe

   Merge tag 'sound-4.10-rc4' of
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

could be instead

   Merge tag 'sound-4.10-rc4' of git://kernel.org/../tiwai/sound

which would keep the _important_ part, and shorten the boilerplate
part away. But that kind of "insteadOf" would only be for the message,
since the end result isn't actually a "real" URL at all, it's
literally a rewritten shorthand.

Again, I can do all of this with the sed script. But again, it's kind
of wrong to do it on the whole message, when it's really only the url
that it should affect.

So it would potentially be nice to just have a generic "rewrite the
url" model, where you can do it for remote fetches, but you could also
do it for just the commit message, or you could do it for doing pushes
(we do have that "pushinsteadof" already - exactly because you might
want to pull and push from different versions, with the push having to
use ssh).
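
For reference, the existing machinery for the fetch/push side looks
something like this (a sketch, using the kernel.org hosts as examples):

   # anything typed as https://git.kernel.org/... is fetched over git://
   git config --global url."git://git.kernel.org/".insteadOf "https://git.kernel.org/"
   # ...but pushes to git:// URLs go over ssh instead
   git config --global url."ssh://gitolite.kernel.org/".pushInsteadOf "git://git.kernel.org/"

What's missing is a third variant of that which only affects the URL
that ends up in the merge message.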

But, as you say:

> It would probably need an update to "struct remote" to have new
> fields, to teach remote.c::alias_all_urls() not to overwrite the
> url[] (and pushurl[] merely for symmetry) fields, to add a field to
> "struct transport" and teach transport.c::transport_get() to record
> the original URL in it so that builtin/fetch.c::fetch_refs() can
> give it to store_updated_refs() instead of the rewritten one.

Yes, the existing "insteadOf" is actually hidden surprisingly deep in
the remote code, and it's very non-obvious. That works ok for the pull
and push case, but really not for just the message rewriting case
(which doesn't happen as part of the pull, but as part of the "git
merge" that then takes the FETCH_HEAD or MERGE_HEAD that contains the
url, and parses that).

Anyway, it's not a big deal. The sed script works. It's a bit hacky,
but it does the job.

I might have wished for a different model, but it's almost certainly
not worth the effort unless somebody really gets fired up about this
and decides they really want to do it.

Linus


Re: RFC v3: Another proposed hash function transition plan

2017-10-02 Thread Linus Torvalds
On Mon, Oct 2, 2017 at 7:00 AM, Jason Cooper  wrote:
>
> Ahhh, so if I understand you correctly, you'd prefer SHA-256 over
> SHA3-256 because it's more performant for your usecase?  Well, that's a
> completely different animal than cryptographic suitability.

In almost all loads I've seen, zlib inflate() cost is a bigger deal
than the crypto load. The crypto people talk about cycles per byte,
but the deflate code is what usually takes the page faults and cache
misses etc, and has bad branch prediction. That ends up easily being
tens of thousands of cycles, even for small data.

But it does obviously depend on exactly what you do. The Windows
people saw SHA1 as costly mainly due to the index file (which is just
a "fancy crc", and not even cryptographically important, and where the
cache misses actually happen when doing crypto, not decompressing the
data).

And fsck and big initial checkins can have a very different profile
than most "regular use" profiles. Again, there the crypto happens
first, and takes the cache misses. And the crypto is almost certainly
_much_ cheaper than just the act of loading the index file contents in
the first place. It may show up on profiles fairly clearly, but that's
mostly because crypto is *intensive*, not because crypto takes up most
of the cycles.

End result: honestly, the real cost on almost any load is not crypto
or necessarily even (de)compression, even if those are the things that
show up. It's the cache misses and the "get data into user space"
(whether using "read()" or page faulting). Worrying about cycles per
byte of compression speed is almost certainly missing the real issue.

The people who benchmark cryptography tend to intentionally avoid the
actual real work, because they just want to know the crypto costs. So
when you see numbers like "9 cycles per byte" vs "12 cycles per byte"
and think that it's a big deal - 30% performance difference! -  it's
almost certainly complete garbage. It may be 30%, but it is likely 30%
out of 10% total, meaning that it's almost in the noise for any but
some very special case.

 Linus


Re: [PATCH] diff: correct newline in summary for renamed files

2017-09-27 Thread Linus Torvalds
On Wed, Sep 27, 2017 at 2:58 PM, Stefan Beller  wrote:
>
>  Linus, I assumed your sign off for the original patch. Thanks for spotting.
>
>  Adding the mode change to t4016 seems like the easiest way to test it.

Looks good to me, and you don't need to give me authorship credit.
Just a "Reported-by:" is fine by me.

But yes, you can also consider the patch signed off by me, although
with your test update, _most_ of the patch is yours so it feels kind
of stupid to mark me as author.

  Linus


Re: diffstat summary mode change bug

2017-09-27 Thread Linus Torvalds
On Wed, Sep 27, 2017 at 1:40 PM, Stefan Beller  wrote:
>
> I disagree with this analysis, as the fix you propose adds the
> new line unconditionally, i.e. this code path would be broken
> regardless of "show filename or not".

Right. Because it is what we want.

The old code (before that commit) used to have two different cases:

fprintf(file, "%s mode change %06o => %06o%c",
line_prefix, p->one->mode,
p->two->mode, show_name ? ' ' : '\n');

ie if "show_name" was set, it would *not* print a newline, and print a
space instead.

But then on the very next line, it used to do:

if (show_name) {
write_name_quoted(p->two->path, file, '\n');

ie now it prints the filename, and then prints the newline.

End result: it used to *always* print the newline. Either it printed
it at the end of the mode (for the non-show_name case), or it printed
it at the end of the filename (for the show_name case).

Your patch removed the '\n' entirely.

My patch makes it unconditional, which it was before your patch (it
was "conditional" only in where it was printed, not _whether_ it was
printed).

> I wonder why our tests failed to tell us about this.
>
> Specifically we have t4100/t-apply-4.expect
>   mode change 100644 => 100755 t/t-basic.sh
>   mode change 100644 => 100755 t/test-lib.sh
> which would seem to exercise this code path.

That only tests "git apply --stat --summary".

It doesn't test "git diff" at all.

And the "mode change" printout is entirely different code (see apply.c
vs diff.c).

 Linus


diffstat summary mode change bug

2017-09-27 Thread Linus Torvalds
Current git shows file-mode changes incorrectly in the diffstat
summary, as I just noted from a pull request I did on the kernel.

The pull request *should* have resulted in a summary like this:

   ...
   21 files changed, 247 insertions(+), 67 deletions(-)
   mode change 100644 => 100755 tools/testing/selftests/memfd/run_tests.sh
   create mode 100644 tools/testing/selftests/net/reuseaddr_conflict.c

but instead that "mode change" line didn't have a newline at the end, and I got

   ...
   21 files changed, 247 insertions(+), 67 deletions(-)
   mode change 100644 => 100755
tools/testing/selftests/memfd/run_tests.sh create mode 100644
tools/testing/selftests/net/reuseaddr_conflict.c

(ok, linewrapping in this email may make that look wrong - but the
"mode change" line and the "create mode" are both on the same line.)

Bisecting it got me:

  146fdb0dfe445464fa438f3835557c58a01d85d7 is the first bad commit
  commit 146fdb0dfe445464fa438f3835557c58a01d85d7
  Author: Stefan Beller 
  Date:   Thu Jun 29 17:07:05 2017 -0700

  diff.c: emit_diff_symbol learns about DIFF_SYMBOL_SUMMARY

  Signed-off-by: Stefan Beller 
  Signed-off-by: Junio C Hamano 

and the reason seems to be that the '\n' at the end got dropped as the
old code was very confusing (the old code had two different '\n' cases
for the "show filename or not").

I think the right fix is this whitespace-damaged trivial one-liner:

  diff --git a/diff.c b/diff.c
  index 3c6a3e0fa..653bb2e72 100644
  --- a/diff.c
  +++ b/diff.c
  @@ -5272,6 +5272,7 @@ static void show_mode_change(struct diff_options *opt, struct diff_filepair *p,
  strbuf_addch(&sb, ' ');
  quote_c_style(p->two->path, &sb, NULL, 0);
  }
  +   strbuf_addch(&sb, '\n');
  emit_diff_symbol(opt, DIFF_SYMBOL_SUMMARY,
   sb.buf, sb.len, 0);
  strbuf_release(&sb);

but somebody should double-check that.

Linus


Re: RFC v3: Another proposed hash function transition plan

2017-09-13 Thread Linus Torvalds
On Wed, Sep 13, 2017 at 6:43 AM, demerphq  wrote:
>
> SHA3 however uses a completely different design where it mixes a 1088
> bit block into a 1600 bit state, for a leverage of 2:3, and the excess
> is *preserved between each block*.

Yes. And considering that the SHA1 attack was actually predicated on
the fact that each block was independent (no extra state between), I
do think SHA3 is a better model.

So I'd rather see SHA3-256 than SHA256.

  Linus


Re: BUG: attempt to trim too many characters

2017-09-12 Thread Linus Torvalds
Just reminding people that this issue would seem to still exist in
current master..

It's trivial to show:

   [torvalds@i7 git]$ git bisect start
   [torvalds@i7 git]$ git bisect bad master
   [torvalds@i7 git]$ git bisect good master~5
   Bisecting: 23 revisions left to test after this (roughly 5 steps)
   [f35a1d75b5ecefaef7b1a8ec55543c82883df82f] Merge branch
'rs/t3700-clean-leftover' into maint
   [torvalds@i7 git]$ git rev-parse --bisect
   fatal: BUG: attempt to trim too many characters

(Note: I use "git rev-parse --bisect" to show it as an error on the
command line, but normal people would obviously do "gitk --bisect"
that then does that "git rev-parse" internally and shows a UI error
box instead).

  Linus

On Tue, Sep 5, 2017 at 3:03 PM, Jeff King <p...@peff.net> wrote:
> On Tue, Sep 05, 2017 at 02:55:08PM -0700, Linus Torvalds wrote:
>
>> On Tue, Sep 5, 2017 at 2:50 PM, Jeff King <p...@peff.net> wrote:
>> >
>> > What version of git are you running? This should be fixed by 03df567fbf
>> > (for_each_bisect_ref(): don't trim refnames, 2017-06-18) which is in
>> > v2.14.
>>
>> I'm way more recent than 2.14.
>>
>> I'm at commit 238e487ea ("The fifth batch post 2.14")
>
> Ugh. Bitten again by the fact that rev-parse and revision.c implement
> the same things in subtly different ways.
>
> This probably fixes it:
>
> diff --git a/builtin/rev-parse.c b/builtin/rev-parse.c
> index 2bd28d3c08..9f24004c0a 100644
> --- a/builtin/rev-parse.c
> +++ b/builtin/rev-parse.c
> @@ -757,8 +757,8 @@ int cmd_rev_parse(int argc, const char **argv, const char *prefix)
> continue;
> }
> if (!strcmp(arg, "--bisect")) {
> -   for_each_ref_in("refs/bisect/bad", show_reference, NULL);
> -   for_each_ref_in("refs/bisect/good", anti_reference, NULL);
> +   for_each_fullref_in("refs/bisect/bad", show_reference, NULL, 0);
> +   for_each_fullref_in("refs/bisect/good", anti_reference, NULL, 0);
> continue;
> }
> if (opt_with_value(arg, "--branches", )) {


Re: BUG: attempt to trim too many characters

2017-09-05 Thread Linus Torvalds
On Tue, Sep 5, 2017 at 3:03 PM, Jeff King  wrote:
>
> This probably fixes it:

Yup. Thanks.

   Linus


Re: BUG: attempt to trim too many characters

2017-09-05 Thread Linus Torvalds
On Tue, Sep 5, 2017 at 2:50 PM, Jeff King  wrote:
>
> What version of git are you running? This should be fixed by 03df567fbf
> (for_each_bisect_ref(): don't trim refnames, 2017-06-18) which is in
> v2.14.

I'm way more recent than 2.14.

I'm at commit 238e487ea ("The fifth batch post 2.14")

  Linus


BUG: attempt to trim too many characters

2017-09-05 Thread Linus Torvalds
I just got this with

   gitk --bisect

while doing some bisection on my current kernel.

It happens with "git rev-parse --bisect" too, but interestingly, "git
log --bisect" works fine.

I have not tried to figure out anything further, except that it was
introduced by commit b9c8e7f2f ("prefix_ref_iterator: don't trim too
much") and it happens with

   iter->iter0->refname = "refs/bisect/bad"
   iter->trim = 15

(where 15 is also the same as the length of that refname).

Not having time to chase this down any more, since I'm busy with the
merge window (and that annoying bisect)

  Linus


Re: [PATCH] strbuf: let strbuf_addftime handle %z and %Z itself

2017-06-13 Thread Linus Torvalds
On Mon, Jun 12, 2017 at 2:10 PM, Jeff King  wrote:
>
> I suspect nobody has complained because we generally encourage real
> "-0800" names when specifying zones.

That's what any sane person uses, and it's what SMTP requires.

The timezone names are a (bad) joke. If a human can't understand them
(and people don't), there's no point in thinking that a computer
should either.

So yes, timezones should strive very much to be numeric. The names are
an incomprehensible mess. Don't rely on them, and try to avoid using
them.
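
For git specifically, the numeric form is what you want to feed it
anyway - a sketch:

   GIT_AUTHOR_DATE="2017-06-13 10:00:00 -0700" \
   GIT_COMMITTER_DATE="2017-06-13 10:00:00 -0700" \
   git commit --allow-empty -m "numeric timezone example"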

  Linus


Re: Unaligned accesses in sha1dc

2017-06-02 Thread Linus Torvalds
On Fri, Jun 2, 2017 at 1:17 PM, demerphq  wrote:
> Most hash function implementations have code like the following
> (extracted and reduced from hv_macro.h in perl.git [which only
> supports little-endian hash functions]):

Yes.

Please do *not* try to make things overly portable by adding random
memcpy() functions.

Yes, many compilers will do ok with it, and generate the right code
(single load from an unaligned address). But many won't. Gcc
completely screws it up in older versions on ARM, for example.

Dereferencing an unaligned pointer may be "undefined" in some
technical meaning, but it sure as hell isn't undefined in reality, and
compilers that willfully do stupid things should not be catered to
overly. Reality is a lot more important.

And I think gcc can be tweaked to use "movbe" on x86 with the right
magic incantation (ie not just __builtin_bswap32(), but also the
switch to enable the new instructions).  So having code to detect gcc
and using __builtin_bswap32() would indeed be a good thing.

 Linus


Re: git merges of tags

2017-05-19 Thread Linus Torvalds
On Thu, May 18, 2017 at 4:23 PM, Stephen Rothwell  wrote:
>
> Just a reminder that if you are merging Linus' tree (or any tree
> really) via a tag, git was changed some time ago so that merging a tag
> will not do a fast forward (there is a good reason for this - I just
> can't recall it ATM).

The reason is that when you merge a signed tag, git squirrels away the
signature into the merge commit, so that you can see and verify the
signature later (use "git log --show-signature" to see the signatures
on the commits).

If you fast-forward, there isn't any new commit to add the signing data to.

> To do the fast forward, try "git merge <tag>^{}" ...

A slightly simpler syntax might be just "tag^0", but yes, the "^{}"
thing peels off any tags.

>   (unfortunately
> doing "git merge --ff " also does not do a fast forward - it also
> doesn't fail, it unexpectedly just creates a merge commit :-().

"--ff" is the default behavior, and means "allow fast forward", but
note that it is about "allowing", not "forcing".

You can use "--ff-only" to say that you will _only_ accept a
fast-forward, and git will error out  if it needs to create a merge.
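
Putting those together, with a hypothetical tag name:

   git merge v4.17-rc1              # merging the tag object: defaults to a real merge commit
   git merge v4.17-rc1^0            # peel the tag first: fast-forward is possible
   git merge --ff-only v4.17-rc1^0  # and this refuses anything but a fast-forward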

  Linus


Re: [TANGENT] run-command: use vfork instead of fork

2017-05-16 Thread Linus Torvalds
On Tue, May 16, 2017 at 12:35 PM, Eric Wong  wrote:
>
> Fwiw, most of the vfork preparation was already done by Brandon
> and myself a few weeks ago, and cooking in pu.

Oh, interesting. Was that done for vfork(), or is it for something
else? Some of the changes seem almost overly careful. Is this for
prep-work porting to some odd environment that doesn't really have an
MMU at all? There's nothing fundamentally wrong with allocating memory
after fork().

But yes, it looks like it helps the vfork case.

   Linus


Re: git rebase regression: cannot pass a shell expression directly to --exec

2017-05-16 Thread Linus Torvalds
On Tue, May 16, 2017 at 1:12 PM, Johannes Schindelin
 wrote:
>>
>> I think it would be better to just
>>
>>  (a) get rid of the magic strcspn() entirely
>>
>>  (b) make the 'can we optimize this' test be simply just looking up
>> 'argv[0]' in $PATH
>
> What about
>
> ABC=1 my-executable my-arg

What about it?

Do you have a binary like

'/usr/bin/ABC=1 my-executable my-arg'

or something? If so, it gets executed. If not, it would get passed off
to "/bin/sh".

That sounds a lot more obvious than "some random set of characters are magical".

  Linus


Re: git rebase regression: cannot pass a shell expression directly to --exec

2017-05-16 Thread Linus Torvalds
On Tue, May 16, 2017 at 10:23 AM, Jeff King  wrote:
>
> I think the logic here would be more like:
>
>   1. During prepare_shell_cmd(), even if we optimize out the shell call,
>  still prepare a fallback argv (since we can't allocate memory
>  post-fork).
>
>   2. In the forked child, if we get ENOENT from exec and cmd->use_shell
>  is set, then exec the fallback shell argv instead. Propagate its
>  results, even if it's 127.
>
> That still means we'd prefer a $PATH copy of a command to its shell
> builtin variant, but that can't be helped (and I kind of doubt anybody
> would care too much).

I think it would be better to just

 (a) get rid of the magic strcspn() entirely

 (b) make the 'can we optimize this' test be simply just looking up
'argv[0]' in $PATH

Hmm? If you have executables with special characters in them in your
PATH, you have bigger issues.

Also, if people really want to optimize the code that executes an
external program (whether in shell or directly), I think it might be
worth it to look at replacing the "fork()" with a "vfork()".

Something like this

-   cmd->pid = fork();
+   cmd->pid = (cmd->git_cmd || cmd->env) ? fork() : vfork();

might work (the native git_cmd case needs a real fork, and if we
change the environment variables we need it too, but the other cases
look like they might work with vfork()).

Using vfork() can be hugely more efficient, because you don't have the
extra page table copies and teardown, but also avoid a lot of possible
copy-on-write faults.

 Linus


[PATCH] Fix 'git am' in-body header continuations

2017-04-02 Thread Linus Torvalds

From: Linus Torvalds <torva...@linux-foundation.org>
Date: Sat, 1 Apr 2017 12:14:39 -0700
Subject: [PATCH] Fix 'git am' in-body header continuations

An empty line should stop any pending in-body headers, and start the
actual body parsing.

This also modifies the original test for the in-body headers to actually
have a real commit body that starts with spaces, and changes the test to
check that the long line matches _exactly_, and doesn't get extra data
from the body.

Fixes: 6b4b013f1884 ("mailinfo: handle in-body header continuations")
Cc: Jonathan Tan <jonathanta...@google.com>
Cc: Jeff King <p...@peff.net>
Signed-off-by: Linus Torvalds <torva...@linux-foundation.org>
---

On Sun, 2 Apr 2017, Junio C Hamano wrote:
> 
> And that is exactly your patch does.  The change "feels" correct to
> me.

Ok, resent with the test-case for the original behavior changed to be 
stricter (so it fails without this fix), and with Signed-off lines etc.

I didn't really test the test-case very much, but it seemed to fail 
without this patch (because the "Body test" thing from the body becomes 
part of the long first line), and passes with it.

But somebody who is more used to the test-suite should double-check my 
stupid test edit.

 mailinfo.c| 7 ++-
 t/t4150-am.sh | 6 --
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/mailinfo.c b/mailinfo.c
index a489d9d0f..68037758f 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -757,8 +757,13 @@ static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
assert(!mi->filter_stage);
 
if (mi->header_stage) {
-   if (!line->len || (line->len == 1 && line->buf[0] == '\n'))
+   if (!line->len || (line->len == 1 && line->buf[0] == '\n')) {
+   if (mi->inbody_header_accum.len) {
+   flush_inbody_header_accum(mi);
+   mi->header_stage = 0;
+   }
return 0;
+   }
}
 
if (mi->use_inbody_headers && mi->header_stage) {
diff --git a/t/t4150-am.sh b/t/t4150-am.sh
index 89a5bacac..44807e218 100755
--- a/t/t4150-am.sh
+++ b/t/t4150-am.sh
@@ -983,7 +983,9 @@ test_expect_success 'am works with multi-line in-body headers' '
rm -fr .git/rebase-apply &&
git checkout -f first &&
echo one >> file &&
-   git commit -am "$LONG" --author="$LONG <l...@example.com>" &&
+   git commit -am "$LONG
+
+Body test" --author="$LONG <l...@example.com>" &&
git format-patch --stdout -1 >patch &&
# bump from, date, and subject down to in-body header
perl -lpe "
@@ -997,7 +999,7 @@ test_expect_success 'am works with multi-line in-body headers' '
git am msg &&
# Ensure that the author and full message are present
git cat-file commit HEAD | grep "^author.*l...@example.com" &&
-   git cat-file commit HEAD | grep "^$LONG"
+   git cat-file commit HEAD | grep "^$LONG$"
 '
 
 test_done
-- 
2.12.2.578.g5c4e54f4e



Re: Bug in "git am" when the body starts with spaces

2017-04-01 Thread Linus Torvalds
On Fri, Mar 31, 2017 at 5:52 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> The continuation logic is oddly complex, and I can't follow the logic.
> But it is completely broken in how it thinks empty lines are somehow
> "continuations".

The attached patch seems to work for me. Comments?

The logic is fairly simple: if we encounter an empty line, and we have
pending in-body headers, we flush the pending headers, and mark us as
no longer in header mode.

The code used to just ignore empty lines when in header mode, which is
garbage, exactly because it would happily continue accumulating more
headers.

There may be other cases this misses, but this at least seems to fix
the case I encountered.

Linus
 mailinfo.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mailinfo.c b/mailinfo.c
index a489d9d0f..68037758f 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -757,8 +757,13 @@ static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
assert(!mi->filter_stage);
 
if (mi->header_stage) {
-   if (!line->len || (line->len == 1 && line->buf[0] == '\n'))
+   if (!line->len || (line->len == 1 && line->buf[0] == '\n')) {
+   if (mi->inbody_header_accum.len) {
+   flush_inbody_header_accum(mi);
+   mi->header_stage = 0;
+   }
return 0;
+   }
}
 
if (mi->use_inbody_headers && mi->header_stage) {


Re: Bug in "git am" when the body starts with spaces

2017-03-31 Thread Linus Torvalds
Ok, did a bisect, and this bisects to commit 6b4b013f1884 ("mailinfo:
handle in-body header continuations").

The continuation logic is oddly complex, and I can't follow the logic.
But it is completely broken in how it thinks empty lines are somehow
"continuations".

Jonathan?

 Linus

On Fri, Mar 31, 2017 at 5:24 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> I think the reason is that the "header continuation line" logic kicks
> in because the lines in the body start with spaces, but that's
> entirely incorrect, since
>
>  (a) we're not in an email header
>
>  (b) there's an empty line in between anyway, so no way are those body
> lines continuation lines.
>
> I didn't check how far back this goes, I guess I'll do that next. But
> I thought I'd report it here first in case somebody else goes "ahhh".


Bug in "git am" when the body starts with spaces

2017-03-31 Thread Linus Torvalds
Try applying the attached patch with

   git am 0001-Test-patch.patch

in the git repository.

At least for me, it results in a very odd commit that has one single
line in the commit message:

Test patch This should go in the body not in the subject line

which is obviously bogus.

I think the reason is that the "header continuation line" logic kicks
in because the lines in the body start with spaces, but that's
entirely incorrect, since

 (a) we're not in an email header

 (b) there's an empty line in between anyway, so no way are those body
lines continuation lines.

I didn't check how far back this goes, I guess I'll do that next. But
I thought I'd report it here first in case somebody else goes "ahhh".

Linus
From ad65cf7ba97ac071da1f845ec854165e7bf1efdf Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torva...@linux-foundation.org>
Date: Fri, 31 Mar 2017 17:18:16 -0700
Subject: [PATCH] Test patch example

Subject: [PATCH] Test patch

  This should go in the body
  not in the subject line
---
 Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Makefile b/Makefile
index 9b36068ac..9f36c149b 100644
--- a/Makefile
+++ b/Makefile
@@ -1,3 +1,4 @@
+
 # The default target of this Makefile is...
 all::
 
-- 
2.12.2.401.g5d4234a49



Re: USE_SHA1DC is broken in pu

2017-03-23 Thread Linus Torvalds
On Thu, Mar 23, 2017 at 9:43 AM, Johannes Schindelin
 wrote:
>
> What I am saying is that this should be a more fine-grained, runtime knob.

No it really shouldn't.

> If I write out an index, I should not suffer the slowdown from detecting
> collisions.

The index case is a complete red herring.

As already noted, the proper fix for the index case is to simply do it
asynchronously on read. On write, it's harder to do asynchronously,
but for a 300MB index file you're likely going to be doing IO in the
middle, so it's probably not even noticeable.

But even *if* you want to worry about the index file, it shouldn't be
a runtime knob. The index file is completely different from the other
uses of SHA1DC.

But the fact is, if you don't want SHA1DC, and you have crazy special
cases, you just continue to build with openssl support instead. Nobody
else should ever have to worry about *your* crazy cases.

Linus


Re: [PATCH 0/2] Re-integrate sha1dc

2017-03-16 Thread Linus Torvalds
On Thu, Mar 16, 2017 at 3:04 PM, Jeff King  wrote:
>
> There are a few things I think are worth changing. The die() message
> should mention the sha1 we computed. That will be a big help if an old
> version of git tries to unknowingly push a colliding object to a newer
> version. The user will see "collision on sha1 1234.." which gives them a
> starting point to figure out where they got the bad object from.
>
> And to make that work, we have to disable the safe_hash feature (which
> intentionally corrupts a colliding sha1). We _could_ rip it out
> entirely, but since it only kicks in when we see a collision, I doubt
> it's impacting anything.
>
> I also updated the timings in my commit message, and added a basic test.

No complaints about your version.

   Linus


[PATCH 2/2] Integrate the sha1dc code with the git build

2017-03-16 Thread Linus Torvalds

From: Linus Torvalds <torva...@linux-foundation.org>
Date: Thu, 16 Mar 2017 13:08:38 -0700
Subject: [PATCH 2/2] Integrate the sha1dc code with the git build

This adds the proper magic to actually build the sha1dc code as part of
git when USE_SHA1DC is enabled.

This includes

 - adjusting the sha1dc include directives for git use

 - adding the proper USE_SHA1DC logic to the Makefile

 - adding the SHA1DC case to the "hash.h" header

 - adding the proper "git platform" wrappers for the SHA1 interface

Much of this comes from Jeff King's previous integration effort, with
modifications for the new world order of hash.h.

Signed-off-by: Linus Torvalds <torva...@linux-foundation.org>
---
 Makefile   | 10 ++
 hash.h |  2 ++
 sha1dc/sha1.c  | 33 +
 sha1dc/sha1.h  | 18 --
 sha1dc/ubc_check.c |  4 ++--
 sha1dc/ubc_check.h |  2 --
 6 files changed, 55 insertions(+), 14 deletions(-)

diff --git a/Makefile b/Makefile
index a5a11e721..186ce17f2 100644
--- a/Makefile
+++ b/Makefile
@@ -140,6 +140,10 @@ all::
 # Define PPC_SHA1 environment variable when running make to make use of
 # a bundled SHA1 routine optimized for PowerPC.
 #
+# Define USE_SHA1DC to unconditionally enable the collision-detecting sha1
+# algorithm. This is slower, but may detect attempted collision attacks.
+# Takes priority over other *_SHA1 knobs.
+#
 # Define SHA1_MAX_BLOCK_SIZE to limit the amount of data that will be hashed
 # in one call to the platform's SHA1_Update(). e.g. APPLE_COMMON_CRYPTO
 # wants 'SHA1_MAX_BLOCK_SIZE=1024L*1024L*1024L' defined.
@@ -1383,6 +1387,11 @@ ifdef APPLE_COMMON_CRYPTO
SHA1_MAX_BLOCK_SIZE = 1024L*1024L*1024L
 endif
 
+ifdef USE_SHA1DC
+   LIB_OBJS += sha1dc/sha1.o
+   LIB_OBJS += sha1dc/ubc_check.o
+   BASIC_CFLAGS += -DSHA1DC
+else
 ifdef BLK_SHA1
LIB_OBJS += block-sha1/sha1.o
BASIC_CFLAGS += -DSHA1_BLK
@@ -1400,6 +1409,7 @@ else
 endif
 endif
 endif
+endif
 
 ifdef SHA1_MAX_BLOCK_SIZE
LIB_OBJS += compat/sha1-chunked.o
diff --git a/hash.h b/hash.h
index f0d9ddd0c..b9e7e34fc 100644
--- a/hash.h
+++ b/hash.h
@@ -3,6 +3,8 @@
 
 #if defined(SHA1_PPC)
 #include "ppc/sha1.h"
+#elif defined(SHA1DC)
+#include "sha1dc/sha1.h"
 #elif defined(SHA1_APPLE)
 #include 
 #elif defined(SHA1_OPENSSL)
diff --git a/sha1dc/sha1.c b/sha1dc/sha1.c
index 8d12b832b..76a8d1a85 100644
--- a/sha1dc/sha1.c
+++ b/sha1dc/sha1.c
@@ -5,14 +5,9 @@
 * https://opensource.org/licenses/MIT
 ***/
 
-#include 
-#include 
-#include 
-#include 
-
-#include "sha1.h"
-#include "ubc_check.h"
-
+#include "git-compat-util.h"
+#include "sha1dc/sha1.h"
+#include "sha1dc/ubc_check.h"
 
 /* 
Because Little-Endian architectures are most common,
@@ -1790,3 +1785,25 @@ int SHA1DCFinal(unsigned char output[20], SHA1_CTX *ctx)
output[19] = (unsigned char)(ctx->ihv[4]);
return ctx->found_collision;
 }
+
+static const char collision_message[] =
+"The SHA1 computation detected evidence of a collision attack;\n"
+"refusing to process the contents.";
+
+void git_SHA1DCFinal(unsigned char hash[20], SHA1_CTX *ctx)
+{
+   if (SHA1DCFinal(hash, ctx))
+   die(collision_message);
+}
+
+void git_SHA1DCUpdate(SHA1_CTX *ctx, const void *vdata, unsigned long len)
+{
+   const char *data = vdata;
+   /* We expect an unsigned long, but sha1dc only takes an int */
+   while (len > INT_MAX) {
+   SHA1DCUpdate(ctx, data, INT_MAX);
+   data += INT_MAX;
+   len -= INT_MAX;
+   }
+   SHA1DCUpdate(ctx, data, len);
+}
diff --git a/sha1dc/sha1.h b/sha1dc/sha1.h
index e867724c0..37ee415da 100644
--- a/sha1dc/sha1.h
+++ b/sha1dc/sha1.h
@@ -9,8 +9,6 @@
 extern "C" {
 #endif
 
-#include 
-
/* uses SHA-1 message expansion to expand the first 16 words of W[] to 80 words */
 /* void sha1_message_expansion(uint32_t W[80]); */
 
@@ -100,6 +98,22 @@ void SHA1DCUpdate(SHA1_CTX*, const char*, size_t);
/* returns: 0 = no collision detected, otherwise = collision found => warn user for active attack */
 int  SHA1DCFinal(unsigned char[20], SHA1_CTX*); 
 
+
+/*
+ * Same as SHA1DCFinal, but convert collision attack case into a verbose die().
+ */
+void git_SHA1DCFinal(unsigned char [20], SHA1_CTX *);
+
+/*
+ * Same as SHA1DCUpdate, but adjust types to match git's usual interface.
+ */
+void git_SHA1DCUpdate(SHA1_CTX *ctx, const void *data, unsigned long len);
+
+#define platform_SHA_CTX SHA1_CTX
+#define platform_SHA1_Init SHA1DCInit
+#define platform_SHA1_Update git_SHA1DCUpdate
+#define platform_SHA1_Final git_SHA1DCFinal
+
 #if defined(__cplusplus)
 }
 #endif
diff --git a/sha1dc/ubc_check.c b/sha1dc/ubc_check.c
index 27d0976da..089dd4743 100644
--- a/sha1dc/ubc_check.c
+++ b/sha1dc/ubc_check.c
@@ -24,8 +24,8 @@
 // ubc_check has been ve

Re: USE_SHA1DC is broken in pu

2017-03-16 Thread Linus Torvalds
On Thu, Mar 16, 2017 at 12:51 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> I'll send a patch on top of 'next', which already has the header file changes.

Patches sent. It all looked fairly straightforward to me, but maybe I
missed something.

 Linus


[PATCH 0/2] Re-integrate sha1dc

2017-03-16 Thread Linus Torvalds

I suspect the first patch will not make it to the list since it's over 
100kB in size, but oh well.. Junio and Jeff will see it.

This is sent as two patches, just to have the original upstream code as a 
first step, and then the second patch does the small modifications to 
integrate it with git.

It "WorksForMe(tm)" and the integration patches are now fairly trivial, 
since upstream already did the dieting and some of the semantic changes to 
git's more traditional C code.

I did leave the C++ wrapper lines that the sha1dc header files have grown 
in the meantime; I debated removing them, but felt that "closer to 
upstream" was worth it.

   Linus


Re: USE_SHA1DC is broken in pu

2017-03-16 Thread Linus Torvalds
On Thu, Mar 16, 2017 at 12:46 PM, Junio C Hamano  wrote:
>
> That's easy to answer.  What we have on 'pu' is a fair game for
> wholesale replacement.  That is the whole point of not merging
> topics in flux to 'next' and declaring that 'pu' will constantly
> rewind.

Ok.

I'll send a patch on top of 'next', which already has the header file changes.

 Linus


Re: USE_SHA1DC is broken in pu

2017-03-16 Thread Linus Torvalds
On Thu, Mar 16, 2017 at 12:41 PM, Jeff King  wrote:
>
> Potentially we should just eject sha1dc from "pu" for the moment. It
> needs re-rolled with the most recent version of the collision library
> (and I see Marc just posted that they hit a stable point, which is
> perhaps why you're looking at it).

I looked at it, and even created a patch, and then decided that you'd
probably do it.

But yes, re-integrating entirely (rather than creating a patch against
the previous SHA1DC) might be simpler.

Junio, which way do you want to go?

 Linus


USE_SHA1DC is broken in pu

2017-03-16 Thread Linus Torvalds
I think there's a semantic merge error and it clashes with
f18f816cb158 ("hash.h: move SHA-1 implementation selection into a
header file").

Suggested possible merge resolution attached.

   Linus
 Makefile | 2 +-
 hash.h   | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 9d1d958bd..186ce17f2 100644
--- a/Makefile
+++ b/Makefile
@@ -1388,9 +1388,9 @@ ifdef APPLE_COMMON_CRYPTO
 endif
 
 ifdef USE_SHA1DC
-   SHA1_HEADER = "sha1dc/sha1.h"
LIB_OBJS += sha1dc/sha1.o
LIB_OBJS += sha1dc/ubc_check.o
+   BASIC_CFLAGS += -DSHA1DC
 else
 ifdef BLK_SHA1
LIB_OBJS += block-sha1/sha1.o
diff --git a/hash.h b/hash.h
index f0d9ddd0c..b7f4f1fd8 100644
--- a/hash.h
+++ b/hash.h
@@ -7,6 +7,8 @@
 #include 
 #elif defined(SHA1_OPENSSL)
 #include 
+#elif defined(SHA1DC)
+#include "sha1dc/sha1.h"
 #else /* SHA1_BLK */
 #include "block-sha1/sha1.h"
 #endif


Re: [PATCH] Put sha1dc on a diet

2017-03-13 Thread Linus Torvalds
On Mon, Mar 13, 2017 at 1:12 PM, Marc Stevens  wrote:
> Indeed, I've committed a fix, and a small bug fix for the new code just now.

Unrelated side note: there may be some missing dependencies in the
build infrastructure or something, because when I tried Jeff's script
that did that "test the merge and the two parents", and used the
pack-file of the kernel for testing, I got:

  5611971c610143e6d38bbdca463f4c9f79a056a0
/home/torvalds/v2.6/linux/.git/objects/pack/pack-153bb8cd11846cf9a27ef7b1069aa9cb9f5b724f.pack

  real 0m2.432s
  user 0m2.348s
  sys 0m0.084s

  5611971c610143e6d38bbdca463f4c9f79a056a0
/home/torvalds/v2.6/linux/.git/objects/pack/pack-153bb8cd11846cf9a27ef7b1069aa9cb9f5b724f.pack

  real 0m3.747s
  user 0m3.672s
  sys 0m0.076s

  5611971c610143e6d38bbdca463f4c9f79a056a0 *coll*
/home/torvalds/v2.6/linux/.git/objects/pack/pack-153bb8cd11846cf9a27ef7b1069aa9cb9f5b724f.pack

  real 0m5.061s
  user 0m4.984s
  sys 0m0.077s

never mind the performance, notice the *coll* in that last case.

But doing a "git clean -dqfx; make -j8" and re-testing the same tree,
the issue is gone.

I suspect some dependency on a header file is broken, causing some
object file to not be properly re-built, which in turn then
incorrectly causes the 'ctx2.found_collision' test to test the wrong
bit or something.

 Linus


Re: Stable GnuPG interface, git should use GPGME

2017-03-10 Thread Linus Torvalds
On Fri, Mar 10, 2017 at 2:00 AM, Bernhard E. Reiter
 wrote:
>
> git uses an pipe-and-exec approach to running a GnuPG binary
> as writen in the documentation [1]:
>
> gpg.program
>Use this custom program instead of "gpg" found on $PATH when making
>or verifying a PGP signature. The program must support the same
>command-line interface as GPG
>
> please consider using libgpgme interfacing to GnuPG, because the gpg
> command-line interface is not considered an official API to GnuPG by the
> GnuPG-devs and thus potentially unstable.

Quite frankly, I will NAK this just based on previous bad experiences
with using "helpful" libraries.

Maybe you can lay my worries to rest, but the problems with libraries
in this context tend to be

 - hard to personalize.

   At least right now, we just allow people to set their gpg binary.
I'm betting that the library would pick whatever random preferred
version, and in the process possibly screw us.

   Example: what if somebody is actually using another pgp
implementation entirely for some reason, and is just scripting around
it?

   Or maybe he's using the regular gnupg, but using different keys for
different projects (using "--homedir"). That's trivial with the
current model. How trivial is that with a library?

 - existing configuration

   This is the main problem I've seen in the past. Using the "ssh"
_program_ is easy. You add your keys, your config files, whatever, and
it "just works" (or rather, you fight it once and it definitely
doesn't "just" work, but then you copy your .ssh directory around for
the rest of your life and forget how it ever worked, but it does).

   Using "libssh2" is an exercise in futility, and you have to do a
crazy amount of stupid "look up keys" and simple configuration in your
.ssh/config (like per-host keys, hostname swizzling etc) just don't
pick up the configurations you already did for the program.

 - UI

   For things like gpg, the UI is traditionally horrible. But there
tends to be various things like password agents that help with caching
passwords and/or add a graphical UI to get the password when
necessary.

 - library versioning.

   I don't know why, but I've never *ever* met a library developer who
realized that libraries were all about stable API's, and the library
users don't want to fight different versions.

   And to make matters worse, the different versions (particularly if
you end up having to use a development version due to bugs or required
features etc) are always made horribly bad to even detect at
built-time automatically with simple #ifdef etc, so now you have to do
autoconf crap etc.

Now, it may be that the gpgme library "just works" across
architectures and handles all of the above situations as gracefully as
the external program does. In that case - but _ONLY_ in that case -
would a switch-over to the library possibly be a good thing.

I'd be pleasantly surprised. But I *would* be surprised, because every
time I've seen that "library vs program" model, I've seen the above
issues.

In fact, we have those exact issues very much in git itself too. Yes,
I've used libgit2 (for subsurface). It's a pain in the arse to do
*exactly* the above kinds of things, and the thing is, that isn't
git-specific.

So I'm very down on using external libraries unless they are stable
and have no need for configuration etc. Things like zlib is fine -
there just isn't much to configure outside of the "just how hard do
you want me to try to compress". Nobody has a ".zlib/config" file that
you need to worry about accessing etc.

Of course, maybe gpgme is a world first, and actually does read your
.gnupg/config file trivially, and has all the gpg agent integration
that it picks up automatically, and allows various per-user
configurations, and all my worries are bogus.

But that would literally be the first time I've ever seen that.

   Linus


Re: RFC: Another proposed hash function transition plan

2017-03-07 Thread Linus Torvalds
On Tue, Mar 7, 2017 at 10:57 AM, Ian Jackson
 wrote:
>
> Also I think you need to specify how abbreviated object names are
> interpreted.

One option might be to not use hex for the new hash, but base64 encoding.

That would make the full size ASCII hash encoding length roughly
similar (43 base64 characters rather than 40), which would offset some
of the new costs (longer filenames in the loose format, for example).

Also, since 256 isn't evenly divisible by 6, and because you'd want
some way to explictly disambiguate the new hashes, the rule *could* be
that the ASCII representation of a new hash is the base64 encoding of
the 258-bit value that has "10" prepended to it as padding.

That way the first character of the hash would be guaranteed to not be
a hex digit, because it would be in the range [g-v] (indexes 32..47).

Of course, the downside is that base64 encoded hashes can also end up
looking very much like real words, and now case would matter too.

The "use base64 with a "10" two-bit padding prepended" also means that
the natural loose format radix format would remain the first 2
characters of the hash, but due to the first character containing the
padding, it would be a fan-out of 2**10 rather than 2**12.
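
As a sanity check of the arithmetic, here is a small standalone sketch
(encode_hash() is a made-up helper, not a proposed git interface): it
prepends the two-bit "10" padding to a 256-bit value and base64-encodes
the resulting 258 bits into exactly 43 characters, and the first
character always lands in the [g-v] range because its top two bits are
the fixed "10" pattern.

#include <stdio.h>
#include <stdint.h>

static const char b64[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* 2 padding bits + 256 hash bits = 258 bits = 43 six-bit groups */
static void encode_hash(const unsigned char hash[32], char out[44])
{
        uint64_t acc = 2;       /* the "10" padding bits */
        int bits = 2, i, o = 0;

        for (i = 0; i < 32; i++) {
                acc = (acc << 8) | hash[i];
                bits += 8;
                while (bits >= 6) {
                        bits -= 6;
                        out[o++] = b64[(acc >> bits) & 0x3f];
                }
        }
        out[o] = '\0';          /* o == 43, bits == 0 */
}

int main(void)
{
        unsigned char hash[32] = { 0xde, 0xad, 0xbe, 0xef };    /* stand-in */
        char out[44];

        encode_hash(hash, out);
        printf("%s\n", out);    /* first character is in [g-v] */
        return 0;
}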

Of course, having written that, I now realize how it would cause
problems for the usual shit-for-brains case-insensitive filesystems.
So I guess base64 encoding doesn't work well for that reason.

Linus


Re: RFC: Another proposed hash function transition plan

2017-03-06 Thread Linus Torvalds
On Mon, Mar 6, 2017 at 10:39 AM, Jonathan Tan  wrote:
>
> I think "nohash" can be explained in 2 points:

I do think that that was my least favorite part of the suggestion. Not
just "nohash", but all the special "hash" lines too.

I would honestly hope that the design should not be about "other
hashes". If you plan your expectations around the new hash being
broken, something is wrong to begin with.

I do wonder if things wouldn't be simpler if the new format just
included the SHA1 object name in the new object. Put it in the
"header" line of the object, so that every time you look up an object,
you just _see_ the SHA1 of that object. You can even think of it as an
additional protection.

Btw, the multi-collision attack referenced earlier does _not_ work for
an iterated hash that has a bigger internal state than the final hash.
Which is actually a real argument against sha-256: the internal state
of sha-256 is 256 bits, so if an attack can find collisions due to
some weakness, you really can then generate exponential collisions by
chaining a linear collision search together.

But for sha3-256 or blake2, the internal hash state is larger than the
final hash, so now you need to generate collisions not in the 256
bits, but in the much larger search space of the internal hash space
if you want to generate those exponential collisions.

So *if* the new object format uses a git header line like

"blob  \0"

then it would inherently contain that mapping from 256-bit hash to the
SHA1, but it would actually also protect against attacks on the new
hash. In fact, in particular for objects with internal format that
differs between the two hashing models (ie trees and commits which to
some degree are higher-value targets), it would make attacks really
quite complicated, I suspect.

And you wouldn't need those "hash" or "nohash" things at all. The old
SHA1 would simply always be there, and cheap to look up (ie you
wouldn't have to unpack the whole object).

Hmm?

   Linus


Re: Delta compression not so effective

2017-03-05 Thread Linus Torvalds
On Sat, Mar 4, 2017 at 12:27 AM, Marius Storm-Olsen  wrote:
>
> I reran the repack with the options above (dropping the zlib=9, as you
> suggested)
>
> $ time git -c pack.threads=4 repack -a -d -F \
>--window=350 --depth=250 --window-memory=30g
>
> and ended up with
> $ du -sh .
> 205G.
>
> In other words, going from 6G to 30G window didn't help a lick on finding
> deltas for those binaries.

Ok.

> I did
> fprintf(stderr, "%s %u %lu\n",
> sha1_to_hex(delta_list[i]->idx.sha1),
> delta_list[i]->hash,
> delta_list[i]->size);
>
> I assume that's correct?

Looks good.

> I've removed all commit messages, and "sanitized" some filepaths etc, so
> name hashes won't match what's reported, but that should be fine. (the
> object_entry->hash seems to be just a trivial uint32 hash for sorting
> anyways)

Yes. I see your name list and your pack-file index.

> BUT, if I look at the last 3 entries of the sorted git verify-pack output,
> and look for them in the 'git log --oneline --raw -R --abbrev=40' output, I
> get:
...
> while I cannot find ANY of them in the delta_list output?? \

Yes. You have a lot of of object names in that log file you sent in
private that aren't in the delta list.

Now, objects smaller than 50 bytes we don't ever try to even delta. I
can't see the object sizes when they don't show up in the delta list,
but looking at some of those filenames I'd expect them to not fall in
that category.

I guess you could do the printout a bit earlier (on the
"to_pack.objects[]" array - to_pack.nr_objects is the count there).
That should show all of them. But the small objects shouldn't matter.

But if you have a file like

   extern/win/FlammableV3/x64/lib/FlameProxyLibD.lib

I would have assumed that it has a size that is > 50. Unless those
"extern" things are placeholders?

> You might get an idea for how to easily create a repo which reproduces the
> issue, and which would highlight it more easily for the ML.

Looking at your sorted object list ready for packing, it doesn't look
horrible. When sorting for size, it still shows a lot of those large
files with the same name hash, so they sorted together in that form
too.

I do wonder if your dll data just simply is absolutely horrible for
xdelta. We've also limited the delta finding a bit, simply because it
had some O(m*n) behavior that gets very expensive on some patterns.
Maybe your blobs trigger some of those cases.

The diff-delta work all goes back to 2005 and 2006, so it's a long time ago.

What I'd ask you to do is try to find if you could make a repository of
just one of the bigger DLL's with its history, particularly if you can
find some that you don't think is _that_ sensitive.

Looking at it, for example, I see that you have that file

   extern/redhat-5/FlammableV3/x64/plugins/libFlameCUDA-3.0.703.so

that seems to have changed several times, and is a largish blob. Could
you try creating a repository with git fast-import that *only*
contains that file (or pick another one), and see if that delta's
well?

And if you find some case that doesn't xdelta well, and that you feel
you could make available outside, we could have a test-case...

 Linus


Re: RFC: Another proposed hash function transition plan

2017-03-04 Thread Linus Torvalds
On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder  wrote:
>
> This document is still in flux but I thought it best to send it out
> early to start getting feedback.

This actually looks very reasonable if you can implement it cleanly
enough. In many ways the "convert entirely to a new 256-bit hash" is
the cleanest model, and interoperability was at least my personal
concern. Maybe your model solves it (devil in the details), in which
case I really like it.

I do think that if you end up essentially converting the objects
without really having any true backwards compatibility at the object
layer (just the translation code), you should seriously look at doing
some other changes at the same time. Like not using zlib compression,
it really is very slow.

Btw, I do think the particular choice of hash should still be on the
table. sha-256 may be the obvious first choice, but there are
definitely a few reasons to consider alternatives, especially if it's
a complete switch-over like this.

One is large-file behavior - a parallel (or tree) mode could improve
on that noticeably. BLAKE2 does have special support for that, for
example. And SHA-256 does have known attacks compared to SHA-3-256 or
BLAKE2 - whether that is due to age or due to more effort, I can't
really judge. But if we're switching away from SHA1 due to known
attacks, it does feel like we should be careful.

Linus


Re: SHA1 collisions found

2017-03-02 Thread Linus Torvalds
On Thu, Mar 2, 2017 at 5:50 PM, Mike Hommey  wrote:
>
> What if the "object version" is a hash of the content (as opposed to
> header + content like the normal git hash)?

It doesn't actually matter for that attack.

The concept of the attack is actually fairly simple: generate a lot of
collisions in the first hash (and they outline how you only need to
generate 't' serial collisions and turn them into 2**t collisions),
and then just check those collisions against the second hash.

If you have enough collisions in the first one, having a collision in
the second one is inevitable just from the birthday rule.

Now, *in practice* that attack is not very easy to do. Each collision
is still hard to generate. And because of the git object rules (the
size has to match), it limits you a bit in what collisions you
generate.

But the fact that git requires the right header can be considered just
a variation on the initial state to SHA1, and then the additional
requirement might be as easy as just saying that your collision
generation function just always needs to generate a fixed-size block
(which could be just 64 bytes - the SHA1 blocking size).

So assuming you can arbitrarily generate collisions (brute force:
2**80 operations) you could make the rule be that you generate
something that starts with one 64-byte block that matches the git
rules:

   "blob 6454\0"..pad with repeating NUL bytes..

and then you generate 100 pairs of 64-byte SHA1 collisions (where the
first starts with the initial value of that fixed blob prefix, the
next with the state after the first block, etc etc).

Now you can generate 2**100 different sequences that all are exactly
6464 bytes (101 64-byte blocks) and all have the same SHA1 - and all
share that first fixed 64-byte block.

You needed "only" on the order of 100 * 2**80 SHA1 operations to do
that in theory.

And since you have 2**100 objects, you know that you will have a likely
birthday paradox even if your secondary hash is 200 bits long.

So all in all, you generated a collision in on the order of 2**100 operations.

So instead of getting the security of "160+200" bits, you only got 200
bits worth of real collision resistance.

NOTE!! All of the above is very much assuming "brute force". If you
can brute-force the hash, you can completely break any hash. The input
block to SHA1 is 64 bytes, so by definition you have 512 bits of data
to play with, and you're generating a 160-bit output: there is no
question what-so-ever that you couldn't generate any hash you want if
you brute-force things.

The place where things like having a fixed object header can help is
when the attack in question requires some repeated patterns. For
example, if you're not brute-forcing things, your attack on the hash
will likely involve using very particular patterns to change a number
of bits in certain ways, and then combining those particular patterns
to get the hash collision you wanted.  And *that* is when you may need
to add some particular data to the middle to make the end result be a
particular match.

But a brute-force attack definitely doesn't need size changes. You can
make the size be pretty much anything you want (modulo really small
inputs, of course - a one-byte input only has 256 different possible
hashes ;) if you have the resources to just go and try every possible
combination until you get the hash you wanted.

I may have overly simplified the paper, but that's the basic idea.

   Linus


Re: SHA1 collisions found

2017-03-02 Thread Linus Torvalds
On Thu, Mar 2, 2017 at 12:43 PM, Junio C Hamano  wrote:
>
> My reaction heavily depends on how that "object version" thing
> works.
>
> Would "object version" be like a truncated SHA-1 over the same data
> but with different IV or something, i.e. something that guarantees
> anybody would get the same result given the data to be hashed?

Yes, it does need to be that in practice. So what I was thinking the
object version would be is:

 (a) we actually take the object type into account explicitly.

 (b) we explicitly add another truncated hash.

The first part we can already do without any actual data structure
changes, since basically all users already know the type of an object
when they look it up.

So we already have information that we could use to narrow down the
hash collision case if we saw one.

There are some (very few) cases where we don't already explicitly have
the object type (a tag reference can be any object, for example, and
existing scripts might ask for "give me the type of this SHA1 object
with "git cat-file -t"), but that just goes back to the whole "yeah,
we'll handle legacy uses and we will look up objects even _without_
the extra version data, so it actually integrates well into the whole
notion.

Basically, once you accept that "hey, we'll just have a list of
objects with that hash", it just makes sense to narrow it down by the
object type we also already have.

But yes, the object type is obviously only two bits of information
(actually, considering the type distribution, probably just one bit),
and it's already encoded in the first hash, so it doesn't actually
help much as "collision avoidance" particularly once you have a
particular attack against that hash in place.

It's just that it *is* extra information that we already have, and
that is very natural to use once you start thinking of the hash lookup
as returning a list of objects. It also mitigates one of the worst
_confusions_ in git, and so basically mitigates the worst-case
downside of an attack basically for free, so it seems like a
no-brainer.

But the real new piece of object version would be a truncated second
hash of the object.

I don't think it matters too much what that second hash is, I would
say that we'd just approximate having a total of 256 bits of hash.

Since we already have basically 160 bits of fairly good hashing, and
roughly 128 bits of that isn't known to be attackable, we'd just use
another hash and truncate that to 128 bits. That would be *way*
overkill in practice, but maybe overkill is what we want. And it
wouldn't really expand the objects all that much more than just
picking a new 256-bit hash would do.

So you'd have to be able to attack both the full SHA1, _and_ whatever
other different good hash to 128 bits.

Linus

PS.  if people think that SHA1 is of a good _size_, and only worry
about the known weaknesses of the hashing itself, we'd only need to
get back the bits that the attacks take away from brute force. That's
currently the 80 -> ~63 bits attack, so you'd really only want about
40 bits of second hash to claw us back up to 80 bits of brute
force (again: brute force is basically sqrt() of the search space, so
half the bits, so adding 40 bits of hash adds 20 bits to the brute
force cost and you'd get back up to the 2**80 we started with).

So 128 bits of secondary hash really is much more than we'd need. 64
bits would probably be fine.


Re: SHA1 collisions found

2017-03-02 Thread Linus Torvalds
On Thu, Mar 2, 2017 at 1:54 PM, Joey Hess  wrote:
>
> There's a surprising result of combining iterated hash functions, that
> the combination is no more difficult to attack than the strongest hash
> function used.

Duh. I should actually have known that. I started reading the paper
and went "this seems very familiar". I'm pretty sure I've been pointed
at that paper before (or maybe just a similar one), and I just didn't
react enough for it to leave a lasting impact.

  Linus


Re: SHA1 collisions found

2017-03-02 Thread Linus Torvalds
On Fri, Feb 24, 2017 at 4:39 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> Honestly, I think that a primary goal for a new hash implementation
> absolutely needs to be to minimize mixing.
>
> Not for security issues, but because of combinatorics. You want to
> have a model that basically reads old data, but that very aggressively
> approaches "new data only" in order to avoid the situation where you
> have basically the exact same tree state, just _represented_
> differently.
>
> For example, what I would suggest the rules be is something like this:

Hmm. Having looked at this a fair amount, and in particularly having
looked at the code as part of the hash typesafety patch I did, I am
actually starting to think that it would not be too painful at all to
have a totally different approach, which might be a lot easier to do.

So bear with me, let me try to explain my thinking:

 (a) if we want to be backwards compatible and not force people to
convert their trees on some flag day, we're going to be stuck with
having to have the SHA1 code around and all the existing object
parsing for basically forever

Now, this basically means that the _easy_ solution would be that we
just do the flag day, switch to sha-256, extend everything to 32-byte
hashes, and just have a "git2 fast-import" that makes it easy to
convert stuff.

But it really would be a completely different version of git, with a
new pack-file format and no real compatibility. Such a flag-day
approach would certainly have advantages: it would allow for just
re-architecting some bad choices:

 - make the hashing be something that can be threaded (ie maybe we can
just block it up in 4MB chunks that you can hash in parallel, and make
the git object hash be the hash of hashes)

 - replace zlib with something like zstd

 - get rid of old pack formats etc.

but  on the whole, I still think that the compatibility would be worth
much more than the possible technical advantages of a clean slate
restart.

 (b) the SHA1 hash is actually still quite strong, and the collision
detection code effectively means that we don't really have to worry
about collisions in the immediate future.

In other words, the mitigation of the current attack is actually
really easy technically (modulo perhaps the performance concerns), and
there's still nothing fundamentally wrong with using SHA1 as a content
hash. It's still a great hash.

Now, my initial reaction (since it's been discussed for so long
anyway) was obviously "pick a different hash". That was everybody's
initial reaction, I think.

But I'm starting to think that maybe that initial obvious reaction was wrong.

The thing that makes collision attacks so nasty is that our reaction
to a collision is so deadly.  But that's not necessarily fundamental:
we certainly uses hashes with collisions every day, and they work
fine. And they work fine because the code that uses those hashes is
designed to simply deal gracefully - although very possibly with a
performance degradation - with two different things hashing to the
same bucket.

So what if the solution to "SHA1 has potential collisions" is "any
hash can have collisions in theory, let's just make sure we handle
them gracefully"?

Because I'm looking at our data structures that have hashes in them,
and many of them are actually of the type where I go

  "Hmm..  Replacing the hash entirely is really really painful - but
it wouldn't necessarily be all that nasty to extend the format to have
additional version information".

and the advantage of that approach is that it actually makes the
compatibility part trivial. No more "have to pick a different hash and
two different formats", and more of a "just the same format with
extended information that might not be there for old objects".

So we have a few different types of formats:

 - the purely local data structures: the pack index file, the file
index, our refs etc

    These we could change completely, and it wouldn't even be all
that painful. The pack index has already gone through versions, and it
doesn't affect anything else.

 - the actual objects.

   These are fairly painful to change, particularly things like the
"tree" object which is probably the worst designed of the lot. Making
it contain a fixed-size binary thing was really a nasty mistake. My
bad.

 - the pack format and the protocol to exchange "I have this" information

   This is *really* painful to change, because it contains not just
the raw object data, but it obviously ends up being the wire format
for remote accesses.

and it turns out that *all* of these formats look like they would be
fairly easy to extend to having extra object version information. Some
of that extra object version information we already have and don't
use, in fact.

Even the tree format, with the annoying fixed-size binary blob. Yes,

Re: [PATCH] Put sha1dc on a diet

2017-03-02 Thread Linus Torvalds
On Thu, Mar 2, 2017 at 10:37 AM, Jeff Hostetler  wrote:
>>
>> Now, if your _file_ index is 300-400MB (and I do think we check the
>> SHA fingerprint on that even on just reading it - verify_hdr() in
>> do_read_index()), then that's going to be a somewhat noticeable hit on
>> every normal "git diff" etc.
>
> Yes, the .git/index is 450MB with ~3.1M entries.  verify_hdr() is called
> each time we read it into memory.

Ok. So that's really just a purely historical artifact.

The file index is actually the first part of git to have ever been
written. You can't even see it in the history, because the initial
revision from Apr 7, 2005, obviously depended on the actual object
hashing.

But the file index actually came first. You can _kind_ of see that in
the layout of the original git tree, and how the main header file is
still called "cache.h", and how the original ".git" directory was
actually called ".dircache".

And the two biggest files (by a fairly big margin) are "read-cache.c"
and "update-cache.c".

So that file index cache was in many ways _the_ central part of the
original git model. The sha1 file indexing and object database was
just the backing store for the file index.

But part of that history is then how much I worried about corruption
of that index (and, let's face it, general corruption resistance _was_
one of the primary design goals - performance was high up there too,
but safety in the face of filesystem corruption was and is a primary
issue).

But realistically, I don't think we've *ever* hit anything serious on
the index file, and it's obviously not a security issue. It also isn't
even a compatibility issue, so it would be trivial to just bump the
version header and saying that the signature changes the meaning of
the checksum.

That said:

> We have been testing a patch in GfW to run the verification in a separate 
> thread
> while the main thread parses (and mallocs) the cache_entries.  This does help
> offset the time.

Yeah, that seems an even better solution, honestly.
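
For illustration, a minimal pthread sketch of that general idea (this
is not the GfW patch, and the checksum below is just a dummy stand-in
for the real SHA-1 over the index data): kick off the verification in
a worker thread, let the main thread get on with parsing, then join
and check the result before trusting it.

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

struct verify_job {
        const unsigned char *data;
        size_t len;
        uint32_t expected;
        int ok;
};

static uint32_t dummy_checksum(const unsigned char *p, size_t len)
{
        uint32_t sum = 0;       /* stand-in for SHA-1 over the trailer */

        while (len--)
                sum = sum * 31 + *p++;
        return sum;
}

static void *verify_worker(void *arg)
{
        struct verify_job *job = arg;

        job->ok = dummy_checksum(job->data, job->len) == job->expected;
        return NULL;
}

int main(void)
{
        static unsigned char index[1 << 20];
        struct verify_job job;
        pthread_t tid;
        size_t i;

        memset(index, 0xab, sizeof(index));
        job.data = index;
        job.len = sizeof(index);
        job.expected = dummy_checksum(index, sizeof(index));

        if (pthread_create(&tid, NULL, verify_worker, &job))
                return 1;

        /* ... the main thread would parse/malloc the cache entries here ... */
        for (i = 0; i < sizeof(index); i += 4096)
                (void)index[i];

        pthread_join(tid, NULL);
        printf("index checksum %s\n", job.ok ? "ok" : "MISMATCH");
        return 0;       /* build with: cc -pthread */
}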

The patch would be cleaner without the NO_PTHREADS things.

I wonder how meaningful that thing even is today. Looking at what
seems to select NO_PTHREADS, I suspect that's all entirely historical.
For example, you'll see it for QNX etc, which seems wrong - QNX
definitely has pthreads according to their docs, for example.

 Linus


Re: [PATCH] Put sha1dc on a diet

2017-03-02 Thread Linus Torvalds
On Thu, Mar 2, 2017 at 6:45 AM, Johannes Schindelin
 wrote:
>
> It would probably make sense to switch the index integrity check away from
> SHA-1 because we really only care about detecting bit flips there, and we
> have no need for the computational overhead of using a full-blown
> cryptographic hash for that purpose.

Which index do you actually see as being a problem, btw? The main file
index (.git/index) or the pack-file indexes?

We definitely don't need the checking version of sha1 for either of
those, but as Jeff already did the math, at least the pack-file index
is almost negligible, because the pack-file operations that update it
end up doing SHA1 over the objects - and the object SHA1 calculations
are much bigger.

And I don't think we even check the pack-file index hashes except on fsck.

Now, if your _file_ index is 300-400MB (and I do think we check the
SHA fingerprint on that even on just reading it - verify_hdr() in
do_read_index()), then that's going to be a somewhat noticeable hit on
every normal "git diff" etc.

But I'd have expected the stat() calls of all the files listed by that
index to be the _much_ bigger problem in that case. Or do you just
turn those off with assume-unchanged?

Yeah, those stat calls are threaded when preloading, but even so..

Anyway, the file index SHA1 checking could probably just be disabled
entirely (with a config flag). It's a corruption check that simply
isn't that important. So if that's your main SHA1 issue, that would be
easy to fix.

Everything else - like pack-file generation etc for a big clone() may
end up using a ton of SHA1 too, but the SHA1 costs all scale with the
other costs that drown them out (ie zlib, network, etc).

I'd love to see a profile if you have one.

  Linus


Re: [PATCH] Put sha1dc on a diet

2017-03-01 Thread Linus Torvalds
On Wed, Mar 1, 2017 at 11:07 AM, Jeff King  wrote:
>
> So obviously the smaller object size is nice, and the diffstat is
> certainly satisfying. My only qualm would be whether this conflicts with
> the optimizations that Dan is working on (probably not conceptually, but
> textually).

Yeah. But I'll happily just re-apply the patch on any new version that
Dan posts.  The patch is obviously trivial, even if size-wise it's a
fair number of lines.

So I wouldn't suggest using the patched version as some kind of
starting point. It's much easier to just take a new version of
upstream and repeat the diet patch on it.

.. and obviously later versions of the upstream sha1dc code may not
even need this at all since Dan and Marc are now aware of the issue.

  Linus


Re: Delta compression not so effective

2017-03-01 Thread Linus Torvalds
On Wed, Mar 1, 2017 at 4:12 PM, Marius Storm-Olsen  wrote:
>
> No, the list of git verify-objects in the previous post was from the bottom
> of the sorted list, so those are the largest blobs, ~249MB..

.. so with a 6GB window, you should easily sill have 20+ objects. Not
a huge window, but it should find some deltas.

But a smaller window - _together_ with a suboptimal sorting choice -
could then result in lack of successful delta matches.

> So, this repo must be knocking several parts of Git's insides. I was curious
> about why it was so slow on the writing objects part, since the whole repo
> is on a 4x RAID 5, 7k spindels. Now, they are not SSDs sure, but the thing
> has ~400MB/s continuous throughput available.
>
> iostat -m 5 showed trickle read/write to the process, and 80-100% CPU single
> thread (since the "write objects" stage is single threaded, obviously).

So the writing phase isn't multi-threaded because it's not expected to
matter. But if you can't even generate deltas, you aren't just
*writing* much more data, you're compressing all that data with zlib
too.

So even with a fast disk subsystem, you won't even be able to saturate
the disk, simply because the compression will be slower (and
single-threaded).

> Filenames are fairly static, and the bulk of the 6000 biggest non-delta'ed
> blobs are the same DLLs (multiple of them)

I think the first thing you should test is to repack with fewer
threads, and a bigger pack window. Do somethinig like

  -c pack.threads=4 --window-memory=30g

instead. Just to see if that starts finding deltas.

> Right, now on this machine, I really didn't notice much difference between
> standard zlib level and doing -9. The 203GB version was actually with
> zlib=9.

Don't. zlib has *horrible* scaling with higher compressions. It
doesn't actually improve the end result very much, and it makes things
*much* slower.

zlib was a reasonable choice when git started - well-known, stable, easy to use.

But realistically it's a relatively horrible choice today, just
because there are better alternatives now.

>> How sensitive is your material? Could you make a smaller repo with
>> some of the blobs that still show the symptoms? I don't think I want
>> to download 206GB of data even if my internet access is good.
>
> Pretty sensitive, and not sure how I can reproduce this reasonable well.
> However, I can easily recompile git with any recommended
> instrumentation/printfs, if you have any suggestions of good places to
> start? If anyone have good file/line numbers, I'll give that a go, and
> report back?

So the first thing you might want to do is to just print out the
objects after sorting them, and before it starts trying to find
deltas.

See prepare_pack() in builtin/pack-objects.c, where it does something like this:

        if (nr_deltas && n > 1) {
                unsigned nr_done = 0;
                if (progress)
                        progress_state = start_progress(_("Compressing objects"),
                                                        nr_deltas);
                QSORT(delta_list, n, type_size_sort);
                ll_find_deltas(delta_list, n, window+1, depth, &nr_done);
                stop_progress(&progress_state);


and notice that QSORT() line: that's what sorts the objects. You can
do something like

for (i = 0; i < n; i++)
show_object_entry_details(delta_list[i]);

right after that QSORT(), and make that print out the object hash,
filename hash, and size (we don't have the filename that the object
was associated with any more at that stage - they take too much
space).

Save off that array for off-line processing: when you have the object
hash, you can see what the contents are, and match it up with the
file in the git history using something like

   git log --oneline --raw -R --abbrev=40

which shows you the log, but also the "diff" in the form of "this
filename changed from SHA1 to SHA1", so you can match up the object
hashes with where they are in the tree (and where they are in
history).

So then you could try to figure out if that type_size_sort() heuristic
is just particularly horrible for you.

In fact, if your data is not *so* sensitive, and you're ok with making
the one-line commit logs and the filenames public, you could make just
those things available, and maybe I'll have time to look at it.

I'm in the middle of the kernel merge window, but I'm in the last
stretch, and because of the SHA1 thing I've been looking at git
lately. No promises, though.

   Linus


Re: [PATCH] Put sha1dc on a diet

2017-03-01 Thread Linus Torvalds
On Wed, Mar 1, 2017 at 12:34 PM, Jeff King  wrote:
>
> I don't think that helps. The sha1 over the pack-file takes about 1.3s
> with openssl, and 5s with sha1dc. So we already know the increase there
> is only a few seconds, not a few minutes.

Yeah, I did a few statistics by adding just logging of "SHA1_Init()"
calls. For that network clone situation, the call distribution is

        1 SHA1: Init at builtin/index-pack.c:326
   841228 SHA1: Init at builtin/index-pack.c:450
        2 SHA1: Init at csum-file.c:152
  4415756 SHA1: Init at sha1_file.c:3218

(the line numbers are a bit off from 'pu', because I obviously have
the logging code).

The big number (one for every object) is from
write_sha1_file_prepare(), which we'd want to be the strong collision
checking version because those are things we're about to create git
objects out of. It's called from

 - hash_sha1_file() - doesn't actually write the object, but is used
to calculate the sha for incoming data after applying the delta, for
example.

 - write_sha1_file() - many uses, actually writes the object

 - hash_sha1_file_literally() - git hash-object

and that index-pack.c:450 is from unpack_entry_data() for the base
non-delta objects (which should also be the strong kind).

So all of them should check against collision attacks, so none of them
seem to be things you'd want to optimize away..

So I was wrong in thinking that there were a lot of unnecessary SHA1
calculations in that load. They all look like they should be done with
the slower checking code.

Oh well.

  Linus


Re: [PATCH] Put sha1dc on a diet

2017-03-01 Thread Linus Torvalds
On Wed, Mar 1, 2017 at 2:51 PM, Johannes Schindelin
 wrote:
>
> But I think bigger than just developers on Windows OS. There are many
> developers out there working on large repositories (yes, much larger than
> Linux). Also using Macs and Linux. I am not at all sure that we want to
> give them an updated Git they cannot fail to notice to be much slower than
> before.

Johannes, have you *tried* the patches?

I really don't think you have. It is completely unnoticeable in any
normal situation. The two cases that it's noticeable is:

 - a full fsck is noticeable slower

 - a full non-local clone is slower (but not really noticeably so
since the network traffic dominates).

In other words, I think you're making shit up. I don't think you
understand how little the SHA1 performance actually matters. It's
noticeable in benchmarks. It's not noticeable in any normal operation.

.. and yes, I've actually been running the patches locally since I
posted my first version (which apparently didn't go out to the list
because of list size limits) and now running the version in 'pu'.

Linus


Re: [PATCH] Put sha1dc on a diet

2017-03-01 Thread Linus Torvalds
On Wed, Mar 1, 2017 at 1:56 PM, Johannes Schindelin
 wrote:
> Footnote *1*: I know, it is easy to forget that some developers cannot
> choose their tools, or even their hardware. In the past, we seemed to take
> appropriate care, though.

I don't think you need to worry about the Windows side. That can
continue to do something else.

When I advocated perhaps using  USE_SHA1DC by default, I definitely
did not mean it in a "everywhere, regardless of issues" manner.

For example, the config.mak.uname script already explicitly asks for
"BLK_SHA1 = YesPlease" for Windows. Don't bother changing that, it's
an explicit choice.

But the Linux rules don't actually specify which SHA1 version to use,
so the main Makefile currently defaults to just using openssl.

So that's the "default" choice I think we might want to change. Not
the "we're windows, and explicitly want BLK_SHA1 because of
environment and build infrastructure".

 Linus


Re: Delta compression not so effective

2017-03-01 Thread Linus Torvalds
On Wed, Mar 1, 2017 at 9:57 AM, Marius Storm-Olsen  wrote:
>
> Indeed, I did do a
> -c pack.threads=20 --window-memory=6g
> to 'git repack', since the machine is a 20-core (40 threads) machine with
> 126GB of RAM.
>
> So I guess with these sized objects, even at 6GB per thread, it's not enough
> to get a big enough Window for proper delta-packing?

Hmm. The 6GB window should be plenty good enough, unless your blobs
are in the gigabyte range too.

> This repo took >14hr to repack on 20 threads though ("compression" step was
> very fast, but stuck 95% of the time in "writing objects"), so I can only
> imagine how long a pack.threads=1 will take :)

Actually, it's usually the compression phase that should be slow - but
if something is limiting finding deltas (so that we abort early), then
that would certainly tend to speed up compression.

The "writing objects" phase should be mainly about the actual IO.
Which should be much faster *if* you actually find deltas.

> But arent't the blobs sorted by some metric for reasonable delta-pack
> locality, so even with a 6GB window it should have seen ~25 similar objects
> to deltify against?

Yes they are. The sorting for delta packing tries to make sure that
the window is effective. However, the sorting is also just a
heuristic, and it may well be that your repository layout ends up
screwing up the sorting, so that the windows just work very badly.

For example, the sorting code thinks that objects with the same name
across the history are good sources of deltas. But it may be that for
your case, the binary blobs that you have don't tend to actually
change in the history, so that heuristic doesn't end up doing
anything.

The sorting does use the size and the type too, but the "filename
hash" (which isn't really a hash, it's something nasty to give
reasonable results for the case where files get renamed) is the main
sort key.

So you might well want to look at the sorting code too. If filenames
(particularly the end of filenames) for the blobs aren't good hints
for the sorting code, that sort might end up spreading all the blobs
out rather than sort them by size.
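
For reference, that name "hash" is roughly the following (paraphrased
from memory as a standalone sketch; see pack_name_hash() in the git
sources for the authoritative version). Only the last dozen or so
non-whitespace characters really contribute, and the last characters
count the most, which is why files whose names *end* the same way end
up next to each other in the sort no matter what directory they live
in.

#include <ctype.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t name_hash(const char *name)
{
        uint32_t c, hash = 0;

        if (!name)
                return 0;
        while ((c = (unsigned char)*name++) != 0) {
                if (isspace(c))
                        continue;
                /* old characters shift down, the newest lands on top */
                hash = (hash >> 2) + (c << 24);
        }
        return hash;
}

int main(void)
{
        /* the second path is a hypothetical sibling of the first */
        const char *names[] = {
                "extern/win/FlammableV3/x64/lib/FlameProxyLibD.lib",
                "extern/redhat-5/FlammableV3/x64/lib/FlameProxyLibD.lib",
                "Makefile",
        };
        int i;

        for (i = 0; i < 3; i++)
                printf("%08x  %s\n", name_hash(names[i]), names[i]);
        return 0;
}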

And again, if that happens, the "can I delta these two objects" code
will notice that the size of the objects are wildly different and
won't even bother trying. Which speeds up the "compressing" phase, of
course, but then because you don't get any good deltas, the "writing
out" phase sucks donkey balls because it does zlib compression on big
objects and writes them out to disk.

So there are certainly multiple possible reasons for the deltification
to not work well for you.

How sensitive is your material? Could you make a smaller repo with
some of the blobs that still show the symptoms? I don't think I want
to download 206GB of data even if my internet access is good.

Linus


Re: [PATCH] Put sha1dc on a diet

2017-03-01 Thread Linus Torvalds
On Wed, Mar 1, 2017 at 10:42 AM, Junio C Hamano  wrote:
> I see //c99 comments

sha1dc is already full of // style comments. I just followed the
existing practice.

> and also T array[] = { [58] = val } both of
> which I think we stay away from (and the former is from the initial
> import), so some people on other platforms MAY have trouble with
> this topic.

Hmm. The "{ [58] = val }" kind of initialization would be easy to
work around by just filling in everything else with NULL, but it would
make for a pretty nasty readability issue.

That said, if you mis-count the NULL's, the end result will pretty
immediately SIGSEGV, so I guess it wouldn't be much of a maintenance
problem.

But if you're just willing to take the "let's see" approach, I think
the explicitly numbered initializer is much better.
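
[To make the trade-off concrete, here is the difference in miniature;
the table and names are hypothetical, not the sha1dc source.]

  /* Designated initializer: the index is explicit, every other slot is
   * implicitly NULL, and a miscount is impossible.
   */
  static void recompress_58(void) { /* ... */ }
  static void recompress_65(void) { /* ... */ }

  static void (*recompress[80])(void) = {
          [58] = recompress_58,
          [65] = recompress_65,
  };

  /* The pre-C99 alternative is 58 explicit NULLs before the first real
   * entry.  It is unreadable, and an off-by-one shifts the functions to
   * the wrong slots -- though, as noted above, calling a NULL entry
   * would SIGSEGV immediately rather than fail quietly.
   */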

The main people who I assume would really want to use the sha1dc
library are hosting places. And they won't be using crazy compilers
from the last century.

That said, I think that it would be lovely to just default to
USE_SHA1DC and just put the whole attack behind us. Yes, it's slower.
No, it doesn't really seem to matter that much in practice.

  Linus


Re: Delta compression not so effective

2017-03-01 Thread Linus Torvalds
On Wed, Mar 1, 2017 at 5:51 AM, Marius Storm-Olsen  wrote:
>
> When first importing, I disabled gc to avoid any repacking until completed.
> When done importing, there was 209GB of all loose objects (~670k files).
> With the hopes of quick consolidation, I did a
> git -c gc.autoDetach=0 -c gc.reflogExpire=0 \
>   -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 \
>   -c gc.rerereunresolved=0 -c gc.pruneExpire=now \
>   gc --prune
> which brought it down to 206GB in a single pack. I then ran
> git repack -a -d -F --window=350 --depth=250
> which took it down to 203GB, where I'm at right now.

Considering that it was 209GB in loose objects, I don't think it
delta-packed the big objects at all.

I wonder if the big objects end up hitting some size limit that causes
the delta creation to fail.

For example, we have that HASH_LIMIT  that limits how many hashes
we'll create for the same hash bucket, because there's some quadratic
behavior in the delta algorithm. It triggered with things like big
files that have lots of repeated content.
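
[HASH_LIMIT is in diff-delta.c; the sketch below only illustrates the
idea, and the limit value should be treated as approximate.]

  /* Illustration of the HASH_LIMIT idea: while indexing the delta
   * source, entries landing in the same hash bucket are capped.
   * Without the cap, a big file full of near-identical blocks makes
   * every match attempt walk an enormous chain and the delta search
   * goes quadratic; with it, such files stay fast to index but may
   * delta worse than expected, which could be what is happening here.
   */
  #define HASH_LIMIT 64

  static unsigned int cap_bucket_entries(unsigned int count)
  {
          return (count > HASH_LIMIT) ? HASH_LIMIT : count;
  }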

We also have various memory limits, in particular
'window_memory_limit'. That one should default to 0, but maybe you
limited it at some point in a config file and forgot about it?

 Linus


[PATCH] Put sha1dc on a diet

2017-02-28 Thread Linus Torvalds

From: Linus Torvalds <torva...@linux-foundation.org>
Date: Tue, 28 Feb 2017 16:12:32 -0800
Subject: [PATCH] Put sha1dc on a diet

This removes the unnecessary parts of the sha1dc code, shrinking things from

[torvalds@i7 git]$ size sha1dc/*.o
   text    data     bss     dec     hex filename
 277559     640       0  278199   43eb7 sha1dc/sha1.o
   4438   11352       0   15790    3dae sha1dc/ubc_check.o

to

[torvalds@i7 git]$ size sha1dc/*.o
   text    data     bss     dec     hex filename
  13287       0       0   13287    33e7 sha1dc/sha1.o
   4438   11352       0   15790    3dae sha1dc/ubc_check.o

so the sha1.o text size shrinks from about 271kB to about 13kB.

The shrinking comes mainly from only generating the recompression functions 
for the two rounds that are actually used (58 and 65), but also from 
removing a couple of other unused functions. The sha1dc library lost its 
"safe_hash" parameter to do that, since we check - and refuse to touch - 
the colliding cases manually.

The git binary itself is about 2MB of text on my system. For other helper 
binaries the size reduction is even more noticeable.  A quarter MB here 
and a quarter MB there, and suddenly you have a big binary ;)

This has been tested with the bad pdf image:

[torvalds@i7 git]$ ./t/helper/test-sha1 < ~/Downloads/bad.pdf 
fatal: The SHA1 computation detected evidence of a collision attack;
refusing to process the contents.

although we obviously still don't have an actual git object to test with.

The normal git test-suite obviously also passes.

Signed-off-by: Linus Torvalds <torva...@linux-foundation.org>
---

I notice that sha1dc is in the 'pu' branch now, so let's put my money 
where my mouth is, and send in the sha1dc diet patch.

 sha1dc/sha1.c | 356 +++---
 sha1dc/sha1.h |  24 
 2 files changed, 18 insertions(+), 362 deletions(-)

diff --git a/sha1dc/sha1.c b/sha1dc/sha1.c
index 6569b403e..4910f0c35 100644
--- a/sha1dc/sha1.c
+++ b/sha1dc/sha1.c
@@ -9,6 +9,15 @@
 #include "sha1dc/sha1.h"
 #include "sha1dc/ubc_check.h"
 
+// function type for sha1_recompression_step_T (uint32_t ihvin[5], uint32_t ihvout[5], const uint32_t me2[80], const uint32_t state[5])
+// where 0 <= T < 80
+//   me2 is an expanded message (the expansion of an original message block XOR'ed with a disturbance vector's message block difference)
+//   state is the internal state (a,b,c,d,e) before step T of the SHA-1 compression function while processing the original message block
+// the function will return:
+//   ihvin: the reconstructed input chaining value
+//   ihvout: the reconstructed output chaining value
+typedef void(*sha1_recompression_type)(uint32_t*, uint32_t*, const uint32_t*, const uint32_t*);
+
 #define rotate_right(x,n) (((x)>>(n))|((x)<<(32-(n))))
 #define rotate_left(x,n)  (((x)<<(n))|((x)>>(32-(n))))
 
@@ -39,212 +48,14 @@
 
 
 
-void sha1_message_expansion(uint32_t W[80])
+static void sha1_message_expansion(uint32_t W[80])
 {
unsigned i;
for (i = 16; i < 80; ++i)
 W[i] = rotate_left(W[i - 3] ^ W[i - 8] ^ W[i - 14] ^ W[i - 16], 1);
 }
 
-void sha1_compression(uint32_t ihv[5], const uint32_t m[16])
-{
-   uint32_t W[80];
-   uint32_t a, b, c, d, e;
-   unsigned i;
-
-   memcpy(W, m, 16 * 4);
-   for (i = 16; i < 80; ++i)
-   W[i] = rotate_left(W[i - 3] ^ W[i - 8] ^ W[i - 14] ^ W[i - 16], 1);
-
-   a = ihv[0];
-   b = ihv[1];
-   c = ihv[2];
-   d = ihv[3];
-   e = ihv[4];
-
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(a, b, c, d, e, W, 0);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(e, a, b, c, d, W, 1);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(d, e, a, b, c, W, 2);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(c, d, e, a, b, W, 3);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(b, c, d, e, a, W, 4);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(a, b, c, d, e, W, 5);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(e, a, b, c, d, W, 6);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(d, e, a, b, c, W, 7);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(c, d, e, a, b, W, 8);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(b, c, d, e, a, W, 9);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(a, b, c, d, e, W, 10);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(e, a, b, c, d, W, 11);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(d, e, a, b, c, W, 12);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(c, d, e, a, b, W, 13);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(b, c, d, e, a, W, 14);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(a, b, c, d, e, W, 15);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(e, a, b, c, d, W, 16);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(d, e, a, b, c, W, 17);
-   HASHCLASH_SHA1COMPRESS_ROUND1_STEP(c, d,

Re: Typesafer git hash patch

2017-02-28 Thread Linus Torvalds
On Tue, Feb 28, 2017 at 12:19 PM, brian m. carlson
 wrote:
>
> The bigger issue is the assumptions in the code base that assume a given
> hash size.

Absolutely. And I think those are going to be the "real" patches.

I actually saw your status report about

  "After another 27 commits, I've got it down from 1244 to 1119."

when I had just started, and I applauded the approach, but the numbers
made me go "ugh, that sounds painful".

What I actually wanted to do was to write a coccinelle script to
automate it, but it wasn't entirely obvious, and I started out just
doing it with "sed" and hand-fixing, and it got tedious but by then I had
done X% and just kept going.
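
[The mechanical change both series are making has roughly the shape
below; this is an illustration using the struct name from brian m.
carlson's series, not a hunk from either patch.]

  /* Before: the SHA-1 width is hard-coded everywhere. */
  int has_object_before(const unsigned char *sha1);
  unsigned char parent_sha1[20];

  /* After: a typed wrapper, so the width is named exactly once and the
   * compiler catches places that still assume raw 20-byte arrays.
   */
  #define GIT_MAX_RAWSZ 32   /* illustrative; the early series just used the SHA-1 size, 20 */

  struct object_id {
          unsigned char hash[GIT_MAX_RAWSZ];
  };

  int has_object_after(const struct object_id *oid);
  struct object_id parent_oid;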

I think an automated script that also guarantees that no code
generation could possibly change would have been lovely. We do that
over the kernel occasionally, where those kinds of patches that are
four thousand insertions/deletions in size are called "last Tuesday".

So my approach was really just brute-force and stupid, I admit it. I'm
not proud of the patch either.

It might be one of those "bandaid removal moments" - get it over and
done with and just rip that thing off ;)

So the patch is

 216 files changed, 4174 insertions(+), 4080 deletions(-)

and about 27k lines and 950kB total size.

The list limits are apparently much lower than that, but I'll happily
send it to people in private.

   Linus


Re: SHA1 collisions found

2017-02-28 Thread Linus Torvalds
On Tue, Feb 28, 2017 at 2:50 PM, Marc Stevens  wrote:
>
> Because we only have 32 disturbance vectors to check, we have DVMASKSIZE
> equal to 1 and maski always 0.
> In the more general case when we add disturbance vectors this will not
> remain the case.

Ok, I didn't get why that happened, but it makes sense to me now.

> Of course for dedicated code this can be simplified, and some parts
> could be further optimized.

So I'd be worried about changing your tested code too much, since the
only test-cases we have are the two pdf files. If we screw up too
much, those will no longer show as collisions, but we could get tons
of false positives that we wouldn't see, so..

I'm wondering that since the disturbance vector cases themselves are a
fairly limited number, perhaps the code that generates this could be
changed to actually just generate the static calls rather than the
loop over the sha1_dvs[] array.

IOW, instead of generating this:

for (i = 0; sha1_dvs[i].dvType != 0; ++i) {
.. use sha1_dvs[i] values as indexes into function arrays etc..
}

maybe you could just make the generator generate that loop statically,
and have 32 function calls with the masks as constant arguments.

.. together with only generating the SHA1_RECOMPRESS() functions for
the cases that are actually used.

So it would still be entirely generated, but it would generate a
little bit more explicit code.
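
[Something like the following is what "a little bit more explicit code"
would look like; every identifier here is invented purely to show the
shape of the unrolled output.]

  #include <stdint.h>

  /* Hypothetical generator output: instead of looping over sha1_dvs[]
   * and indexing function-pointer arrays, emit one explicit call per
   * disturbance vector with its mask, test step and recompression
   * function as compile-time constants.
   */
  typedef struct SHA1_CTX SHA1_CTX;

  void check_dv(SHA1_CTX *ctx, uint32_t mask, int testt,
                void (*recompress)(uint32_t *, uint32_t *,
                                   const uint32_t *, const uint32_t *));
  void sha1_recompress_58(uint32_t *, uint32_t *, const uint32_t *, const uint32_t *);
  void sha1_recompress_65(uint32_t *, uint32_t *, const uint32_t *, const uint32_t *);

  static void check_all_dvs(SHA1_CTX *ctx)
  {
          check_dv(ctx, UINT32_C(1) << 0,  58, sha1_recompress_58);
          check_dv(ctx, UINT32_C(1) << 1,  58, sha1_recompress_58);
          /* ... one call per disturbance vector, 32 in all ... */
          check_dv(ctx, UINT32_C(1) << 31, 65, sha1_recompress_65);
  }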

Of course, we could just edit out all the SHA_RECOMPRESS(x) cases by
hand, and only leave the two that are actually used.

As it is, the lib/sha1.c code generates about 250kB of code that is
never used if I read the code correctly (that's just the sha1.c code -
entirely ignoring all the avx2 etc versions that I haven't looked at,
and that I don't think git would use)

> Regarding the recompression functions, the ones needed are given in the
> sha1_dvs table,
> but also via preprocessor defines that are used to actually only store
> needed internal states:
> #define DOSTORESTATE58
> #define DOSTORESTATE65

Yeah, I guess we could use those #define's to cull the code "automatically".

But I think you could do it at the generation phase easily too, so
that we don't then introduce unnecessary differences when we try to
get rid of the extra fat ;)


> Note that as each disturbance vector has its own unique message differences
> (leading to different values for ctx->m2), our code does not loop over
> just 2 items.
> It loops over 32 distinct computations which have either of the 2
> starting points.

Yes, I already had to clarify my badly expressed writing on the git
list. My concerns were about the (very much not obvious) limited
values in the dvs array.

So the code superficially *looks* like it uses all those functions you
generate (and the maski value _looked_ like it was interesting), but
when looking closer it turns out that there's just a two different
function calls that it loops over (but it loops over them multiple
times, I agree).

   Linus

