Re: Re: Merge with git-pasky II.
* Linus Torvalds [EMAIL PROTECTED] wrote: On Sun, 17 Apr 2005, Ingo Molnar wrote: in fact, this attack cannot even be proven to be malicious, purely via the email from Malice: it could be incredible bad luck that caused that good-looking patch to be mistakenly matching a dangerous object. I really hate theoretical discussions. i was only replying to your earlier point: Almost all attacks on sha1 will depend on _replacing_ a file with a bogus new one. So guys, instead of using sha256 or going overboard, just make sure that when you synchronize, you NEVER import a file you already have. which point i still believe is subtly wrong. You were suggesting to concentrate on file replacement to counter most of the practical attacks, while i pointed out an attack _using the same basic mechanism that your point above supposed_. [ if you can replace a file with a known hash, with a bogus new one, and you still have enough control over the contents of your bogus new file that it is 1) a valid file that builds 2) compromises the kernel, then you likely have the same amount of control my 'theoretical' attack requires. ] And the thing is, _if_ somebody finds a way to make sha1 act as just a complex parity bit, and comes up with generating a clashing object that actually makes sense, then going to sha256 is likely pointless too [...] yes, that's why i suggested to not actually trust the hash to be cryptographically secure, but to just assume it's a good generic hash we can design a DB around, and to turn -DCOLLISION_CHECK on and enforce consistency rules on boundaries. [ it's not bad to keep sha1 because even my suggested enhancement still leaves 'content-less trust-pointers to untrusted content via email' vectors open against attack (maintainer sends you an email that commit X in Malice's repository Y is fine to pull, and you pull it blindly, while the attacker has replaced his content with the compromised one meanwhile), but it at least validates the bulk traffic that goes into the DB: patches via emails and trusted repositories. ] so all i was suggesting was to extend your suggested 'overwrite collision check' to a stricter 'content we throw away and use the sha1 shortcut for needs to be checked against the in-DB content as well'. in other words, your suggested 'rename check' is checking for 'positive duplicate content', while my addition would also check for 'negative duplicate content' as well. but as usual, i could be wrong, so dont take this too serious :-) Ingo - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Merge with git-pasky II.
On Sat, 2005-04-16 at 17:33 +0200, Johannes Schindelin wrote: But if it can be done cheaply enough at a later date even though we end up repeating ourselves, and if it can be done _well_ enough that we shouldn't have just asked the user in the first place, then yes, OK I agree. The repetition could be helped by using a cache. Perhaps. Since neither such a cache nor even the commit comments are strictly part of the git data, they probably shouldn't be included in the sha1 hash of the commit object. However, I don't see a fundamental reason why we couldn't store them in the same file but omit them from the hash calculations. That also allows us to retrospectively edit commit comments without completely changing the entire subsequent history. Or is that a little too heretical a suggestion? -- dwmw2 - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Re: Merge with git-pasky II.
* Ingo Molnar [EMAIL PROTECTED] wrote: The compromise relies on you having reviewed something harmless, while in reality what happened within the DB was far less harmless. And the DB remains self-consistent: neither fsck, nor others importing your tree will be able to detect the compromise. This attack can only be detected when you apply the patch, after that point all the information (except Malice's message in your inbox) is gone. in fact, this attack cannot even be proven to be malicious, purely via the email from Malice: it could be incredible bad luck that caused that good-looking patch to be mistakenly matching a dangerous object. In fact this could happen even today, _accidentally_. (but i'm willing to bet that hell will be freezing over first, and i'll have some really good odds ;) There's probably a much higher likelyhood of Linus' tree getting corrupted in some old fashioned way and introducing a security hole by accident) Ingo - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Merge with git-pasky II.
Dear diary, on Mon, Apr 18, 2005 at 01:29:05AM CEST, I got a letter where Herbert Xu [EMAIL PROTECTED] told me that... I get the feeling that it isn't that bad. For example, if we did it at the points where the blobs actually entered the tree, then the cost is always proportional to the change size (the number of new blobs). No. The collision check is done in the opposite cache - when you want to write a blob and there is already a file of the same hash in the tree. So either the blob is already in the database, or you have a collision. Therefore, the cost is proportional to the size of what stays unchanged. -- Petr Pasky Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Merge with git-pasky II.
On Mon, 18 Apr 2005, Herbert Xu wrote: I wasn't disputing that of course. However, the same effect can be achieved in using a single hash with a bigger length, e.g., sha256 or sha512. No it cannot. If somebody actually literally totally breaks that hash, length won't matter. There are (bad) hashes where you can literally edit the content of the file, and make sure that the end result has the same hash. In that case, when the hash algorithm has actually been broken, the length of the hash ends up being not very relevant. For example, you might hash your file by blocking it up in 16-byte blocks, and xoring all blocks together - the result is a 16-byte hash. It's a terrible hash, and obviously trivially breakable, and once broken it does _not_ help to make it use its 32-byte cousin. Not at all. You can just modify the breaking thing to equally cheaply make modifications to a file and get the 32-byte hash right again. Is that kind of breakage likely for sha1? Hell no. Is it possible? In your in theory world where practice doesn't matter, yes. Linus - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Merge with git-pasky II.
Petr Baudis wrote: Dear diary, on Mon, Apr 18, 2005 at 01:29:05AM CEST, I got a letter where Herbert Xu [EMAIL PROTECTED] told me that... I get the feeling that it isn't that bad. For example, if we did it at the points where the blobs actually entered the tree, then the cost is always proportional to the change size (the number of new blobs). No. The collision check is done in the opposite cache - when you want to write a blob and there is already a file of the same hash in the tree. So either the blob is already in the database, or you have a collision. Therefore, the cost is proportional to the size of what stays unchanged. ?? now I'm confused. Surly the only cost involved is to never write over a file that already exist in the cache and that is already done NOW as far as I read the code. So there is NO extra cost in detecting an collision. - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Merge with git-pasky II.
Dear diary, on Mon, Apr 18, 2005 at 02:49:06AM CEST, I got a letter where Herbert Xu [EMAIL PROTECTED] told me that... Therefore the only conclusion I can draw is that we're only calling update-cache on the set of changed files, or at most a small superset of them. In that case, the cost of the collision check *is* proportional to the size of the change. Yes, of course, sorry for the confusion. We only consider files you either specify manually or which have their stat metadata changed relative to the directory cache. (That is from the git-pasky perspective; from the plumbing perspective, the user just does update-cache on whatever he picks.) -- Petr Pasky Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Merge with git-pasky II.
On Fri, Apr 15, 2005 at 08:32:46AM -0700, Linus Torvalds wrote: In other words, I'm right. I'm always right, but sometimes I'm more right than other times. And dammit, when I say files don't matter, I'm really really Right(tm). You're right, of course (All Hail Linus!), if you can make it work efficiently enough. Just to put something else on the table, here's how I'd go about tracking renames and the like, in another world where Linus /does/ make the odd mistake - it's basically a unique id for files in the repository, added when the file is first recognised and updated when update-cache adds a new version to the cache. Renames copy the id across to the new name, and add it into the cache. This gives you an O(n) way to tell what file was what across renames, and it might even be useful in Linus' world, or if someone wanted to build a traditional SCM on top of a git-a-like. Attached is a patch, and a rename-file.c to use it. Simon given that you have multiple machines creating files, how do you deal with the idea of the same 'unique id' being assigned to different files by different machines? David Lang -- There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies. -- C.A.R. Hoare - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Merge with git-pasky II.
Hi, On Fri, 15 Apr 2005, David Woodhouse wrote: But if it can be done cheaply enough at a later date even though we end up repeating ourselves, and if it can be done _well_ enough that we shouldn't have just asked the user in the first place, then yes, OK I agree. The repetition could be helped by using a cache. Ciao, Dscho - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Re: Merge with git-pasky II.
On Sat, Apr 16, 2005 at 06:03:33PM +0200, Petr Baudis wrote: Dear diary, on Sat, Apr 16, 2005 at 05:55:37PM CEST, I got a letter where Simon Fowler [EMAIL PROTECTED] told me that... On Sat, Apr 16, 2005 at 05:19:24AM -0700, David Lang wrote: Simon given that you have multiple machines creating files, how do you deal with the idea of the same 'unique id' being assigned to different files by different machines? The id is a sha1 hash of the current time and the full path of the file being added - the chances of that being replicated without malicious intent is extremely small. There are other things that could be used, like the hostname, username of the person running the program, etc, but I don't really see them being necessary. Why not just use UUID? Hey, everything else in git seems to use sha1, so I just copied Linus' sha1 code ;-) All I wanted was something that had a good chance of being unique across any potential set of distributed repositories, to avoid the chance of accidental clashes. A sha1 hash of something that's not likely to be replicated is a simple way to do that. Simon -- PGP public key Id 0x144A991C, or http://himi.org/stuff/himi.asc (crappy) Homepage: http://himi.org doe #237 (see http://www.lemuria.org/DeCSS) My DeCSS mirror: ftp://himi.org/pub/mirrors/css/ signature.asc Description: Digital signature
Re: Merge with git-pasky II.
CL == Christopher Li [EMAIL PROTECTED] writes: - Result is this object $SHA1 with mode $mode at $path (takes one of the trees); you can do update-cache --cacheinfo (if you want to muck with dircache) or cat-file blob (if you want to get the file) or both. CL Is that SHA1 for tree or the file object? I am talking about a single file here. - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Merge with git-pasky II.
On Fri, Apr 15, 2005 at 12:43:47AM -0700, Junio C Hamano wrote: CL == Christopher Li [EMAIL PROTECTED] writes: CL Is that SHA1 for tree or the file object? I am talking about a single file here. Then do you emit the entry for it's parents directory? e.g. /foo/bar get created. foo doesn't exists. You have to create foo first. You don't have mode information for foo yet. If it give the top level tree, the SCM can check it out by tree. hopefully have the mode on directory correctly. Well, if they care about those little details. Chris - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Merge with git-pasky II.
On Fri, 2005-04-15 at 11:36 +0200, Ingo Molnar wrote: do such cases occur frequently? In the kernel at least it's not too typical. Isn't it? I thought it was a fairly accurate representation of the process I make a whole bunch of changes to files I maintain, pulling from Linus while occasionally asking him to pull from my tree. Sometimes my files are changed by someone else in Linus' tree, and sometimes I change files that I don't actually own.. -- dwmw2 - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Merge with git-pasky II.
On Fri, Apr 15, 2005 at 02:03:08PM +0200, Johannes Schindelin wrote: I disagree. In order to be trusted, this thing has to catch the following scenario: Skywalker and Solo start from the same base. They commit quite a lot to their trees. In between, Skywalker commits a tree, where the function kazoom() has been added to the file deathstar.c, but Solo also added this function, but to the file moon.c. A file-based merge would have no problem merging each file, such that in the end, kazoom() is defined twice. The same problems arise when one tries to merge line-wise, i.e. when for each line a (possibly different) merge-parent is sought. Be careful. There is a very big tradeoff between 100% perfections in catching these sorts of errors, and usability. There exists SCM's where you are not allowed to do commit such merges until you do a test compile, or run a regression test suite (that being the only way to catch these sorts of problems when we merge two branches like this). BitKeeper never caught this sort of thing, and we trusted it. In practice it was also rarely a problem. I'll also note that BitKeeper doesn't restrict you from doing a committing a changeset when you have modified files that have yet to be checked in to the tree. Same issue; you can accidentally check in changesets that in trees that won't build, but if we added this kind of SCM-by-straightjacket philosophy it would decrease our productivity and people would simply not use such an SCM, thus negating its effectiveness. - Ted - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Merge with git-pasky II.
PB == Petr Baudis [EMAIL PROTECTED] writes: PB I can't see the conflicts between what I want and what Linus wants. PB After all, Linus says that I can use the directory cache in any way I PB please (well, the user can, but I'm speaking for him ;-). So I'm doing PB so, and with your tool I would get into problems, since it is suddenly PB imposing a policy on what should be in the index. I think our misunderstanding is coming from the use of the word merge tree. I think you have been assuming that I wanted you to run merge-trees -o ,,merge --- which would certainly cause me to muck with your dircache there. I totally agree with you that that is a *BAD* *THING*. No question there. However, my assumption has been different. I was assuming that you would run merge-trees -o merge~tree (i.e. different from your merge tree), so that you can get the merge results in a form parsable by you. And then, using that information, you can make your changes in ,,merge. After you are done with that information, you can remove merge~trees, of course. The format I chose for the merge result in a form parsable by you happens to be a dircache in merge~tree, with minimum number of files checked out when merge cannot be automatically done safely. In the simplest case of not having any conflicting merge between $C and $merged, Cogito can immediately run write-tree in merge~tree (not ,,merge) to obtain its tree-ID $T, so that it can feed it to diff-tree to compare it with whatever tree state Cogito wants to apply the merges between $C and $merged to. I still do not understand what you do in ,,merge directory, but here is one way you can update the user working directory in-place without having a ,,merge directory [*2*]. You can run your git diff between $C and $T [*1*]. The result is the diff you need to apply on top of your user's working files. If the user does not like the result of running that diff, it can easily be reversed. If a manual merge were needed between $C and $merged, Cogito could guide the user through that manual edit in merge~tree, and run update-cache on those hand merged files in merge~tree, before running write-tree in merge~tree to obtain $T; after that, everything else is the same. You make interesting points in other parts of your message I need to regurgitate for a while, so I would not comment on them in this message. [Footnote] *1* I really like the convenience of being able to use tree-ID and commit-ID interchangeably there. Thanks. *2* I understand that this would change the user's git-tools experience a bit. The user will not be told to go to ,,merge and commit there which will reflected back to your working tree anymore. Instead the merge happens in-place. Committing, not committing, or further hand-fixing the merge is up to the user. I suspect this change might even be for the better. - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Merge with git-pasky II.
PB == Petr Baudis [EMAIL PROTECTED] writes: PB Bah, you outran me. ;-) Just being in a different timezone, I guess. PB I'll change it to use the cool git-pasky stuff (commit-id etc) and its PB style of committing - that is, it will merely record the update-caches PB to be done upon commit, and it will read-tree the branch we are merging PB to instead of the ancestor. (So that git diff gives useful output.) Sorry, I have not seen what you have been doing since pasky 0.3, and I have not even started to understand the mental model of the world your tool is building. That said, my gut feeling is that telling this script about git-pasky's world model might be a mistake. I'd rather see you consider the script as mere part of the plumbing. Maybe adding an extra parameter to the script to let the user explicitly specify the common ancestor to use would be needed, but I would prefer git-pasky-merge to do its own magic (converting symbolic commit names into raw commit names and such) before calling this low level script. That way people like me who have not migrated to your framework can still keep using it. All the script currently needs is a bare git object database; i.e., nothing other than what is in .git/objects and a couple of commit record SHA1s as its parameters. No .git/heads/, no .git/HEAD.local, no .git/tags, are involved for it to work, and I would prefer to keep things that way if possible. * show-diff updates to add -r flag to squelch diffs for files not in the working directory. This is mainly useful when verifying the result of an automated merge. PB -r traditionally means recursive - what's the reasoning behind the PB choice of this letter? Well, '-r' is not necessarily recursive. ls -r is reverse, sort -r is reverse. less -r is raw. cat -r is reversible. nethack -r is race ;-). You are thinking as an SCM person so it may look that way. diff -r is recursive. darcs add -r is recursive. But even in the SCM world, cvs add -r is not (it means read-only) neither co -r (explicit revision) ;-). I would rather pick '-q' if I were doing the patch today, but I was too tired and did not think of a letter when I wrote it. I guess '-r' stood for removed, but I agree it is a bad choice. Any objections to '-q'? PB use strict? Not in this iteration but eventually yes. PB It'd be simpler to do just PB my @common = (map { $common{$_} } PB sort { $b = $a } PB keys %common) Well, actually you spotted a bug between the implementation and what I wanted to do. It should have been: map { $_-[0] } sort { $b-[1] = $a-[1] } map { [ $common{$_} = $_ ] } keys %common That is, sort [anscestor = number of times it appears] tuple by the number of times it appears in decreasing order, and project the resulting list to a list of ancestors. It is trying to deal with the following pattern in rev-tree output: TIMESTAMP1 EDGE1:1 ANCESTOR1:3 ANCESTOR2:3 TIMESTAMP2 EDGE2:2 ANCESTOR1:3 and when the above happens I wanted to pick up ANCESTOR1, but that was without no sound reason. PB But I really think this is a horrible heuristic. I believe you should PB take the latest commit in the --edges output, and from that choose the PB base whose rev-tree --edges the_base merged_branch has the least lines PB on output. (That is, the path to it is shortest - ideally it's already PB part of the merged_branch.) I'll try something along that line. Honestly the ancestor selection part was what I had most trouble with. Thanks. PB What about PB sub OLDMODE { 0 } PB sub NEWMODE { 1 } PB sub NEWSHA { 2 } PB and then using that when accessing the tuple? Would make the code PB much more readable. Totally agreed; readability cleanup is needed, just as use strict you mentioned, before it is ready for public consumption. Remember, however, the primary purpose of the message was to share it with Linus so that I can ask his opinion while the script was still slushy; the contents that array contained was still changing then and was too early for symbolic constants. I'll do that in the next round. PB It is a good idea to check merge's exit code and give a notice at the PB end if there were any conflicts. In principle yes, but I noticed that merge already gave me a nice warning message when it found conflicts, so there was no need to do so myself in this case. See sample output: $ perl ./git-merge.perl \ 71796686221a0a56ccc25b02386ed8ea648da14d \ bb95843a5a0f397270819462812735ee29796fb4 Common ancestor: 9f02d4d233223462d3f6217b5837b786e6286ba4 O - COPYING O - README ... O - write-tree.c A M write-blob.c A M show-diff.c ... A M update-cache.c A M git-merge.perl B M merge-tree.c MRG Makefile merge: warning: conflicts during merge $ +# Create a temporary directory and go there. +system 'rm', '-rf', ',,merge-temp'; PB Can't we call it just ,,merge? I'd rather have a command line option
Re: Re: Re: Merge with git-pasky II.
Dear diary, on Thu, Apr 14, 2005 at 10:23:26PM CEST, I got a letter where Erik van Konijnenburg [EMAIL PROTECTED] told me that... On Thu, Apr 14, 2005 at 09:35:07PM +0200, Petr Baudis wrote: Hmm. I actually don't like this naming. I think it's not too consistent, is irregular, therefore parsing it would be ugly. What I propose: 12c\tname - legend - original file D - tree #1 removed file D- tree #2 removed file DD- both trees removed file M - tree #1 modified file M DM* - conflict, tree #1 removed file, tree #2 modified file MD* MM- exact same modification MM* - different modifications, merging This is generic, theoretically scales well even to more trees, is easy to parse trivially, still is human readable (actually the asterisk in the 'conflict' column is there basically only for the humans), is completely regular and consistent. Detail: perhaps use underscore instead of space, to avoid space/tab typos that are invisible on paper and user friendly mail clients? I'd go for dots in that case. Looks less intrusive. :^) -- Petr Pasky Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Re: Merge with git-pasky II.
Is that some thing you want to see? Maybe clean up the error printing. Chris --- /dev/null 2003-01-30 05:24:37.0 -0500 +++ merge.py2005-04-14 16:34:39.0 -0400 @@ -0,0 +1,76 @@ +#!/usr/bin/env python + +import re +import sys +import os +from pprint import pprint + +def get_tree(commit): +data = os.popen(cat-file commit %s%commit).read() +return re.findall(r(?m)^tree (\w+), data)[0] + +PREFIX = 0 +PATH = -1 +SHA = -2 +ORIGSHA = -3 + +def get_difftree(old, new): +lines = os.popen(diff-tree %s %s%(old, new)).read().split(\x00) +patterns = (r(\*)(\d+)-(\d+)\s(\w+)\s(\w+)-(\w+)\s(.*), + r([+-])(\d+)\s(\w+)\s(\w+)\s(.*)) +res = {} +for l in lines: + if not l: continue + for p in patterns: + m = re.findall(p, l) + if m: + m = m[0] + res[m[-1]] = m + break + else: + raise difftree: unknow line, l +return res + +def analyze(diff1, diff2): +diff1only = [ diff1[k] for k in diff1 if k not in diff2 ] +diff2only = [ diff2[k] for k in diff2 if k not in diff1 ] +both = [ (diff1[k],diff2[k]) for k in diff2 if k in diff1 ] + +action(diff1only) +action(diff2only) +action_two(both) + +def action(diffs): +for act in diffs: + if act[PREFIX] == *: + print modify, act[PATH], act[SHA] + elif act[PREFIX] == '-': + print remove, act[PATH], act[SHA] + elif act[PREFIX] == '+': + print add, act[PATH], act[SHA] + else: + raise unknow action + +def action_two(diffs): +for act1, act2 in diffs: + if len(act1) == len(act2): # same kind type + if act1[PREFIX] == act2[PREFIX]: + if act1[SHA] == act2[SHA] or act1[PREFIX] == '-': + return action(act1) + if act1[PREFIX]=='*': + print do_merge, act1[PATH], act1[ORIGSHA], act1[SHA], act2[SHA] + return + print unable to handle, act[PATH] + print one side wants, act1[PREFIX] + print the other side wants, act2[PREFIX] + + +args = sys.argv[1:] +if len(args)!=3: +print Usage merge.py common rev1 rev2 +trees = map(get_tree, args) +print checkout-tree, trees[0] +diff1 = get_difftree(trees[0], trees[1]) +diff2 = get_difftree(trees[0], trees[2]) +analyze(diff1, diff2) + - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html