Re: Master passphrase approach, authn storage, cobwebs in C-Mike's head, ...
On 04/06/2012 10:47 AM, Greg Stein wrote: Correct. Still useful, but even if memory is compromised, the SHA1 is not reversible. The original MP cannot be recovered for other uses. Just as a reminder, SHA-1 is not recommended for use in new applications at this time (http://csrc.nist.gov/groups/ST/hash/policy.html).
Re: Master passphrase approach, authn storage, cobwebs in C-Mike's head, ...
On 04/06/2012 10:55 AM, Greg Stein wrote: In other words, changing the master passphrase only requires decrypting and re-encrypting one 256-bit encryption key, not the whole credentials store. PBKDF2 is in the current design to make dict attacks computationally impossible. Assuming we keep that, then the above value would be fed in as the secret to PBKDF2, rather than MP or sha1(MP) ? If I understand you correctly, that wouldn't make sense. PBKDF2 is designed to provide some resistance against offline dictionary attacks against a weak secret, at the cost of computational power for legitimate users. If you have a strong secret, there's no point in running it through PBKDF2. Under the suggested architecture, you'd use PBKDF2(MP) to decrypt the master key, and then use the master key to decrypt the individual passwords. I also want to caution that PBKDF2 does not provide strong protection against offline dictionary attacks. Most cryptographic methods provide exponential protection--I do a little bit more work to make you do twice as much work. PBKDF2 provides only linear protection--I do twice as much work to make you do twice as much work. It does not make dictionary attacks impossible in the same sense that AES-128 makes decryption without knowing the key impossible. If a system can be designed to prevent offline dictionary attacks entirely, that's much better. But for this application, that's probably impossible, since it's easy to distinguish a valid result (a password, which will be printable ASCII) from garbage.
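The two-level design described here can be sketched in a few lines of Python. This is an illustrative sketch only: XOR stands in for a real cipher such as AES so the example stays standard-library-only, and all names, salts, and iteration counts are made up:

```python
import hashlib
import os
import secrets

# PBKDF2 stretches the (possibly weak) master passphrase into a
# key-encryption key; a strong random master key, wrapped under that
# key, is what actually protects the cached passwords.
def kek_from_passphrase(passphrase, salt, iterations=100_000):
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode(), salt, iterations)

def xor(a, b):                       # stand-in for AES encrypt/decrypt
    return bytes(x ^ y for x, y in zip(a, b))

salt = os.urandom(16)
master_key = secrets.token_bytes(32)             # strong 256-bit secret

wrapped = xor(master_key, kek_from_passphrase("old passphrase", salt))

# Changing the passphrase re-wraps only these 32 bytes, not the store:
rewrapped = xor(master_key, kek_from_passphrase("new passphrase", salt))

# Unlocking with the new passphrase recovers the same master key:
assert xor(rewrapped, kek_from_passphrase("new passphrase", salt)) == master_key
```

The point of the indirection is visible in the last two lines: a passphrase change touches only the 32-byte wrapped key, while every cached password stays encrypted under the unchanged master key.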
Re: Master passphrase approach, authn storage, cobwebs in C-Mike's head, ...
On 04/06/2012 01:44 PM, Justin Erenkrantz wrote: On Fri, Apr 6, 2012 at 8:09 AM, Greg Hudson ghud...@mit.edu wrote: I also want to caution that PBKDF2 does not provide strong protection against offline dictionary attacks. Most cryptographic methods provide exponential protection--I do a little bit more work to make you do twice as much work. PBKDF2 provides only linear protection--I do twice as much work to make you do twice as much work. It does not make dictionary attacks impossible in the same sense that AES-128 makes decryption without knowing the key impossible. Is it worth looking at scrypt[1] instead of PBKDF2? -- justin Possibly. It depends on whether you care about things like NIST review (PBKDF2 is recommended in NIST SP 800-132) versus the theoretical advantages of a less heavily scrutinized algorithm. That's always a tough choice. The fundamental nature of scrypt isn't different from the fundamental nature of PBKDF2; both seek to add a fixed multiplier to the cost of both the legitimate user and the attacker. scrypt is designed to make it more difficult to use massively parallel hardware to mount the attack, by requiring more memory (if I skimmed the paper correctly).
Re: [Issue 4145] Master passphrase and encrypted credentials cache
On 03/26/2012 09:00 AM, C. Michael Pilato wrote: The on-disk cache will contain everything it does today where plaintext caching is enabled, save that the password won't be plaintext, and there will be a bit of known encrypted text (for passphrase validation). Is it important to be able to locally validate the passphrase? That property intrinsically enables offline dictionary attacks. We'd need to pull in additional dependencies that have freely available implementations on all our supported platforms. Blowfish, 3DES, or somesuch. Unfortunately, there's more complexity in an encrypted password store than you probably anticipated, and it's definitely possible to lose some or most of your intended security properties if you get it wrong. The choice of best cipher algorithm today is very simple (AES, although you'll have to pick the key size from 128/192/256 bits), but you do need to decide whether you want to be cipher-agile. Basically, if AES becomes a weak choice down the road (and it probably will, though it could be decades), is it better to be able to swap out the algorithm inside the password storage system, or better to just plan to swap out the system entirely for a redesigned one? Either position is defensible. You'll need to pick a function to map a passphrase to a crypto key. If you do a bad job, it will become easier to brute-force search for keys because your key distribution won't be even. To ensure even distribution, you typically need to use a hash function, which is an added dependency. String-to-key functions are also often deliberately slow, to make offline dictionary attacks harder. PBKDF2 (RFC 2898) is a reasonable choice here, and is implemented in some crypto libraries. You'll need to pick an appropriate cipher mode. If you simply use ECB (where you chunk the plaintext up into blocks and encrypt each block with the key), it will become easy to tell which passwords, or parts of passwords, are the same as which others. 
Maybe not a critical flaw, but certainly an avoidable one. If you use CTR (where you encrypt counter values with the key and XOR the result with the password), you'll need to make sure that counter values are never reused, or it will become easy to recover passwords without the key. CBC with a random initialization vector is also an option. If you don't use CTR mode, you'll need to pick a reversible padding function for the plaintext so that its length is a multiple of the cipher's block size. This is pretty simple. It's probably wise to look at what another implementation does. I'm not sure what password store implementations have made obvious mistakes and which ones haven't; I wish I had a better reference to give, but I don't know the state of the art for this particular application of crypto as well as others.
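As an illustration of the reversible-padding point, here is a minimal PKCS#7-style pad/unpad in Python. This is one common convention, shown for illustration only, not a claim about what the credentials store should use:

```python
BLOCK = 16  # e.g. the AES block size

def pad(data: bytes) -> bytes:
    # Append N copies of the byte N, where 1 <= N <= BLOCK. A full
    # block of padding is added when the input is already aligned,
    # so unpadding is always unambiguous.
    n = BLOCK - (len(data) % BLOCK)
    return data + bytes([n]) * n

def unpad(padded: bytes) -> bytes:
    n = padded[-1]
    if not 1 <= n <= len(padded) or padded[-n:] != bytes([n]) * n:
        raise ValueError("bad padding")
    return padded[:-n]

assert unpad(pad(b"hunter2")) == b"hunter2"
assert len(pad(b"0123456789abcdef")) == 32   # aligned input still grows
```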
Re: Why do we check the base checksum so often?
On 02/04/2012 08:02 PM, Hyrum K Wright wrote: I don't know if apr has a sha256 implementation, but it wouldn't be hard to find one. I'll point out that we're nearing the end of a selection process for SHA-3, with a winner expected to be announced some time this year. The winner may wind up being faster than SHA-256 or even SHA-1. (For instance, one of the five finalists, Skein, is performance-competitive with SHA-1 according to numbers in a paper by its authors: http://www.skein-hash.info/sites/default/files/skein1.3.pdf) It sounds like wc-ng is somewhat hash-agile by virtue of the format number and upgrade process. It sounds like Ev2 may not be very hash-agile. If so, it's probably a bad idea to carve SHA-1 in stone, as it is already showing weaknesses. SHA-256 is likely to have a much longer useful lifetime, SHA-3 even more so. In a pinch, SHA-256 implementations can be pretty small; the one I have on hand is about 200 lines of code.
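For a sense of scale, hash agility is cheap wherever a library already provides the algorithms behind one interface; Python's hashlib, for example, exposes SHA-1 and SHA-256 identically (shown purely for illustration; APR-based C code would of course need its own implementation):

```python
import hashlib

data = b"file contents to checksum"
sha1 = hashlib.sha1(data).hexdigest()      # 160 bits -> 40 hex chars
sha256 = hashlib.sha256(data).hexdigest()  # 256 bits -> 64 hex chars

assert len(sha1) == 40
assert len(sha256) == 64
```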
Re: [RFC] ra_svn::make_nonce: how to cope with entropy shortages?
On 11/03/2011 01:44 AM, Jonathan Nieder wrote: What do you think? Is forcing !APR_HAS_RANDOM and just using apr_time_now() as Debian currently does safe, or does it expose users to a security risk? I suspect it makes the server vulnerable to a replay attack. The right answer is to use /dev/urandom. Using /dev/random has highly questionable advantages over using /dev/urandom, and it's unfortunate that APR only provides an interface to one and not the other. A longer analysis: if a system has collected even a small amount of entropy (128 bits) relative to what an attacker can guess since boot, it can generate an infinite amount of pseudo-random data without risk of vulnerability, if it uses a suitable PRNG function. The actual dangers are that (1) the system has not accumulated enough entropy, and maybe we should wait until it has, or (2) the system has a bad PRNG function. Using /dev/random does not protect against either threat very effectively. As for the first threat, it's very difficult to mitigate because a system cannot generally estimate its entropy very well. It throws possible entropy events into a pool and mixes them together, but it doesn't have a very good measure of how guessable those events were. PRNG algorithms like Fortuna seek to guarantee that the PRNG will eventually reach an unguessable state (by ensuring that more and more possible entropy is used each time the internal state is updated, until eventually an update happens that the attacker can't brute-force), but they can't tell you when they've reached that point. As for the second threat, at least on Linux, /dev/random output still comes from the PRNG. It just keeps an internal counter and blocks when the PRNG has run out of its guess at estimated input entropy. (This is exceedingly silly, because a PRNG doesn't use up input entropy as it generates results; either it has unguessable internal state or it doesn't.) 
An application can only protect against a poor system PRNG by implementing its own generator, and it's far simpler to declare it the system's responsibility to fix its PRNG if there's a security issue associated with it.
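In code, "use /dev/urandom" amounts to very little. A sketch in Python, where os.urandom wraps the kernel's non-blocking pool (the function name here is illustrative):

```python
import os

def make_nonce(nbytes: int = 16) -> bytes:
    # os.urandom reads from the kernel PRNG (/dev/urandom on Unix);
    # it never blocks on an entropy estimate the way /dev/random does,
    # which per the argument above is fine once the pool has ever been
    # properly seeded.
    return os.urandom(nbytes)

n1, n2 = make_nonce(), make_nonce()
assert len(n1) == 16 and n1 != n2   # 128 random bits; repeats are negligible
```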
Re: [RFC] ra_svn::make_nonce: how to cope with entropy shortages?
On 11/03/2011 05:10 PM, Jonathan Nieder wrote: Why would that be? When someone dumps in 20 bits of data from a strong, in-hardware, random number source, even if the PRNG is utterly stupid, it can have an unguessable 20 bits of internal state. After reading enough random numbers, I will have enough information to guess the seed and learn what comes next. If you want to attack a PRNG, you need very little of the output state--only enough to distinguish between the possible values of the generator seed. What you do need is for the generator seed to be partially guessable; otherwise, you will be trying to brute-force a 128-bit or 256-bit seed, which is impractical. If I somehow know the initial generator state, and you reseed your generator with only 20 unguessable bits, I will be able to determine those bits using 20 bits of output and 2^20 effort (which is easy), and then I will know all of the generator state again. However, if you reseed with enough unguessable bits that I can't brute-force them, it doesn't matter how much output I see; I will never again be able to determine the internal state. For the Fortuna generator, for instance, if I discover a way to determine the generator state solely by observing the output, then I will also have discovered a plaintext recovery attack against AES-256. For more, see chapter 9 of _Cryptography Engineering_. "A good PRNG helps mitigate that somewhat" More than somewhat. Any PRNG which doesn't have the above properties in its generator is insecure for any cryptographic purpose, and would be considered a security bug in the operating system. In another message, Peter Samuelson wrote: apr_time_now() has microsecond resolution. It has microsecond precision but not necessarily microsecond accuracy. For instance, http://fixunix.com/kernel/37-gettimeofday-resolution-linux.html suggests that two requests arriving within a 10ms window could get the same nonce.
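The 2^20-effort attack described above is easy to demonstrate with a toy generator. The sketch below uses Python's random.Random (a non-cryptographic PRNG, standing in for any generator whose output is fully determined by its seed) and a 16-bit seed so the search finishes quickly; the argument is identical at 20 bits, just sixteen times more work:

```python
import random

SEED_BITS = 16                       # kept small so the demo runs fast
secret_seed = random.randrange(2 ** SEED_BITS)

# The attacker observes a little output...
observed = random.Random(secret_seed).getrandbits(64)

# ...and recovers the entire generator state by brute-forcing the seed.
recovered = next(s for s in range(2 ** SEED_BITS)
                 if random.Random(s).getrandbits(64) == observed)
assert recovered == secret_seed      # attacker can now predict all future output
```

With a 128-bit or 256-bit unguessable reseed, the same loop becomes computationally infeasible, which is the whole point of the argument above.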
Re: Thoughts about issue #3625 (shelving)
On Fri, 2011-09-09 at 08:09 -0400, Greg Stein wrote: Greg Hudson said this is more akin to git stash than branches. I haven't used git's stashes to see how it actually differs from branches. I guess it is simply that changing branches leaves local mods, rather than stashing pseudo-reverts the local mods.
* Branches record tree states, while stashes record changesets as applied to the working copy and index. You can git stash, git checkout branchname, and git stash pop the changeset such that it is now applied against the new branch, if it applies cleanly. You can do similar things with branches using rebase, but the sequence of operations would be different and more complicated.
* Stashes are a working copy feature, and aren't part of the history model. This isn't necessarily an interesting distinction for us, but it has some consequences within the universe of git--a subsidiary repository's git fetch won't retrieve stashes, they won't be in the target space of commit specifiers, you don't have to create commit messages for them, etc.
Stashes don't make git more expressive than local branches and rebase, but in some ways it's a useful UI concept to keep them separate. Mercurial calls it shelving. Aha. I'll note that shelving isn't a core feature of Mercurial but an extension. Even if there are aliases so the command is accessible via both names, the feature needs to have a primary name (which will be how it's documented in the book, etc.).
Re: Thoughts about issue #3625 (shelving)
On Thu, 2011-09-08 at 23:43 -0400, Greg Stein wrote: I've had to use git lately, and our shelves could almost look like git's branches. Swap around among them based on what you're doing at the time. I think this is closer to git's stash feature than git branches. In fact, I was thinking of jumping in and asking why this was being called something gratuitously different.
Re: RE: Proxy authentication with Negotiate uses wrong host
On Wed, 2011-08-24 at 07:42 -0400, 1983-01...@gmx.net wrote: Are you referring to sole Kerberos or are you just concerned about transport encryption? Your statement somewhat irritates me. Given that the HTTP traffic cannot be securely wrapped into the GSS content and nor the SASL QOP can be set (like for LDAP), I would neglect that and still say TLS is not of your concern but of mine or the users in general. Any authentication-only mechanism used over an insecure channel is vulnerable to MITM attacks which preserve the authentication and change the data. Of course, this applies to HTTP basic and digest over raw HTTP just as much as it does to negotiate, so perhaps it doesn't make sense to restrict negotiate auth to HTTPS only on this basis alone. A further concern with HTTP negotiate is that it is scoped to the TCP connection and not to a single HTTP request. Ignorant proxies may combine TCP connections for multiple users' requests and inadvertently authenticate one user's requests with another's credentials. I may be wrong, but I believe this is the concern which leads implementations to restrict NTLM to HTTPS. Switching from NTLM to Kerberos does not mitigate this concern at all. If there are other vulnerabilities in NTLM which don't presuppose an MITM attack, perhaps I'm wrong.
RE: Proxy authentication with Negotiate uses wrong host
On Wed, 2011-08-24 at 05:52 -0400, Bert Huijben wrote: Then somebody added Kerberos support to neon, but the api wasn't updated to allow different behavior for the specific implementations. Kerberos via HTTP negotiate is also insecure when not used over HTTPS. In HTTP negotiate, the GSSAPI mechanism (Kerberos) isn't used to protect the data stream, only to authenticate. So you still need a secure channel. (Also, negotiate auth does no channel binding, which means Kerberos provides no additional protection against MITM attacks on the TLS channel. That just means it's still important for the client to verify the server cert. I've heard that Microsoft has some extensions to RFC 4559 to do channel binding, but I don't know any details and Neon almost certainly doesn't have any support for it.)
Re: Did we have ^/clients?
On Tue, 2011-08-16 at 14:14 -0400, Daniel Shahaf wrote: r6881 implies that a ^/clients directory existed until r6880: https://svn.apache.org/viewvc/subversion/README?r1=846955&r2=846954&pathrev=846955&diff_format=f kfogel on IRC recalls it having existed. I remember svn (the command) living under subversion/clients/svn and being moved to subversion/svn. If there's no evidence of this in our Subversion history, maybe the move happened back when we were still using CVS. (I don't believe we preserved our CVS history when we started self-hosting, because cvs2svn was a difficult and not-yet-solved problem.)
Re: It's time to fix Subversion Merge
On Mon, 2011-07-11 at 12:48 -0400, Mark Phippard wrote: 2. Subversion does not handle move/rename well (tree conflicts) [...] When this problem was first approached (before we came up with tree conflicts) it hit a brick wall where it seemed a new repository design was needed: It's worth considering that git has a reputation for good merge support even though it has no commit-time copy/rename history whatsoever in its history model. By contrast, bzr paid a lot of attention to container history and merge support in the face of tree reorgs, and it clearly isn't as much of a killer feature as its designers had expected (http://www.markshuttleworth.com/archives/123). So, one possible way forward is to decide that copy history is just a hint for svn log and that merging should ignore it.
Re: [PATCH] Fix for issue 3813
On Wed, 2011-06-22 at 02:29 -0400, Daniel Shahaf wrote: From looking at the code, svn_io_open_unique_file3() would force the file to have a mode of 0600|umask() instead of just 0600 The umask removes file permissions from the mode argument to open(); it doesn't add permissions. (Unless there's something unusual about this code.)
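A quick demonstration of the point (Python here for brevity, on a POSIX system): the umask can only clear bits from the mode requested in open(2), never add them.

```python
import os
import stat
import tempfile

old_umask = os.umask(0o022)          # a typical umask
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo")

fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)   # request mode 0600
os.close(fd)
mode = stat.S_IMODE(os.stat(path).st_mode)

os.remove(path)
os.rmdir(tmpdir)
os.umask(old_umask)

assert mode == 0o600                 # umask 022 clears nothing from 0600
assert (0o666 & ~0o022) == 0o644     # ...but it would trim a 0666 request
```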
Re: diff wish
On Wed, 2011-06-15 at 09:38 -0400, Johan Corveleyn wrote: But I don't like the hand-waving discussion that it will always be superior, period. That's just not true. And it would be a big mistake, IMHO, to only support a heuristic diff. If it's a big mistake to use a heuristic diff by default, then adding options to change the diff algorithm will not mitigate this mistake. Similarly, adding options to support a heuristic diff as not-the-default is almost completely useless. I know from experience that it's very easy to stare at a problem for long enough to convince yourself that other people care about it as much as you do, but in reality, to a very good approximation, nobody wants to play around with diff algorithm options. There are probably a few dozen people out there who have configured git diff to use --patience by default and like it, but in the scheme of things, it's dead code. Options come at a cost in code complexity and documentation bulk. Supporting options for the sake of a very small fraction of users, without strong evidence of a compelling need for those users, is not the right tradeoff for a code base.
Re: diff wish
On Wed, 2011-06-15 at 11:30 -0400, Johan Corveleyn wrote: Okay, I guess we should then also rip out --ignore-space-change and --ignore-eol-style, and perhaps --show-c-function. Or, if it's preferred that ignore-space-change and ignore-eol-style be used by default (because humans are normally not interested in changes in amount of whitespace), we should use those options by default, and not provide an option to disable them. Fine by me. Those are not options for determining the diff algorithm. They are options for preprocessing the diff inputs or postprocessing the output. Although they're probably only used by a small minority of users, there's pretty strong evidence of a compelling need for them.
Re: svn commit: r1136114 - /subversion/trunk/configure.ac
On Wed, 2011-06-15 at 13:28 -0400, Philip Martin wrote: Do they all support -s? cmp -s is one of the most portable Unix command invocations in existence (from general knowledge; I can't give a reference).
Re: Improvements to diff3 (merge) performance
On Mon, 2011-06-13 at 07:00 -0400, Morten Kloster wrote: I assume he has discussed this elsewhere in more detail? The link you provided says very little about it (and the ONLY hit for "implicit cherrypicking" on Google was your post :-). Yes, but I'm not sure where any more, unfortunately. Possibly here: http://lists.zooko.com/pipermail/revctrl/ but that's a big archive to look through. Complicating matters, Codeville merge operates on the entire history of the two nodes, rather than just a common base. As mentioned above, my original proposal was somewhat more aggressive than strictly necessary for my purposes. I think if you limit the merging to strictly larger changes between sync points, the false negative rate shouldn't go up too much. Also, I think the user should be allowed to specify how aggressive the merge algorithm should be as an option. Perhaps, but most users aren't going to want to fiddle with merge options, so the onus is still on the system to pick a good default. (It does help if the options immediately make sense, which they do in this proposal. Options like git's --patience and --strategy octopus are especially unlikely to be used productively, I would think.)
Re: Improvements to diff3 (merge) performance
My executive summary of your post is that you want diff3 to try to merge related, but not identical, changes occurring between a pair of sync points. I'm wary about this for two reasons. First, the benefit appears to arise chiefly for what Bram Cohen calls "implicit cherrypicking" use cases--that is, cases where a change is merged and then merged again together with other changes. After extensive research, Bram eventually concluded that trying to support this is a bad idea (http://bramcohen.livejournal.com/52148.html). I tend to think that a merge algorithm should not increase its false negative rate for the benefit of implicit cherrypicking. Second, I can see a lot of potential for incorrect merges based on superficial similarities. For instance, if you and I both add code between the same two sync points, and the last line of my change is the same as the first line of yours (perhaps a blank line), that could be enough to unambiguously determine an ordering. Perhaps both of our code additions do the same thing in different ways, and now it gets done twice (which is almost never desirable). Certainly, the existing algorithm can produce incorrect merges too, but my intuition says that the practical likelihood would become much higher. Of course, I could be off base. All merge algorithms are heuristic, and it would take a lot of research to really compare the efficacy of one against another. You need a cost function to determine the tradeoff between the false negative rate and the false positive rate, and you also need to measure how any given algorithmic change affects the false negative and false positive rates in practice. Both of these seem really hard. It would definitely affect my opinion if I learned that the three-way merge algorithms in other popular version control systems worked this way, or if I learned that Subversion was more restrictive than, say, gnu diff3.
Re: strange error message
On Wed, 2011-05-18 at 14:24 -0400, Stefan Küng wrote: the "not to all point"... just doesn't sound right. It's a split infinitive, which doesn't make it necessarily bad English but can make it sound wrong. "Not to point to the same repository" would be more concise and just as precise, in my opinion.
Re: Why do we include debug symbols in !maintainer_mode?
On Thu, 2011-04-14 at 14:25 -0400, Philip Martin wrote: I believe it is a GNU standard. Debug symbols can be used with an optimised build although it is obviously easier to debug without optimisation More specifically: stepping through a -g -O2 executable is pretty painful, but you can still usually get a decent stack trace from one.
Re: Is the svn:// protocol secure when encrypted via SASL?
On Mon, 2011-02-21 at 14:48 -0500, Keith Palmer Jr. wrote: Nothing in what you just copy-pasted indicates whether it's *the actual data stream* that's being encrypted, or just the *authentication*. I need to know if the checked-out files that are being transferred are encrypted or not. The SASL security layer refers to protection of the actual data stream. Encryption of the authentication isn't really a meaningful concept in SASL parlance; mechanisms always perform authentication steps as securely as they are able.
Re: Deltifying directories on the server
On Tue, 2011-02-01 at 10:29 -0500, C. Michael Pilato wrote: I can only really speak for the BDB side of things, but... what he said. I'll elaborate a little bit. API issues aside, we're used to putting artifacts from different versions in different places. More so in FSFS, where it was baked into the initial architecture, but also in BDB for the most part. The most efficient storage for large directories which frequently change by small deltas would be some kind of multi-rooted B-tree. To do that efficiently (that is, without scattering each tiny change into a separate disk block, requiring lots and lots of opens/seeks/reads), you'd want to put artifacts from different versions of a directory all in the same place. You might be able to arrange it so that modifying a directory is an append-only operation, avoiding the need for a lot of copying, but you'd still want a place to append to for each directory, which isn't how FSFS or BDB works. So, I'm not sure we can ever have efficient handling of this use case without designing a completely new back end--which wouldn't be a terrible idea if someone's up to it.
Re: svn commit: r1064168 - in /subversion/trunk/subversion/include/private: svn_eol_private.h svn_fs_util.h svn_mergeinfo_private.h svn_opt_private.h svn_sqlite.h svn_wc_private.h
On Thu, 2011-01-27 at 21:46 -0500, Senthil Kumaran S wrote: A NULL does mean '\0' or (void *) 0x. I also referred this - http://en.wikipedia.org/wiki/Null_character which says the same when referring to NULL termination of a string, except for one place where it says 'NUL' is an abbreviation for NULL character - http://en.wikipedia.org/wiki/NUL NULL (all caps) is a C preprocessor constant used to denote a null pointer. NUL (all caps) is sometimes used as an abbreviation for the null character. Null (uncapitalized except as appropriate for beginning a sentence) is a noun or adjective which can be used in a variety of contexts. You will note that null is never written in all caps in the Wikipedia articles you referenced. So a "NULL-terminated string" is a meaningless concept; you can't use a null pointer to terminate a character string. A "NUL-terminated string" is meaningful, as is a "null-terminated string" or "null-character-terminated string".
Re: gpg-agent branch treats PGP passphrase as repository password?
On Mon, 2010-12-06 at 07:30 -0500, Daniel Shahaf wrote: Ideally, Subversion won't know the PGP passphrase. (If it does, then a malicious libsvn_subr can compromise a private key.) I think you're trying to solve a different problem here. The goal is to minimize typing of passwords without storing passwords in a fixed medium, not to protect keys against malicious or broken Subversion code. For comparison, the ssh-agent protocol[1] only allows a client of the agent to authenticate himself (using the agent) to a third party, but does not have a "Retrieve secret key" option [2]. If we are to use PGP, could we find a solution with similar properties? ssh-agent has special knowledge of the operations which will be performed using the keying material. PGP probably doesn't have any interest in the operations Subversion needs to do with passwords. PKCS#11 is the most commonly used general API for operations where an application can use a key but isn't allowed to know what it is. The most useful free software implementation of PKCS#11 is probably NSS. I don't think we want to go there, though.
Re: [Proposed] Split very long messages by paragraph for easy translate
On Sat, 2010-11-13 at 10:31 -0500, Daniel Shahaf wrote: Sounds reasonable. What changes to the source code would be required? Do we just change N_("three\n\nparagraphs\n\nhere\n") to N_("three\n") N_("paragraphs\n") N_("here\n") No, that would just result in evaluating gettext on the combined string, same as before. I can see two options for help strings in particular: 1. Rev svn_opt_subcommand_desc2_t to include an array of help strings which are translated and displayed in sequence. 2. Change print_command_info2 to look at the help string and break it up at certain boundaries (such as blank lines or numbered list entries) before translating it. (Mercurial is written in Python, so it has different constraints.)
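Option 2 can be prototyped in a few lines (Python, for the idea only; the real change would be C inside print_command_info2, and gettext here is a stand-in identity function):

```python
def gettext(s):
    # Stand-in for the real translation lookup.
    return s

def translate_help(help_text):
    # Break the help string at blank-line boundaries and translate
    # each paragraph separately, so translators see small msgids
    # instead of one huge combined string.
    paragraphs = help_text.split("\n\n")
    return "\n\n".join(gettext(p) for p in paragraphs)

text = "three\n\nparagraphs\n\nhere"
assert translate_help(text) == text   # identity gettext round-trips
```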
Re: FSv2 (was: FREE Apache Subversion Meetup...)
On Tue, 2010-10-19 at 04:31 -0400, Greg Stein wrote: The FSFS backend was dropped in as a fait accompli. A minor correction: ra_svn was dropped in as a fait accompli. FSFS was, as far as I remember, a pretty open process where I created a design and Josh Pieper implemented it. You can look at the commit history of libsvn_fs_fs to see that, and I'm pretty sure that Josh and I were working over the expected open channels (dev list and IRC) at the time.
Re: svn commit: r1003986 [1/2] - in /subversion/trunk/subversion: libsvn_client/ libsvn_fs_base/ libsvn_fs_base/bdb/ libsvn_fs_fs/ libsvn_ra_local/ libsvn_ra_neon/ libsvn_ra_serf/ libsvn_ra_svn/ libsv
On Mon, 2010-10-04 at 06:14 -0400, Julian Foad wrote: The NULL macro is intended for use as a pointer. Only when statically cast to the appropriate pointer type. This happens automatically in many contexts, such as assignments or prototyped function parameters. But it does not happen automatically for variable parameters of a stdarg function. So apr_pstrcat(foo, bar, NULL) really is invalid C code. It's not a practical concern because common platforms use a single pointer representation, but it's a fair warning for a compiler to give. This message brought to you by Language Lawyers Inc.
Re: svn commit: r1003986 [1/2] - in /subversion/trunk/subversion: libsvn_client/ libsvn_fs_base/ libsvn_fs_base/bdb/ libsvn_fs_fs/ libsvn_ra_local/ libsvn_ra_neon/ libsvn_ra_serf/ libsvn_ra_svn/ libsv
On Mon, 2010-10-04 at 12:06 -0400, Julian Foad wrote: The issue at hand is when NULL is defined as an unadorned '0' *and* is passed to a variadic function such as apr_pstrcat. If that's not a practical concern, that must be because the size and representation of (int)0 is the same as (char *)0. If that is true on all supported platforms, then omitting the casts is a valid option; otherwise, we need them. I don't think you have it quite right. If NULL is defined as, say, (void *)0, then the compiler might not warn, but it's still not correct C to pass a void * to a variadic function and then read it out as some other pointer type. void * is only guaranteed to work if it's cast (implicitly or explicitly) to the correct pointer type. We don't see this problem as run-time errors because there is typically no runtime transformation between void * and other pointer types. If NULL is defined as 0, it would be easy to imagine a practical problem where you pass a 32-bit 0 value to a variadic function and then read it back out as a 64-bit pointer type. We don't see this problem in practice because NULL is typically defined as 0L rather than just 0, and sizeof(long) typically matches the size of pointers. In terms of code hygiene, passing an uncasted NULL to a variadic function isn't any worse than using calloc() to initialize pointer fields in structures. But if your compiler is giving you grief about it, the best thing to do is probably to get used to casting in that situation. Phillip wrote: In C++ the cast should be more common since a conforming NULL cannot have a cast but, in the free software world at least, GCC uses compiler magic to make plain NULL work as a pointer without a cast. Unlike C, C++ doesn't allow implicit casts from void * to other pointer types, so defining NULL as (void *)0 would be pretty inconvenient there.
Re: Extensible changeset format proposal
On Thu, 2010-08-26 at 05:57 -0400, anatoly techtonik wrote: Don't you think it is time to design an extensible changeset format for exchanging information about changesets between systems? Mostly for your entertainment, see: http://www.red-bean.com/pipermail/changesets/2003-April/thread.html There was an attempt to create a unified cross-system changeset format seven years ago, but it didn't get very far. However, the principals are different today and more is known about the space of successful DVCS tools.
Re: Looking to improve performance of svn annotate
On Tue, 2010-08-17 at 09:26 -0400, Johan Corveleyn wrote: Greg, could you explain a bit more what you mean with edit-stream-style binary diffs, vs. the binary deltas we have now? Could you perhaps give an example similar to Julian's? Wouldn't you have the same problem with pieces of the source text being copied out-of-order (100 bytes from the end/middle of the source being copied to the beginning of the target, followed by the rest of the source)?

Let's take a look at the differences between a line-based edit stream diff (such as what you'd see in the output of diff -e) and a binary delta as we have in Subversion. The most obvious difference is that the former traffics in lines, rather than arbitrary byte ranges, but the actual differences are much deeper.

A line-based diff can be modeled with the following instructions:

* Copy the next N lines of source to target.
* Skip the next N lines of source.
* Copy the next N lines of new data to target.

After applying a diff like this, you can easily divide the target lines into two categories: those which originated from the source, and those which originated from the diff. The division may not accurately represent the intent of the change (there's the classic problem of the mis-attributed close brace, for instance; see http://bramcohen.livejournal.com/73318.html), but it's typically pretty close.

Subversion binary deltas have a more flexible instruction set, more akin to what you'd find in a compression algorithm. The source and target are chopped up into windows, and for each window you have:

* Copy N bytes from offset O in the source window to target.
* Copy N bytes from offset O in the target window to target.
* Copy the next N bytes of new data to target.

There is no easy way to divide the target into source bytes and diff bytes. Certainly, you can tag which bytes were copied from the source window, but that's meaningless. Bytes which came from the source window may have been rearranged by the diff; bytes which came from new data may only have done so because of windowing.

The optimization idea is to create a new kind of diff (or more likely, research an existing algorithm) which obeys the rules of the line-based edit stream--no windowing, sequential access only into the source stream--but traffics in bytes instead of lines. With such a diff in hand, we can divide the target bytes into source-origin and diff-origin, and then, after splitting the target into lines, determine which lines are tainted by diff-origin bytes and therefore should be viewed as originating in the diff.
Re: Looking to improve performance of svn annotate
On Thu, 2010-08-12 at 10:57 -0400, Julian Foad wrote: I'm wary of embedding any client functionality in the server, but I guess it's worth considering if it would be that useful. If so, let's take great care to ensure it's only lightly coupled to the core server logic. Again, it's possible that binary diffs between sequential revisions could be used for blame purposes (not the binary deltas we have now, but edit-stream-style binary diffs), which would decouple the line-processing logic from the server. (But again, I haven't thought through the problem in enough detail to be certain.)
Re: Looking to improve performance of svn annotate
On Wed, 2010-08-11 at 19:14 -0400, Johan Corveleyn wrote: I naively thought that the server, upon being called get_file_revs2, would just supply the deltas which it has already stored in the repository. I.e. that the deltas are just the native format in which the stuff is kept in the back-end FS, and the server wasn't doing much else but iterate through the relevant files, and extract the relevant bits. The server doesn't have deltas between each revision and the next (or previous). Instead, it has skip-deltas which may point backward a large number of revisions. This allows any revision of a file to be reconstructed in O(log(n)) delta applications, where n is the number of file revisions, but it makes what the server has lying around even less useful for blame output. It's probably best to think of the FS as a black box which can produce any version of a file in reasonable time. If you look at svn_repos_get_file_revs2 in libsvn_repos/rev_hunt.c, you'll see the code which produces deltas to send to the client, using svn_fs_get_file_delta_stream. The required code changes for this kind of optimization would be fairly deep, I think. You'd have to invent a new type of diffy delta algorithm (either line-based or binary, but either way producing an edit stream rather than acting like a compression algorithm), and then parameterize a bunch of functions which produce deltas, and then have the server-side code produce diffy deltas, and then have the client code recognize when it's getting diffy deltas and behave more efficiently. If the new diffy-delta algorithm isn't format-compatible with the current encoding, you'd also need some protocol negotiation.
Re: Bikeshed: configuration override order
On Tue, 2010-08-10 at 14:24 -0400, C. Michael Pilato wrote: The foremost bit of client configuration that CollabNet's Subversion customers are demanding (besides auto-props, which I think we all agree on) is a way for the server to set a policy which dictates that clients may not use plaintext or other insecure password storage mechanisms.

I don't expect anyone to consider my opinion blocking, but I think this is a questionable area for any kind of software to delve into. I've only seen this kind of client control in one other context (a branded Jabber client), and never in an open source project. (*)

Lots and lots of clients are able to remember passwords: web browsers, email clients, IM clients. Lots of central IT organizations (MIT's included) don't like this feature and recommend that users not use it. Lots of users do it anyway. I don't know of a single piece of widely-used client software which allows the server to turn off password memory.

(*) Actually, on consideration, there was some flap about the "okay to print" flag in PDF documents, or something related to that. I can't remember how it turned out.
Re: Bikeshed: configuration override order
On Sat, 2010-08-07 at 07:58 -0400, Daniel Shahaf wrote: Stefan Küng wrote on Sat, Aug 07, 2010 at 12:59:26 +0200: On 07.08.2010 12:44, Daniel Shahaf wrote: If corporations want to have configuration override, fine. But I want a way to disable that completely. I don't necessarily want to allow a random sourceforge repository to control my auto-props settings for a wc of that repository. Maybe a stupid question: why not? Why don't I let ezmlm configure my mailer's "use html?" setting?

I think he was asking for an answer specifically relating to auto-props, not an answer about configuration in general. There's not generally a lot of room for individual disagreement about what auto-props should be for a given project.

My thinking about repository configuration is that the use cases should be divided into two categories:

1. Stuff that isn't really client configuration at all, like auto-props, should come from the repository instead, and should only come from client configuration for compatibility.

2. Stuff that is client configuration should only come from client configuration. Client control is not legitimate business for an open source product, though it could be the business of a proprietary value-added fork.

Note that there's no general extension of the config framework here, no whitelisting or blacklisting, no override settings. Invent a mechanism for getting repository configuration from the server and apply it to the specific use cases in (1), falling back to client configuration as a legacy mechanism.
Re: Proposal: Change repository's UUID over RA layer
When I've mirrored repositories with the intent of keeping them in sync, I've typically given them the same UUID. I don't know if that has much impact in practice, since I think working copies tend to stick to one of the mirrors (either the RW master or the RO slave). The philosophical question here isn't whether the ID is universally unique but what it's identifying. Is it identifying the repository content or the container in which the content is held?
Re: Bug: svnserve fail to detect it is already running
On Fri, 2010-07-09 at 11:44 -0400, Stefan Sperling wrote: As far as I can tell there is little we can do to secure svnserve against this attack on Windows systems other than Server 2003, because APR won't let us set the SO_EXCLUSIVEADDRUSE option. That's okay, we don't want the SO_EXCLUSIVEADDRUSE behavior. We want the default behavior under Windows, which corresponds to the SO_REUSEADDR behavior under Unix.
Re: Suggestion: Transparent Branching
On Wed, 2010-07-07 at 11:44 -0400, Marco Jansen wrote: So therefore, what we would like to see is to be able to have a transparent branch: One which fetches updates from both branch and trunk, without having them listed as changes or triggering commits. In essence it's reading from two branches, where a last known revision of a file could be from either branch, and committing to one only when it has changes from this 'either' latest revision. I'm not sure if this is a feature of any popular version control system. What would happen if trunk changes didn't merge easily with the changes on one or more transparent branches?
Re: Antwort: Re: ... Re: dangerous implementation of rep-sharing cache for fsfs
On Thu, 2010-07-01 at 08:56 -0400, michael.fe...@evonik.com wrote: I better already start to run for it, when I ever approve the use of the current implementation of the representation cache. Here's what this says to me: it doesn't matter what the real risks are; it only matters that the quantifiable mathematical risks I know about be reduced to 0, regardless of the cost. That's sometimes a rational attitude to take in the world of legalities and politics. In the world of engineering, it's not popular, which is why hash-indexed storage and cryptography are used in a wide variety of applications. It's pointless to reduce the quantifiable risks from 2^-(many) to 0 when we know that the human factors and mechanical risks are much larger.
Re: Antwort: Re: dangerous implementation of rep-sharing cache for fsfs
On Fri, 2010-06-25 at 08:45 -0400, michael.fe...@evonik.com wrote: I am actually more interested in finding a reliable solution instead of discussing mathematics and probabilities.

The discussion of probabilities affects whether it would be justifiable to change Subversion to address hash collisions.

1. You are comparing apples and oranges. 2. You can't balance the possibility of one error with that of another.

All systems have a probability of failure, resulting from both human and mechanical elements within the system. It may be difficult to estimate precisely, but one can often establish a lower bound. The question is whether hash-indexed storage increases the probability of failure by a non-negligible amount.

It often results in something like: square_root( a_1 * (error_1 ^ 2) + a_2 * (error_2 ^ 2) + ...)

We're discussing failure rates, not margin-of-error propagation. Failure rates propagate as 1 - ((1 - failure_1) * (1 - failure_2) * ...) if the failure probabilities are independent. If your system has a probability of failure of one in a million from other factors, and we add in an independent failure probability of one in 2^32 from hash-indexed storage, then the overall system failure probability is one in 999767--that is, it doesn't change much.

3. You over-estimate the risk of undetected hardware faults.

I think you over-estimate the risk of hash-index storage collisions.

There is no evidence that the hash values are equally distributed over the data sets, which is important for the use of hashing methods in data fetching.

A hash which had a substantially unequal distribution of hash values among inputs would not be cryptographically sound.

= 3,21*10^2427 sequences of data of 1K size represented by the same hash value.

First, SHA-1 is a 160-bit hash, not a 128-bit hash. Second, the number you calculated does not inform the probability of a collision. If you have N samples, which are not specifically constructed so as to break SHA-1, then the probability of a SHA-1 collision is roughly N^2 / 2^160 (see the birthday paradox for more precision). So, for example, with 2^64 representations (1.8 * 10^19), there would be a roughly 2^-32 probability of a SHA-1 collision in the rep cache. If you can construct a system with close to a one in four billion probability of error from other sources, kudos to you; if not, hash-indexed storage is not perceptibly increasing your error rate.
Re: Antwort: Re: dangerous implementation of rep-sharing cache for fsfs
On Thu, 2010-06-24 at 11:29 -0400, michael.fe...@evonik.com wrote: We must ensure that the data in the repository is, without any concerns, the data we have once measured or written. You do realize that the probability of data corruption due to faulty hardware is much, much more likely than the probability of corruption due to a rep-sharing SHA-1 collision, right?
Re: Optimizing back-end: saving cycles or saving I/O (was: Re: [PATCH v2] Saving a few cycles, part 3/3)
On Wed, 2010-05-12 at 13:44 -0400, Hyrum K. Wright wrote: There may be other ways of caching this information, which would be great. Maybe. Caches add complexity, and too many layers of caching (disk, OS, application) can actually reduce performance in some scenarios. I would suggest getting a better understanding of why this operation is slow. Why is svn log opening each rev file ten times? Is this intrinsically necessary? Going straight to optimizing the overused low-level operations can provide a noticeable performance benefit, but fixing the inefficient high-level algorithms is how you can turn minutes into microseconds.
Re: description of Peg Revision Algorithm is incomplete
On Mon, 2010-03-29 at 12:07 -0400, Julian Foad wrote: Some possible interpretations are * Find the repository URL of './some/deep/file.c', and [...] I'll mention a related interpretation, which is to use the repository URL of the parent directory and append file.c to it. This is a little weird, and probably only makes sense as a fallback if the file doesn't have a URL (e.g. because it doesn't exist in the working copy), but it would let you do things like svn cp deleted-f...@1000 . I may have filed an issue about this somewhere, possibly in the days before peg-revs.
Re: Hook scripts start with an empty environment
Although I've always been aware of the design intent behind empty hook script environments, I'll echo Tim's complaint that it's sometimes inconvenient. The problem most commonly crops up in svn+ssh:// or file:// deployments where you want to run some action with user credentials: updating a bug database, updating a shared read-only working copy, updating a snapshot of the repository, etc. If the credentials are pointed to by environment variables (e.g. Kerberos tickets), then the operation fails. If they are pointed to by other means (e.g. AFS tokens), then the operation may succeed anyway. In some cases it may be possible to perform the action with server host credentials instead, but that is not always a good option.

The design intent behind clearing the environment has two bases: security and consistency. In evaluating the security basis, you have to consider the type of deployment:

* http:// and svn:// access, where the client has little or no control over the environment. In this case propagating the environment carries no risk because the environment is controlled by the admin.

* file:// and svn+ssh:// access where the calling user has unrestricted shell privileges and there is no setuid or setgid bit on the svn/svnserve binary. In this case propagating the environment carries no risk because the svn code is not executing with elevated privilege.

* file:// and svn+ssh:// access where the calling user has restricted shell access (can only run svn/svnserve for some reason) or there is a setuid or setgid bit on the svn or svnserve binary. In this case propagating the environment could open the door to unintended access.

It might be reasonable to have said from the start, "if you're in the third situation, then your hook scripts should clear their own environments," but we can't start saying that in release 1.7. We can detect a setuid or setgid bit, but we cannot detect a restricted-shell situation (such as when .ssh/authorized_keys contains a command directive), so we can't really intuit when it's safe to propagate the environment. It would be reasonable to make this controlled by repository configuration, if there's a convenient place to put that bit.
Re: Subversion in 2010
On Fri, 2010-01-08 at 15:31 -0500, Paul Querna wrote: Profile everything, be faster at everything There are smart people who will disagree with me on this, but I'm not sure the best tool for improving Subversion performance is a profiler. Historically a lot of our performance issues have come from algorithmic inefficiencies, particularly from code in libsvn_repos. You can get maybe a 10-30% reduction in time on these operations by trying to optimize the innermost bit of code which gets repeated over and over again, but in some cases you can get an order of magnitude reduction in time by fixing the algorithms. It's been too long for me to remember any specific examples, unfortunately.
Re: Subversion in 2010
On Wed, 2010-01-06 at 21:26 -0500, Mark Mielke wrote: There is a race between the pull and push whereby somebody who pushes before I pull will cause my push to fail, but we generally consider this a good thing as it allows us to analyze the change and determine whether additional testing is required before we try to submit pull and push again. I've certainly heard this argument before, and I think it makes sense for some projects. However, I've heard from some people that they work on projects where they would never be able to get any work done (that is, they would never be able to commit anything in a reasonable amount of time) if they always had to pull before pushing and hope that no one else pushes in the mean time. Those projects are simply too active to support a pull, test/analyze, and push development model. In some cases this is because the project has been defined to be larger than it really needs to be. For instance, the ASF repository would be pretty frustrating to use if you always had to be up to date before committing, but it's easy to see how the ASF might have created separate repositories for each project instead of what it did. In other cases the project is not so easily subdivided.
Re: Subversion in 2010
On Mon, 2010-01-04 at 11:31 -0500, C. Michael Pilato wrote: To be a compelling replacement for git/Mercurial, perhaps?

That seems tough. The major architectural differences between git/Mercurial/Bazaar and Subversion are:

* No commitment to mixed-revision working copies.
* Full history of at least one branch is generally stored on clients.
* DVCS workflow support.

For small projects and a certain class of developers, these can be huge advantages. For huge projects and a different class of developers, these can be hindrances. (See also http://svn.haxx.se/dev/archive-2008-04/1020.shtml)