Re: Master passphrase approach, authn storage, cobwebs in C-Mike's head, ...

2012-04-06 Thread Greg Hudson
On 04/06/2012 10:47 AM, Greg Stein wrote:
 Correct. Still useful, but even if memory is compromised, the SHA1 is
 not reversible. The original MP cannot be recovered for other uses.

Just as a reminder, SHA-1 is not recommended for use in new applications
at this time (http://csrc.nist.gov/groups/ST/hash/policy.html).


Re: Master passphrase approach, authn storage, cobwebs in C-Mike's head, ...

2012-04-06 Thread Greg Hudson
On 04/06/2012 10:55 AM, Greg Stein wrote:
 In other words, changing the master passphrase only requires decrypting
 and re-encrypting one 256-bit encryption key, not the whole credentials
 store.

 PBKDF2 is in the current design to make dict attacks computationally
 impossible. Assuming we keep that, then the above value would be fed
 in as the secret to PBKDF2, rather than MP or sha1(MP) ?

If I understand you correctly, that wouldn't make sense.  PBKDF2 is
designed to provide some resistance against offline dictionary attacks
against a weak secret, at the cost of computational power for legitimate
users.  If you have a strong secret, there's no point in running it
through PBKDF2.

Under the suggested architecture, you'd use PBKDF2(MP) to decrypt the
master key, and then use the master key to decrypt the individual passwords.
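
As a concrete sketch of that layout (illustrative only; Python for brevity,
assuming the third-party "cryptography" package, and every name here is
hypothetical rather than proposed Subversion code):

    import os, hashlib
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def create_store(master_passphrase):
        salt, nonce = os.urandom(16), os.urandom(12)
        # Slow string-to-key step; only this key-encrypting key depends on the MP.
        kek = hashlib.pbkdf2_hmac("sha256", master_passphrase.encode(), salt, 100_000)
        master_key = os.urandom(32)   # random 256-bit key that encrypts the passwords
        wrapped = AESGCM(kek).encrypt(nonce, master_key, None)
        return {"salt": salt, "nonce": nonce, "wrapped_key": wrapped}

    def change_passphrase(store, old_mp, new_mp):
        old_kek = hashlib.pbkdf2_hmac("sha256", old_mp.encode(), store["salt"], 100_000)
        master_key = AESGCM(old_kek).decrypt(store["nonce"], store["wrapped_key"], None)
        # Re-wrap only the 256-bit master key; the stored credentials stay untouched.
        store["salt"], store["nonce"] = os.urandom(16), os.urandom(12)
        new_kek = hashlib.pbkdf2_hmac("sha256", new_mp.encode(), store["salt"], 100_000)
        store["wrapped_key"] = AESGCM(new_kek).encrypt(store["nonce"], master_key, None)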

I also want to caution that PBKDF2 does not provide strong protection
against offline dictionary attacks.  Most cryptographic methods provide
exponential protection--I do a little bit more work to make you do twice
as much work.  PBKDF2 provides only linear protection--I do twice as
much work to make you do twice as much work.  It does not make
dictionary attacks impossible in the same sense that AES-128 makes
decryption without knowing the key impossible.
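
The arithmetic, with made-up but plausible numbers:

    iterations = 100_000               # illustrative PBKDF2 work factor
    dictionary = 10 ** 8               # candidate passphrases an attacker tries
    defender_work = iterations                  # one derivation per unlock
    attacker_work = iterations * dictionary     # one derivation per guess
    # Doubling 'iterations' doubles both figures; contrast with adding one
    # bit to an AES key, which doubles only the attacker's work.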

If a system can be designed to prevent offline dictionary attacks
entirely, that's much better.  But for this application, that's probably
impossible, since it's easy to distinguish a valid result (a password,
which will be printable ASCII) from garbage.


Re: Master passphrase approach, authn storage, cobwebs in C-Mike's head, ...

2012-04-06 Thread Greg Hudson
On 04/06/2012 01:44 PM, Justin Erenkrantz wrote:
 On Fri, Apr 6, 2012 at 8:09 AM, Greg Hudson ghud...@mit.edu wrote:
 I also want to caution that PBKDF2 does not provide strong protection
 against offline dictionary attacks.  Most cryptographic methods provide
 exponential protection--I do a little bit more work to make you do twice
 as much work.  PBKDF2 provides only linear protection--I do twice as
 much work to make you do twice as much work.  It does not make
 dictionary attacks impossible in the same sense that AES-128 makes
 decryption without knowing the key impossible.
 
 Is it worth looking at scrypt[1] instead of PBKDF2?  -- justin

Possibly.  It depends on whether you care about things like NIST review
(PBKDF2 is recommended in NIST SP 800-132) versus the theoretical
advantages of a less heavily scrutinized algorithm.  That's always a
tough choice.

The fundamental nature of scrypt isn't different from the fundamental
nature of PBKDF2; both seek to add a fixed multiplier to the cost of
both the legitimate user and the attacker.  scrypt is designed to make
it more difficult to use massively parallel hardware to mount the
attack, by requiring more memory (if I skimmed the paper correctly).


Re: [Issue 4145] Master passphrase and encrypted credentials cache

2012-03-26 Thread Greg Hudson
On 03/26/2012 09:00 AM, C. Michael Pilato wrote:
 The on-disk cache will contain everything it does today where
 plaintext caching is enabled, save that the password won't be
 plaintext, and there will be a bit of known encrypted text (for
 passphrase validation).

Is it important to be able to locally validate the passphrase?  That
property intrinsically enables offline dictionary attacks.

 We'd need to pull in additional dependencies that have freely 
 available implementations on all our supported platforms.
 Blowfish, 3DES, or somesuch.

Unfortunately, there's more complexity in an encrypted password store
than you probably anticipated, and it's definitely possible to lose
some or most of your intended security properties if you get it wrong.

The choice of best cipher algorithm today is very simple (AES,
although you'll have to pick the key size from 128/192/256 bits), but
you do need to decide whether you want to be cipher-agile.  Basically,
if AES becomes a weak choice down the road (and it probably will,
though it could be decades), is it better to be able to swap out the
algorithm inside the password storage system, or better to just plan
to swap out the system entirely for a redesigned one?  Either position
is defensible.

You'll need to pick a function to map a passphrase to a crypto key.
If you do a bad job, it will become easier to brute-force search for
keys because your key distribution won't be even.  To ensure even
distribution, you typically need to use a hash function, which is an
added dependency.  String-to-key functions are also often deliberately
slow, to make offline dictionary attacks harder.  PBKDF2 (RFC 2898) is
a reasonable choice here, and is implemented in some crypto libraries.
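
A minimal sketch of such a string-to-key step, using only the Python
standard library (the iteration count and salt size are illustrative
guesses, not recommendations):

    import os, hashlib

    def passphrase_to_key(passphrase, salt):
        # Salted, iterated derivation; tune the iteration count per platform.
        return hashlib.pbkdf2_hmac("sha256", passphrase.encode("utf-8"),
                                   salt, 100_000, dklen=32)

    salt = os.urandom(16)      # stored in the clear alongside the ciphertext
    key = passphrase_to_key("correct horse battery staple", salt)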

You'll need to pick an appropriate cipher mode.  If you simply use ECB
(where you chunk the plaintext up into blocks and encrypt each block
with the key), it will become easy to tell which passwords, or parts
of passwords, are the same as which others.  Maybe not a critical
flaw, but certainly an avoidable one.  If you use CTR (where you
encrypt counter values with the key and XOR the result with the
password), you'll need to make sure that counter values are never
reused, or it will become easy to recover passwords with the key.  CBC
with a random initialization vector is also an option.

If you don't use CTR mode, you'll need to pick a reversible padding
function for the plaintext so that it matches a multiple of the
cipher's block size.  This is pretty simple.
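
A sketch covering the last two points--CBC with a random IV plus a
reversible (PKCS#7-style) padding function--again in Python, assuming a
recent version of the third-party "cryptography" package; none of this is
proposed Subversion code:

    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def pad(data, block=16):
        n = block - len(data) % block       # always pad, even on a block boundary
        return data + bytes([n]) * n

    def unpad(data):
        return data[:-data[-1]]

    def encrypt_password(key, password):
        iv = os.urandom(16)                 # fresh random IV for every encryption
        enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
        return iv + enc.update(pad(password)) + enc.finalize()

    def decrypt_password(key, blob):
        iv, ct = blob[:16], blob[16:]
        dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
        return unpad(dec.update(ct) + dec.finalize())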

It's probably wise to look at what another implementation does.  I'm
not sure what password store implementations have made obvious
mistakes and which ones haven't; I wish I had a better reference to
give, but I don't know the state of the art for this particular
application of crypto as well as others.


Re: Why do we check the base checksum so often?

2012-02-04 Thread Greg Hudson
On 02/04/2012 08:02 PM, Hyrum K Wright wrote:
 I don't know if apr has a sha256 implementation, but it wouldn't be hard to 
 find one.

I'll point out that we're nearing the end of a selection process for
SHA-3, with a winner expected to be announced some time this year.  The
winner may wind up being faster than SHA-256 or even SHA-1.  (For
instance, one of the five finalists, Skein, is performance-competitive
with SHA-1 according to numbers in a paper by its authors:
http://www.skein-hash.info/sites/default/files/skein1.3.pdf)

It sounds like wc-ng is somewhat hash-agile by virtue of the format
number and upgrade process.  It sounds like Ev2 may not be very
hash-agile.  If so, it's probably a bad idea to carve SHA-1 in stone, as
it is already showing weaknesses.  SHA-256 is likely to have a much
longer useful lifetime, SHA-3 even more so.

In a pinch, SHA-256 implementations can be pretty small; the one I have
on hand is about 200 lines of code.


Re: [RFC] ra_svn::make_nonce: how to cope with entropy shortages?

2011-11-03 Thread Greg Hudson
On 11/03/2011 01:44 AM, Jonathan Nieder wrote:
 What do you think?  Is forcing !APR_HAS_RANDOM and just using
 apr_time_now() as Debian currently does safe, or does it expose users
 to a security risk?

I suspect it makes the server vulnerable to a replay attack.

The right answer is to use /dev/urandom.  Using /dev/random has highly
questionable advantages over using /dev/urandom, and it's unfortunate
that APR only provides an interface to one and not the other.

A longer analysis: if a system has collected even a small amount of
entropy (128 bits) relative to what an attacker can guess since boot, it
can generate an infinite amount of pseudo-random data without risk of
vulnerability, if it uses a suitable PRNG function.  The actual dangers
are that (1) the system has not accumulated enough entropy, and maybe we
should wait until it has, or (2) the system has a bad PRNG function.
Using /dev/random does not protect against either threat very effectively.

As for the first threat, it's very difficult to mitigate because a
system cannot generally estimate its entropy very well.  It throws
possible entropy events into a pool and mixes them together, but it
doesn't have a very good measure of how guessable those events were.
PRNG algorithms like Fortuna seek to guarantee that the PRNG will
eventually reach an unguessable state (by ensuring that more and more
possible entropy is used each time the internal state is updated, until
eventually an update happens that the attacker can't brute-force), but
they can't tell you when they've reached that point.

As for the second threat, at least on Linux, /dev/random output still
comes from the PRNG.  It just keeps an internal counter and blocks when
its estimate of the remaining input entropy runs out.  (This
is exceedingly silly, because a PRNG doesn't use up input entropy as
it generates results; either it has unguessable internal state or it
doesn't.)  An application can only protect against a poor system PRNG by
implementing its own generator, and it's far simpler to declare it the
system's responsibility to fix its PRNG if there's a security issue
associated with it.
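
To make the contrast concrete, here is a rough Python illustration of the
two nonce strategies (names are hypothetical; os.urandom() has
/dev/urandom semantics on Linux and never blocks):

    import os, time

    def weak_nonce():
        # Roughly what the !APR_HAS_RANDOM fallback amounts to: guessable
        # within a small window, which invites replay.
        return int(time.time() * 1_000_000).to_bytes(8, "big")

    def strong_nonce():
        # Kernel PRNG output; unguessable once the system has ever
        # collected enough entropy.
        return os.urandom(16)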


Re: [RFC] ra_svn::make_nonce: how to cope with entropy shortages?

2011-11-03 Thread Greg Hudson
On 11/03/2011 05:10 PM, Jonathan Nieder wrote:
 Why would that be?  When someone dumps in 20 bits of data from a
 strong, in-hardware, random number source, even if the PRNG is utterly
 stupid, it can have an unguessable 20 bits of internal state.  After
 reading enough random numbers, I will have enough information to guess
 the seed and learn what comes next.

If you want to attack a PRNG, you need very little of the output
state--only enough to distinguish between the possible values of the
generator seed.  What you do need is for the generator seed to be
partially guessable; otherwise, you will be trying to brute-force a
128-bit or 256-bit seed, which is impractical.

If I somehow know the initial generator state, and you reseed your
generator with only 20 unguessable bits, I will be able to determine
those bits using 20 bits of output and 2^20 effort (which is easy), and
then I will know all of the generator state again.  However, if you
reseed with enough unguessable bits that I can't brute-force them, it
doesn't matter how much output I see; I will never again be able to
determine the internal state.
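
A toy illustration of that 2^20 search, with a deliberately simple
stand-in generator (the argument doesn't depend on the generator's
quality, only on the small reseed):

    import hashlib

    def output(state, counter):
        return hashlib.sha256(state + counter.to_bytes(4, "big")).digest()

    known = b"generator state the attacker already knows"
    fresh_bits = 0x7A3F1                      # the 20 unguessable reseed bits
    observed = output(known + fresh_bits.to_bytes(3, "big"), 0)

    for guess in range(2 ** 20):              # about a million hashes: easy
        if output(known + guess.to_bytes(3, "big"), 0) == observed:
            break                             # guess now equals fresh_bits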

For the Fortuna generator, for instance, if I discover a way to
determine the generator state solely by observing the output, then I
will also have discovered a plaintext recovery attack against AES-256.

For more, see chapter 9 of _Cryptography Engineering_.

 A good PRNG helps mitigate that somewhat

More than somewhat.  Any PRNG which doesn't have the above properties
in its generator is insecure for any cryptographic purpose, and would be
considered a security bug in the operating system.

In another message, Peter Samuelson wrote:
 apr_time_now() has microsecond resolution.

It has microsecond precision but not necessarily microsecond accuracy.
For instance,
http://fixunix.com/kernel/37-gettimeofday-resolution-linux.html
suggests that two requests arriving within a 10ms window could get the
same nonce.


Re: Thoughts about issue #3625 (shelving)

2011-09-09 Thread Greg Hudson
On Fri, 2011-09-09 at 08:09 -0400, Greg Stein wrote:
 Greg Hudson said this is more akin to git stash than branches. I
 haven't used git's stashes to see how it actually differs from
 branches. I guess it is simply that changing branches leaves local
 mods, rather than stashing pseudo-reverts the local mods.

* Branches record tree states, while stashes record changesets as
applied to the working copy and index.  You can "git stash", "git checkout
branchname", and "git stash pop" the changeset such that it is now
applied against the new branch, if it applies cleanly.  You can do
similar things with branches using rebase, but the sequence of
operations would be different and more complicated.

* Stashes are a working copy feature, and aren't part of the history
model.  This isn't necessarily an interesting distinction for us, but it
has some consequences within the universe of git--a subsidiary
repository's git fetch won't retrieve stashes, they won't be in the
target space of commit specifiers, you don't have to create commit
messages for them, etc..

Stashes don't make git more expressive than local branches and rebase,
but in some ways it's a useful UI concept to keep them separate.

 Mercurial calls it shelving.

Aha.  I'll note that shelving isn't a core feature of Mercurial but an
extension.  Even if there are aliases so the command is accessible via
both names, the feature needs to have a primary name (which will be how
it's documented in the book, etc.).




Re: Thoughts about issue #3625 (shelving)

2011-09-08 Thread Greg Hudson
On Thu, 2011-09-08 at 23:43 -0400, Greg Stein wrote:
 I've had to use git lately, and our shelves could almost look like
 git's branches. Swap around among them based on what you're doing at
 the time.

I think this is closer to git's stash feature than git branches.  In
fact, I was thinking of jumping in and asking why this was being called
something gratuitously different.




Re: RE: Proxy authentication with Negotiate uses wrong host

2011-08-25 Thread Greg Hudson
On Wed, 2011-08-24 at 07:42 -0400, 1983-01...@gmx.net wrote:
 Are you refering to sole Kerberos or are you just concerned about
 transport encryption? Your statement somewhat irritates me.
 Given that the HTTP traffic cannot be securely wrapped into the GSS
 content and nor the SASL QOP can be set (like for LDAP), I would
 neglect that and still say TLS is not of your concern but of mine or
 the users in general.

Any authentication-only mechanism used over an insecure channel is
vulnerable to MITM attacks which preserve the authentication and change
the data.  Of course, this applies to HTTP basic and digest over raw
HTTP just as much as it does to negotiate, so perhaps it doesn't make
sense to restrict negotiate auth to HTTPS only on this basis alone.

A further concern with HTTP negotiate is that it is scoped to the TCP
connection and not to a single HTTP request.  Ignorant proxies may
combine TCP connections for multiple users' requests and inadvertently
authenticate one user's requests with another's credentials.  I may be
wrong, but I believe this is the concern which leads implementations to
restrict NTLM to HTTPS.  Switching from NTLM to Kerberos does not
mitigate this concern at all.  If there are other vulnerabilities in
NTLM which don't presuppose an MITM attack, perhaps I'm wrong.




RE: Proxy authentication with Negotiate uses wrong host

2011-08-24 Thread Greg Hudson
On Wed, 2011-08-24 at 05:52 -0400, Bert Huijben wrote:
 Then somebody added Kerberos support to neon, but the api wasn't
 updated to allow different behavior for the specific implementations.

Kerberos via HTTP negotiate is also insecure when not used over HTTPS.
In HTTP negotiate, the GSSAPI mechanism (Kerberos) isn't used to protect
the data stream, only to authenticate.  So you still need a secure
channel.

(Also, negotiate auth does no channel binding, which means Kerberos
provides no additional protection against MITM attacks on the TLS
channel.  That just means it's still important for the client to verify
the server cert.  I've heard that Microsoft has some extensions to RFC
4559 to do channel binding, but I don't know any details and Neon almost
certainly doesn't have any support for it.)





Re: Did we have ^/clients?

2011-08-16 Thread Greg Hudson
On Tue, 2011-08-16 at 14:14 -0400, Daniel Shahaf wrote:
 r6881 implies that a ^/clients directory existed until r6880:
 https://svn.apache.org/viewvc/subversion/README?r1=846955&r2=846954&pathrev=846955&diff_format=f
 
 kfogel on IRC recalls it having existed.

I remember svn (the command) living under subversion/clients/svn and
being moved to subversion/svn.

If there's no evidence of this in our Subversion history, maybe the move
happened back when we were still using CVS.  (I don't believe we
preserved our CVS history when we started self-hosting, because cvs2svn
was a difficult and not-yet-solved problem.)




Re: It's time to fix Subversion Merge

2011-07-11 Thread Greg Hudson
On Mon, 2011-07-11 at 12:48 -0400, Mark Phippard wrote:
 2. Subversion does not handle move/rename well (tree conflicts)
[...]
 When this problem was first approached (before we came up
 with tree conflicts) it hit a brick wall where it seemed a new
 repository design was needed:

It's worth considering that git has a reputation for good merge support
even though it has no commit-time copy/rename history whatsoever in its
history model.  By contrast, bzr paid a lot of attention to container
history and merge support in the face of tree reorgs, and it clearly
isn't as much of a killer feature as its designers had expected
(http://www.markshuttleworth.com/archives/123).

So, one possible way forward is to decide that copy history is just a
hint for svn log and that merging should ignore it.




Re: [PATCH] Fix for issue 3813

2011-06-22 Thread Greg Hudson
On Wed, 2011-06-22 at 02:29 -0400, Daniel Shahaf wrote:
 From looking at the code, svn_io_open_unique_file3() would force the
 file to have a mode of 0600|umask() instead of just 0600

The umask removes file permissions from the mode argument to open(); it
doesn't add permissions.  (Unless there's something unusual about this
code.)
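
A quick POSIX illustration (Python; the file name is hypothetical):

    import os, stat, tempfile

    os.umask(0o022)                                  # a typical umask
    path = os.path.join(tempfile.mkdtemp(), "unique-file")
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)   # request rw-------
    print(oct(stat.S_IMODE(os.fstat(fd).st_mode)))   # prints 0o600: the umask
    os.close(fd)                                     # had nothing left to clear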




Re: diff wish

2011-06-15 Thread Greg Hudson
On Wed, 2011-06-15 at 09:38 -0400, Johan Corveleyn wrote:
 But I don't like the hand-waving discussion that it will always be
 superior, period. That's just not true. And it would be a big mistake,
 IMHO, to only support a heuristic diff.

If it's a big mistake to use a heuristic diff by default, then adding
options to change the diff algorithm will not mitigate this mistake.

Similarly, adding options to support a heuristic diff as not-the-default
is almost completely useless.

I know from experience that it's very easy to stare at a problem for
long enough to convince yourself that other people care about it as much
as you do, but in reality, to a very good approximation, nobody wants to
play around with diff algorithm options.  There are probably a few dozen
people out there who have configured git diff to use --patience by
default and like it, but in the scheme of things, it's dead code.

Options come at a cost in code complexity and documentation bulk.
Supporting options for the sake of a very small fraction of users,
without strong evidence of a compelling need for those users, is not the
right tradeoff for a code base.




Re: diff wish

2011-06-15 Thread Greg Hudson
On Wed, 2011-06-15 at 11:30 -0400, Johan Corveleyn wrote:
 Okay, I guess we should then also rip out --ignore-space-change and
 --ignore-eol-style, and perhaps --show-c-function. Or, if it's
 preferred that ignore-space-change and ignore-eol-style be used by
 default (because humans are normally not interested in changes in
 amount of whitespace), we should use those options by default, and
 not provide an option to disable them. Fine by me.

Those are not options for determining the diff algorithm.  They are
options for preprocessing the diff inputs or postprocessing the output.
Although they're probably only used by a small minority of users,
there's pretty strong evidence of a compelling need for them.




Re: svn commit: r1136114 - /subversion/trunk/configure.ac

2011-06-15 Thread Greg Hudson
On Wed, 2011-06-15 at 13:28 -0400, Philip Martin wrote:
 Do they all support -s?

cmp -s is one of the most portable Unix command invocations in existence
(from general knowledge; I can't give a reference).




Re: Improvements to diff3 (merge) performance

2011-06-13 Thread Greg Hudson
On Mon, 2011-06-13 at 07:00 -0400, Morten Kloster wrote:
 I assume he has discussed this elsewhere in more detail? The link
 you provided says very little about it (and the ONLY hit for implicit
 cherrypicking on Google was your post :-).

Yes, but I'm not sure where any more, unfortunately.  Possibly here:
http://lists.zooko.com/pipermail/revctrl/ but that's a big archive to
look through.

Complicating matters, Codeville merge operates on the entire history of
the two nodes, rather than just a common base.

 As mentioned above, my original proposal was somewhat more
 aggressive than strictly necessary for my purposes.

I think if you limit the merging to strictly larger changes between sync
points, the false negative rate shouldn't go up too much.

  Also, I think
 the user should be allowed to specify how aggressive the merge
 algorithm should be as an option.

Perhaps, but most users aren't going to want to fiddle with merge
options, so the onus is still on the system to pick a good default.  (It
does help if the options immediately make sense, which they do in this
proposal.  Options like git's --patience and --strategy octopus are
especially unlikely to be used productively, I would think.)





Re: Improvements to diff3 (merge) performance

2011-06-12 Thread Greg Hudson
My executive summary of your post is that you want diff3 to try to merge
related, but not identical, changes occurring between a pair of sync
points.  I'm wary about this for two reasons.

First, the benefit appears to arise chiefly for what Bram Cohen calls
"implicit cherrypicking" use cases--that is, cases where a change is
merged and then merged again together with other changes.  After
extensive research, Bram eventually concluded that trying to support
this is a bad idea (http://bramcohen.livejournal.com/52148.html).  I
tend to think that a merge algorithm should not increase its false
negative rate for the benefit of implicit cherrypicking.

Second, I can see a lot of potential for incorrect merges based on
superficial similarities.  For instance, if you and I both add code
between the same two sync points, and the last line of my change is the
same as the first line of yours (perhaps a blank line), that could be
enough to unambiguously determine an ordering.  Perhaps both of our code
additions do the same thing in different ways, and now it gets done
twice (which is almost never desirable).  Certainly, the existing
algorithm can produce incorrect merges too, but my intuition says that
the practical likelihood would become much higher.

Of course, I could be off base.  All merge algorithms are heuristic, and
it would take a lot of research to really compare the efficacy of one
against another.  You need a cost function to determine the tradeoff
between the false negative rate and the false positive rate, and you
also need to measure how any given algorithmic change affects the false
negative and false positive rates in practice.  Both of these seem
really hard.

It would definitely affect my opinion if I learned that the three-way
merge algorithms in other popular version control systems worked this
way, or if I learned that Subversion was more restrictive than, say, GNU
diff3.



Re: strange error message

2011-05-18 Thread Greg Hudson
On Wed, 2011-05-18 at 14:24 -0400, Stefan Küng wrote:
 the not to all point... just doesn't sound right.

It's a split infinitive, which doesn't make it necessarily bad English
but can make it sound wrong.  "Not to point to the same repository"
would be more concise and just as precise, in my opinion.




Re: Why do we include debug symbols in !maintainer_mode?

2011-04-14 Thread Greg Hudson
On Thu, 2011-04-14 at 14:25 -0400, Philip Martin wrote:
 I believe it is a GNU standard.  Debug symbols can be used with an
 optimised build although it is obviously easier to debug without
 optimisation

More specifically: stepping through a -g -O2 executable is pretty
painful, but you can still usually get a decent stack trace from one.




Re: Is the svn:// protocol secure when encrypted via SASL?

2011-02-21 Thread Greg Hudson
On Mon, 2011-02-21 at 14:48 -0500, Keith Palmer Jr. wrote:
 Nothing in what you just copy-pasted indicates whether it's *the
 actual data stream* that's being encrypted, or just the
 *authentication*. I need to know if the checked-out files that are
 being transferred are encrypted or not. 

The SASL security layer refers to protection of the actual data
stream.  Encryption of the authentication isn't really a meaningful
concept in SASL parlance; mechanisms always perform authentication steps
as securely as they are able.




Re: Deltifying directories on the server

2011-02-01 Thread Greg Hudson
On Tue, 2011-02-01 at 10:29 -0500, C. Michael Pilato wrote:
 I can only really speak for the BDB side of things, but... what he said.

I'll elaborate a little bit.  API issues aside, we're used to putting
artifacts from different versions in different places.  More so in FSFS,
where it was baked into the initial architecture, but also in BDB for
the most part.

The most efficient storage for large directories which frequently change
by small deltas would be some kind of multi-rooted B-tree.  To do that
efficiently (that is, without scattering each tiny change into a
separate disk block, requiring lots and lots of opens/seeks/reads),
you'd want to put artifacts from different versions of a directory all
in the same place.  You might be able to arrange it so that modifying a
directory is an append-only operation, avoiding the need for a lot of
copying, but you'd still want a place to append to for each directory,
which isn't how FSFS or BDB works.

So, I'm not sure we can ever have efficient handling of this use case
without designing a completely new back end--which wouldn't be a
terrible idea if someone's up to it.





Re: svn commit: r1064168 - in /subversion/trunk/subversion/include/private: svn_eol_private.h svn_fs_util.h svn_mergeinfo_private.h svn_opt_private.h svn_sqlite.h svn_wc_private.h

2011-01-27 Thread Greg Hudson
On Thu, 2011-01-27 at 21:46 -0500, Senthil Kumaran S wrote:
 A NULL does mean '\0' or (void *) 0x. I also referred this -
 http://en.wikipedia.org/wiki/Null_character which says the same when
 referring to NULL termination of a string, except for one place where
 it says 'NUL' is an abbreviation for NULL character -
 http://en.wikipedia.org/wiki/NUL

NULL (all caps) is a C preprocessor constant used to denote a null
pointer.

NUL (all caps) is sometimes used as an abbreviation for the null
character.

Null (uncapitalized except as appropriate for beginning a sentence) is
an noun or adjective which can be used in a variety of contexts.  You
will note that null is never written in all caps in the Wikipedia
articles you referenced.

So a NULL-terminated string is a meaningless concept; you can't use a
null pointer to terminate a character string.  A NUL-terminated string
is meaningful, as is a null-terminated string or
null-character-terminated string.




Re: gpg-agent branch treats PGP passphrase as repository password?

2010-12-06 Thread Greg Hudson
On Mon, 2010-12-06 at 07:30 -0500, Daniel Shahaf wrote:
 Ideally, Subversion won't know the PGP passphrase.  (If it does, then
 a malicious libsvn_subr can compromise a private key.)

I think you're trying to solve a different problem here.  The goal is to
minimize typing of passwords without storing passwords in a fixed
medium, not to protect keys against malicious or broken Subversion code.

 For comparison, the ssh-agent protocol[1] only allows a client of the
 agent to authenticate himself (using the agent) to a third party, but
 does not have a Retrieve secret key option [2].  If we are to use PGP,
 could we find a solution with similar properties?

ssh-agent has special knowledge of the operations which will be
performed using the keying material.  PGP probably doesn't have any
interest in the operations Subversion needs to do with passwords.

PKCS#11 is the most commonly used general API for operations where an
application can use a key but isn't allowed to know what it is.  The
most useful free software implementation of PKCS#11 is probably NSS.  I
don't think we want to go there, though.




Re: [Proposed] Split very long messages by paragraph for easy translate

2010-11-13 Thread Greg Hudson
On Sat, 2010-11-13 at 10:31 -0500, Daniel Shahaf wrote:
 Sounds reasonable.
 
 What changes to the source code would be required?
 
 Do we just change
   N_("three\n\nparagraphs\n\nhere\n")
 to
   N_("three\n") N_("paragraphs\n") N_("here\n")

No, that would just result in evaluating gettext on the combined string,
same as before.  I can see two options for help strings in particular:

1. Rev svn_opt_subcommand_desc2_t to include an array of help strings
which are translated and displayed in sequence.

2. Change print_command_info2 to look at the help string and break it up
at certain boundaries (such as blank lines or numbered list entries)
before translating it.

(Mercurial is written in Python, so it has different constraints.)
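
A rough sketch of option 2, in Python only because it's compact (the real
change would be in print_command_info2's C code); gettext() stands in for
the _() macro:

    from gettext import gettext as _

    def print_help(help_text):
        # Translate and emit one paragraph at a time; an untranslated
        # paragraph simply falls through in the original language.
        for paragraph in help_text.split("\n\n"):
            print(_(paragraph) + "\n")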




Re: FSv2 (was: FREE Apache Subversion Meetup...)

2010-10-19 Thread Greg Hudson
On Tue, 2010-10-19 at 04:31 -0400, Greg Stein wrote:
 The FSFS backend was dropped in as a fait accompli.

A minor correction: ra_svn was dropped in as a fait accompli.  FSFS
was, as far as I remember, a pretty open process where I created a
design and Josh Pieper implemented it.  You can look at the commit
history of libsvn_fs_fs to see that, and I'm pretty sure that Josh and I
were working over the expected open channels (dev list and IRC) at the
time.




Re: svn commit: r1003986 [1/2] - in /subversion/trunk/subversion: libsvn_client/ libsvn_fs_base/ libsvn_fs_base/bdb/ libsvn_fs_fs/ libsvn_ra_local/ libsvn_ra_neon/ libsvn_ra_serf/ libsvn_ra_svn/ libsv

2010-10-04 Thread Greg Hudson
On Mon, 2010-10-04 at 06:14 -0400, Julian Foad wrote:
 The NULL macro is intended for use as a pointer.

Only when statically cast to the appropriate pointer type.  This happens
automatically in many contexts, such as assignments or prototyped
function parameters.  But it does not happen automatically for the
variable arguments of a stdarg function.

So apr_pstrcat(foo, bar, NULL) really is invalid C code.  It's not a
practical concern because common platforms use a single pointer
representation, but it's a fair warning for a compiler to give.

This message brought to you by Language Lawyers Inc.




Re: svn commit: r1003986 [1/2] - in /subversion/trunk/subversion: libsvn_client/ libsvn_fs_base/ libsvn_fs_base/bdb/ libsvn_fs_fs/ libsvn_ra_local/ libsvn_ra_neon/ libsvn_ra_serf/ libsvn_ra_svn/ libsv

2010-10-04 Thread Greg Hudson
On Mon, 2010-10-04 at 12:06 -0400, Julian Foad wrote:
 The issue at hand is when NULL is defined as an unadorned '0' *and* is
 passed to a variadic function such as apr_pstrcat.  If that's not a
 practical concern, that must be because the size and representation of
 (int)0 is the same as (char *)0.  If that is true on all supported
 platforms, then omitting the casts is a valid option; otherwise, we need
 them.

I don't think you have it quite right.

If NULL is defined as, say, (void *)0, then the compiler might not warn,
but it's still not correct C to pass a void * to a variadic function and
then read it out as some other pointer type.  void * is only guaranteed
to work if it's cast (implicitly or explicitly) to the correct pointer
type.  We don't see this problem as run-time errors because there is
typically no runtime transformation between void * and other pointer
types.

If NULL is defined as 0, it would be easy to imagine a practical problem
where you pass a 32-bit 0 value to a variadic function and then read it
back out as a 64-bit pointer type.  We don't see this problem in
practice because NULL is typically defined as 0L rather than just 0, and
sizeof(long) typically matches the size of pointers.

In terms of code hygiene, passing an uncasted NULL to a variadic
function isn't any worse than using calloc() to initialize pointer
fields in structures.  But if your compiler is giving you grief about
it, the best thing to do is probably to get used to casting in that
situation.

Phillip wrote:
 In C++ the cast should be more common since a conforming NULL cannot
 have a cast but, in the free software world at least, GCC uses
 compiler magic to make plain NULL work as a pointer without a cast.

Unlike C, C++ doesn't allow implicit casts from void * to other pointer
types, so defining NULL as (void *)0 would be pretty inconvenient there.




Re: Extensible changeset format proposal

2010-08-26 Thread Greg Hudson
On Thu, 2010-08-26 at 05:57 -0400, anatoly techtonik wrote:
 Don't you think it is time to design an extensible changeset format
 for exchanging information about changesets between systems?

Mostly for your entertainment, see:

http://www.red-bean.com/pipermail/changesets/2003-April/thread.html

There was an attempt to create a unified cross-system changeset format
seven years ago, but it didn't get very far.  However, the principals
are different today and more is known about the space of successful DVCS
tools.




Re: Looking to improve performance of svn annotate

2010-08-17 Thread Greg Hudson
On Tue, 2010-08-17 at 09:26 -0400, Johan Corveleyn wrote:
 Greg, could you explain a bit more what you mean with
 edit-stream-style binary diffs, vs. the binary deltas we have now?
 Could you perhaps give an example similar to Julian's? Wouldn't you
 have the same problem with pieces of the source text being copied
 out-of-order (100 bytes from the end/middle of the source being copied
 to the beginning of the target, followed by the rest of the source)?

Let's take a look at the differences between a line-based edit stream
diff (such as what you'd see in the output of diff -e) and a binary
delta as we have in Subversion.

The most obvious difference is that the former traffics in lines, rather
than arbitrary byte ranges, but the actual differences are much deeper.
A line-based diff can be modeled with the following instructions:

  * Copy the next N lines of source to target.
  * Skip the next N lines of source.
  * Copy the next N lines of new data to target.

After applying a diff like this, you can easily divide the target lines
into two categories: those which originated from the source, and those
which originated from the diff.  The division may not accurately
represent the intent of the change (there's the classic problem of the
mis-attributed close brace, for instance; see
http://bramcohen.livejournal.com/73318.html), but it's typically pretty
close.
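
A small sketch (illustrative Python, not tied to any existing diff code)
of applying that three-instruction edit stream while tagging each target
line with its origin:

    def apply_edit_stream(source_lines, instructions):
        target, pos = [], 0
        for op, arg in instructions:
            if op == "copy":        # copy the next `arg` source lines
                target += [(line, "source") for line in source_lines[pos:pos + arg]]
                pos += arg
            elif op == "skip":      # skip the next `arg` source lines
                pos += arg
            elif op == "new":       # `arg` is a list of lines carried in the diff
                target += [(line, "diff") for line in arg]
        return target               # every target line is tagged with its origin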

Subversion binary deltas have a more flexible instruction set, more akin
to what you'd find in a compression algorithm.  The source and target
are chopped up into windows, and for each window you have:

  * Copy N bytes from offset O in the source window to target.
  * Copy N bytes from offset O in the target window to target.
  * Copy the next N bytes of new data to target.

There is no easy way to divide the target into source bytes and diff
bytes.  Certainly, you can tag which bytes were copied from the source
window, but that's meaningless.  Bytes which came from the source window
may have been rearranged by the diff; bytes which came from new data may
only have done so because of windowing.

The optimization idea is to create a new kind of diff (or more likely,
research an existing algorithm) which obeys the rules of the line-based
edit stream--no windowing, sequential access only into the source
stream--but traffics in bytes instead of lines.  With such a diff in
hand, we can divide the target bytes into source-origin and diff-origin,
and then, after splitting the target into lines, determine which lines
are tainted by diff-origin bytes and therefore should be viewed as
originating in the diff.




Re: Looking to improve performance of svn annotate

2010-08-12 Thread Greg Hudson
On Thu, 2010-08-12 at 10:57 -0400, Julian Foad wrote:
 I'm wary of embedding any client functionality in the server, but I
 guess it's worth considering if it would be that useful.  If so, let's
 take great care to ensure it's only lightly coupled to the core server
 logic.

Again, it's possible that binary diffs between sequential revisions
could be used for blame purposes (not the binary deltas we have now, but
edit-stream-style binary diffs), which would decouple the
line-processing logic from the server.

(But again, I haven't thought through the problem in enough detail to be
certain.)




Re: Looking to improve performance of svn annotate

2010-08-11 Thread Greg Hudson
On Wed, 2010-08-11 at 19:14 -0400, Johan Corveleyn wrote:
 I naively thought that the server, upon being called get_file_revs2,
 would just supply the deltas which it has already stored in the
 repository. I.e. that the deltas are just the native format in which
 the stuff is kept in the back-end FS, and the server wasn't doing much
 else but iterate through the relevant files, and extract the relevant
 bits.

The server doesn't have deltas between each revision and the next (or
previous).  Instead, it has skip-deltas which may point backward a
large number of revisions.  This allows any revision of a file to be
reconstructed in O(log(n)) delta applications, where n is the number of
file revisions, but it makes what the server has lying around even
less useful for blame output.
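
For a feel of the O(log(n)) bound, here is one common skip-delta
base-selection rule (an illustration, not the literal FSFS code): delta
against the version whose number is yours with the lowest set bit cleared.

    def delta_base(n):
        return n & (n - 1)             # 0b1111101000 -> 0b1111100000, etc.

    def reconstruction_chain(n):
        chain = []
        while n:
            chain.append(n)
            n = delta_base(n)
        return chain                   # length == popcount(n), i.e. O(log n)

    print(reconstruction_chain(1000))  # [1000, 992, 960, 896, 768, 512]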

It's probably best to think of the FS as a black box which can produce
any version of a file in reasonable time.  If you look at
svn_repos_get_file_revs2 in libsvn_repos/rev_hunt.c, you'll see the code
which produces deltas to send to the client, using
svn_fs_get_file_delta_stream.

The required code changes for this kind of optimization would be fairly
deep, I think.  You'd have to invent a new type of diffy delta
algorithm (either line-based or binary, but either way producing an edit
stream rather than acting like a compression algorithm), and then
parameterize a bunch of functions which produce deltas, and then have
the server-side code produce diffy deltas, and then have the client code
recognize when it's getting diffy deltas and behave more efficiently.

If the new diffy-delta algorithm isn't format-compatible with the
current encoding, you'd also need some protocol negotiation.




Re: Bikeshed: configuration override order

2010-08-10 Thread Greg Hudson
On Tue, 2010-08-10 at 14:24 -0400, C. Michael Pilato wrote:
 The foremost bit of client configuration that CollabNet's Subversion
 customers are demanding (besides auto-props, which I think we all agree on)
 is a way for the server to set a policy which dictates that clients may not
 use plaintext or other insecure password storage mechanisms.

I don't expect anyone to consider my opinion blocking, but I think this
is a questionable area for any kind of software to delve into.  I've
only seen this kind of client control in one other context (a branded
Jabber client), and never in an open source project. (*)

Lots and lots of clients are able to remember passwords: web browsers,
email clients, IM clients.  Lots of central IT organizations (MIT's
included) don't like this feature and recommend that users not use it.
Lots of users do it anyway.  I don't know of a single piece of
widely-used client software which allows the server to turn off password
memory.

(*) Actually, on consideration, there was some flap about the "okay to
print" flag in PDF documents, or something related to that.  I can't
remember how it turned out.




Re: Bikeshed: configuration override order

2010-08-07 Thread Greg Hudson
On Sat, 2010-08-07 at 07:58 -0400, Daniel Shahaf wrote:
 Stefan Küng wrote on Sat, Aug 07, 2010 at 12:59:26 +0200:
  On 07.08.2010 12:44, Daniel Shahaf wrote:
  If corporations want to have configuration override, fine.
 
  But I want a way to disable that completely.  I don't necessarily want to
  allow a random sourceforge repository to control my auto-props settings for
  a wc of that repository.
 
  Maybe a stupid question: why not?
 
 Why don't I let ezmlm configure my mailer's "use html?" setting?

I think he was asking for an answer specifically relating to auto-props,
not an answer about configuration in general.  There's not generally a
lot of room for individual disagreement about what auto-props should be
for a given project.

My thinking about repository configuration is that the use cases should
be divided into two categories:

  1. Stuff that isn't really client configuration at all, like
auto-props, should come from the repository instead, and should only
come from client configuration for compatibility.

  2. Stuff that is client configuration should only come from client
configuration.  Client control is not legitimate business for an open
source product, though it could be the business of a proprietary
value-added fork.

Note that there's no general extension of the config framework here, no
whitelisting or blacklisting, no override settings.  Invent a mechanism
for getting repository configuration from the server and apply it to the
specific use cases in (1), falling back to client configuration as a
legacy mechanism.




Re: Proposal: Change repository's UUID over RA layer

2010-08-06 Thread Greg Hudson
When I've mirrored repositories with the intent of keeping them in sync,
I've typically given them the same UUID.  I don't know if that has much
impact in practice, since I think working copies tend to stick to one of
the mirrors (either the RW master or the RO slave).

The philosophical question here isn't whether the ID is universally
unique but what it's identifying.  Is it identifying the repository
content or the container in which the content is held?




Re: Bug: svnserve fail to detect it is already running

2010-07-09 Thread Greg Hudson
On Fri, 2010-07-09 at 11:44 -0400, Stefan Sperling wrote:
 As far as I can tell there is little we can do to secure svnserve
 against this attack on Windows systems other than Server 2003,
 because APR won't let us set the SO_EXCLUSIVEADDR option.

That's okay, we don't want the SO_EXCLUSIVEADDR behavior.  We want the
default behavior under Windows, which corresponds to the SO_REUSEADDR
behavior under Unix.




Re: Suggestion: Transparent Branching

2010-07-07 Thread Greg Hudson
On Wed, 2010-07-07 at 11:44 -0400, Marco Jansen wrote:
 So therefor, what we would like to see is to be able to have a transparent
 branch: One which fetches updates from both branch and trunk, without having
 them listed as changes or triggering commits. In essence it's reading from
 two branches, where a last known revision of a file could be from either
 branch, and committing to one only when it has changes from this 'either'
 latest revision.

I'm not sure if this is a feature of any popular version control system.
What would happen if trunk changes didn't merge easily with the changes
on one or more transparent branches?




Re: Antwort: Re: ... Re: dangerous implementation of rep-sharing cache for fsfs

2010-07-01 Thread Greg Hudson
On Thu, 2010-07-01 at 08:56 -0400, michael.fe...@evonik.com wrote:
 I better already start to run for it, 
 when I ever approve the use of the current implementation of the 
 representation cache.

Here's what this says to me: it doesn't matter what the real risks are;
it only matters that the quantifiable mathematical risks I know about be
reduced to 0, regardless of the cost.

That's sometimes a rational attitude to take in the world of legalities
and politics.  In the world of engineering, it's not popular, which is
why hash-indexed storage and cryptography are used in a wide variety of
applications.  It's pointless to reduce the quantifiable risks from
2^-(many) to 0 when we know that the human factors and mechanical risks
are much larger.





Re: Antwort: Re: dangerous implementation of rep-sharing cache for fsfs

2010-06-25 Thread Greg Hudson
On Fri, 2010-06-25 at 08:45 -0400, michael.fe...@evonik.com wrote:
 I am actually more interested in finding reliable solution 
 instead of discussing mathematics and probabilities.

The discussion of probabilities affects whether it would be justifiable
to change Subversion to address hash collisions.

 1. You are comparing apples and oranges. 
 2. you can't balance the possibility of one error 
with the that of an other.

All systems have a probability of failure, resulting from both human and
mechanical elements within the system.  It may be difficult to estimate
precisely, but one can often establish a lower bound.  The question is
whether hash-indexed storage increases the probability of failure by a
non-negligible amount.

 It often results in something like:
   square_root( a_1 * (error_1 ^2) + a_2 * (error_2 ^2) + ...)

We're discussing failure rates, not margin of error propagation.
Failure rates propagate as 1 - ((1 - failure_1) * (1 - failure_2) * ...)
if the failure probabilities are independent.

If your system has a probability of failure of one in a million from
other factors, and we add in an independent failure probability of one
in 2^32 from hash-indexed storage, then the overall system failure
probability is one in 999767--that is, it doesn't change much.
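
The arithmetic behind that figure:

    other = 1e-6                     # failure probability from other factors
    sha1_collision = 2.0 ** -32      # added by hash-indexed storage
    combined = 1 - (1 - other) * (1 - sha1_collision)
    print(round(1 / combined))       # 999767 -- essentially unchanged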

 3. you overestimate the risk of undetected hardware faults.

I think you over-estimate the risk of hash-indexed storage collisions.

 There is no evidence that the hash values are
 equally distributed on the data sets, which is important for
 the use of hashing methods in data fetching.

A hash which had a substantially unequal distribution of hash values
among inputs would not be cryptographically sound.

 = 3,21*10^2427 sequences of Data of 1K size 
 represented by the same hash value.

First, SHA-1 is a 160-bit hash, not a 128-bit hash.  Second, the number
you calculated does not inform the probability of a collision.

If you have N samples, which are not specifically constructed as to
break SHA-1, then probability of a SHA-1 collision is roughly N^2 /
2^160 (see birthday paradox for more precision).  So, for example,
with 2^64 representations (1.8 * 10^19), there would be a roughly 2^-32
probability of a SHA-1 collision in the rep cache.  If you can construct
a system with close to a one in four billion probability of error from
other sources, kudos to you; if not, hash-indexed storage is not
perceptibly increasing your error rate.
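
The back-of-the-envelope numbers, for anyone who wants to check them:

    n = 2.0 ** 64                    # stored representations
    p = n * n / 2.0 ** 160           # rough birthday bound for a 160-bit hash
    print(p, 2.0 ** -32)             # both about 2.3e-10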




Re: Antwort: Re: dangerous implementation of rep-sharing cache for fsfs

2010-06-24 Thread Greg Hudson
On Thu, 2010-06-24 at 11:29 -0400, michael.fe...@evonik.com wrote:
 We must ensure that the data in the repository is, without any concerns, 
 the data we have once measured or written. 

You do realize that the probability of data corruption due to faulty
hardware is much, much more likely than the probability of corruption
due to a rep-sharing SHA-1 collision, right?




Re: Optimizing back-end: saving cycles or saving I/O (was: Re: [PATCH v2] Saving a few cycles, part 3/3)

2010-05-12 Thread Greg Hudson
On Wed, 2010-05-12 at 13:44 -0400, Hyrum K. Wright wrote:
 There may be other ways of caching this information, which would be great.

Maybe.  Caches add complexity, and too many layers of caching (disk, OS,
application) can actually reduce performance in some scenarios.

I would suggest getting a better understanding of why this operation is
slow.  Why is svn log opening each rev file ten times?  Is this
intrinsically necessary?  Going straight to optimizing the overused
low-level operations can provide a noticeable performance benefit, but
fixing the inefficient high-level algorithms is how you can turn minutes
into microseconds.




Re: description of Peg Revision Algorithm is incomplete

2010-03-29 Thread Greg Hudson
On Mon, 2010-03-29 at 12:07 -0400, Julian Foad wrote:
 Some possible interpretations are
 
   * Find the repository URL of './some/deep/file.c', and [...]

I'll mention a related interpretation, which is to use the repository
URL of the parent directory and append file.c to it.

This is a little weird, and probably only makes sense as a fallback if
the file doesn't have a URL (e.g. because it doesn't exist in the
working copy), but it would let you do things like svn cp
deleted-f...@1000 .

I may have filed an issue about this somewhere, possibly in the days
before peg-revs.




Re: Hook scripts start with an empty environment

2010-03-24 Thread Greg Hudson
Although I've always been aware of the design intent behind empty hook
script environments, I'll echo Tim's complaint that it's sometimes
inconvenient.  The problem most commonly crops up in svn+ssh:// or
file:// deployments where you want to run some action with user
credentials: updating a bug database, updating a shared read-only
working copy, updating a snapshot of the repository, etc..  If the
credentials are pointed to by environment variables (e.g. Kerberos
tickets), then the operation fails.  If they are pointed to by other
means (e.g. AFS tokens), then the operation may succeed anyway.

In some cases it may be possible to perform the action with server host
credentials instead, but that is not always a good option.

The design intent behind clearing the environment has two bases:
security and consistency.  In evaluating the security basis, you have to
consider the type of deployment:

  * http:// and svn:// access, where the client has little or no control
over the environment.  In this case propagating the environment carries
no risk because the environment is controlled by the admin.

  * file:// and svn+ssh:// access where the calling user has
unrestricted shell privileges and there is no setuid or setgid bit on
the svn/svnserve binary.  In this case propagating the environment
carries no risk because the svn code is not executing with elevated
privilege.

  * file:// and svn+ssh:// access where the calling user has restricted
shell access (can only run svn/svnserve for some reason) or there is a
setuid or setgid bit on the svn or svnserve binary.  In this case
propagating the environment could open the door to unintended access.

It might be reasonable to have said from the start, "if you're in the
third situation, then your hook scripts should clear their own
environments," but we can't start saying that in release 1.7.  We can
detect a setuid or setgid bit, but we cannot detect a restricted shell
situation (such as when .ssh/authorized_keys contains a command
directive), so we can't really intuit when it's safe to propagate the
environment.

It would be reasonable to make this controlled by repository
configuration, if there's a convenient place to put that bit.




Re: Subversion in 2010

2010-01-08 Thread Greg Hudson
On Fri, 2010-01-08 at 15:31 -0500, Paul Querna wrote:
 Profile everything, be faster at everything

There are smart people who will disagree with me on this, but I'm not
sure the best tool for improving Subversion performance is a profiler.
Historically a lot of our performance issues have come from algorithmic
inefficiencies, particularly from code in libsvn_repos.  You can get
maybe a 10-30% reduction in time on these operations by trying to
optimize the innermost bit of code which gets repeated over and over
again, but in some cases you can get an order of magnitude reduction in
time by fixing the algorithms.

It's been too long for me to remember any specific examples,
unfortunately.




Re: Subversion in 2010

2010-01-06 Thread Greg Hudson
On Wed, 2010-01-06 at 21:26 -0500, Mark Mielke wrote:
 There is a race between the pull 
 and push whereby somebody who pushes before I pull will cause my push to 
 fail, but we generally consider this a good thing as it allows us to 
 analyze the change and determine whether additional testing is required 
 before we try to submit pull and push again.

I've certainly heard this argument before, and I think it makes sense
for some projects.

However, I've heard from some people that they work on projects where
they would never be able to get any work done (that is, they would never
be able to commit anything in a reasonable amount of time) if they
always had to pull before pushing and hope that no one else pushes in
the meantime.  Those projects are simply too active to support a pull,
test/analyze, and push development model.

In some cases this is because the project has been defined to be
larger than it really needs to be.  For instance, the ASF repository
would be pretty frustrating to use if always you had to be up to date
before committing, but it's easy to see how the ASF might have created
separate repositories for each project instead of what it did.  In other
cases the project is not so easily subdivided.




Re: Subversion in 2010

2010-01-04 Thread Greg Hudson
On Mon, 2010-01-04 at 11:31 -0500, C. Michael Pilato wrote:
 To be a compelling replacement for git/Mercurial, perhaps?

That seems tough.  The major architectural differences between
git/Mercurial/Bazaar and Subversion are:

  * No commitment to mixed-revision working copies.
  * Full history of at least one branch is generally stored on clients.
  * DVCS workflow support.

For small projects and a certain class of developers, these can be huge
advantages.  For huge projects and a different class of developers,
these can be hindrances.

(See also http://svn.haxx.se/dev/archive-2008-04/1020.shtml)