On Tue, May 31, 2011 at 01:41:54AM +0300, Daniel Shahaf wrote:
> How would you handle a repository that contains the following
> nodes/fspaths:
>
> /foo/bår (in UTF-8)
> /foo/bår (in latin1)
>
> ?
>
>
> How would you handle a repository that contains:
> /foo/barÉ (in latin1)
> /foo/barŠ (in latin2)
>
> ?

All the ISO-8859 (latin) encodings are single-byte encodings. If paths in
different ISO-8859 encodings have entered the repository, it is not
possible to know which encoding was intended: the same bytes decode to
different but valid strings of characters in each of them.

In the first iteration of this feature I would simply assume one
user-specified source encoding and try to convert data that isn't UTF-8
from the source encoding to UTF-8.

If multiple single-byte encodings are present, this means that some
characters will be wrong, but the repository will work again without
manual intervention. If multiple multi-byte encodings other than UTF-8
are present, this approach can fail and might require manual fixing
(no worse than the current situation).

This could still be improved upon if necessary.

> > We should also make svnadmin verify complain if paths are not in UTF-8.
>
> +1.
>
> The validation that 'load' and 'commit' trigger is path_valid() in
> fs_loader.c.

Thanks for the hint. I'm now running tests on a patch for this.
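
For illustration only, here is a minimal sketch of the fallback conversion
described in this mail: leave paths that are already valid UTF-8 alone, and
convert everything else from a single user-specified source encoding. It is
written in Python rather than as Subversion code, and the function name
to_utf8 and its source_encoding parameter are hypothetical, not part of any
existing API.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Return raw unchanged if it is valid UTF-8, else transcode it.

    Hypothetical helper: source_encoding is the single encoding the
    user told us to assume for all non-UTF-8 paths.
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, keep as-is
    except UnicodeDecodeError:
        # Not UTF-8: assume the user-specified encoding. This may raise
        # for multi-byte encodings if the guess is wrong; for single-byte
        # encodings it yields a valid but possibly wrong string.
        return raw.decode(source_encoding).encode("utf-8")


# The latin2 path from the example above, converted assuming latin1:
# byte 0xA9 is 'Š' in latin2 but '©' in latin1.
converted = to_utf8("/foo/barŠ".encode("iso-8859-2"), "latin-1")
print(converted.decode("utf-8"))
```

Note that a byte sequence which happens to be valid UTF-8 is always treated
as UTF-8 here, even if it was originally written in another encoding. The
sketch also does nothing about two different original paths ending up as the
same UTF-8 path after conversion.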