On Tue, May 31, 2011 at 01:41:54AM +0300, Daniel Shahaf wrote:
> How would you handle a repository that contains the following
> nodes/fspaths:
> 
> /foo/bår    (in UTF-8)
> /foo/bår    (in latin1)
> 
> ?
> 
> 
> How would you handle a repository that contains:
> /foo/barÉ   (in latin1)
> /foo/barŠ   (in latin2)
> 
> ?

All the ISO-8859 (latin) encodings are single-byte encodings.
If paths in different ISO-8859 encodings entered the repository, it's
not possible to tell which encoding was intended for any given path.
The same bytes decode to different but valid strings of characters
under each of these encodings.
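
To illustrate the ambiguity, here is a small Python sketch (not
Subversion code; the path and variable names are made up for the
example). The same byte sequence decodes without error under both
latin1 and latin2, just to different characters:

```python
# The byte 0xA9 is assigned in both ISO-8859-1 and ISO-8859-2,
# but to different characters.
raw = b"/foo/bar\xa9"

as_latin1 = raw.decode("iso8859_1")  # 0xA9 is the copyright sign here
as_latin2 = raw.decode("iso8859_2")  # 0xA9 is S with caron here

# Both decodes succeed, so the bytes alone cannot tell us which
# encoding the committer actually used.
print(as_latin1)
print(as_latin2)
```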

In the first iteration of this feature I would simply assume one
user-specified source encoding and try to convert data that isn't
valid UTF-8 from that source encoding to UTF-8.
If multiple single-byte encodings are present, this means that some
characters will be wrong, but the repository will work again without
manual intervention. If multiple multi-byte encodings other than
UTF-8 are present, this approach can fail and might require manual
fixing (which is no worse than the current situation).
This could still be improved upon if necessary.
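
Roughly, the per-path logic I have in mind looks like the following
Python sketch. This is only an illustration under my own assumptions,
not the actual patch: to_utf8() and its parameters are hypothetical
names, and the real code would live in C inside Subversion.

```python
def to_utf8(path: bytes, source_encoding: str) -> bytes:
    """Return path as UTF-8 bytes.

    Paths that are already valid UTF-8 are returned unchanged.
    Anything else is assumed to be in source_encoding, which the
    user has to specify.
    """
    try:
        path.decode("utf-8")
        return path  # already valid UTF-8, leave it alone
    except UnicodeDecodeError:
        # Not UTF-8: reinterpret using the user-specified encoding.
        # With a multi-byte source encoding this decode can itself
        # raise UnicodeDecodeError; that is the case which would
        # still need manual fixing.
        return path.decode(source_encoding).encode("utf-8")


utf8_path = "/foo/b\u00e5r".encode("utf-8")
latin1_path = b"/foo/b\xe5r"

print(to_utf8(utf8_path, "iso8859_1") == utf8_path)
print(to_utf8(latin1_path, "iso8859_1") == utf8_path)
```

Note that a latin2 path run through this with latin1 as the source
encoding converts without error but ends up with the wrong character,
which is the "some characters will be wrong" case above.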
 
> > We should also make svnadmin verify complain if paths are not in UTF-8.
> 
> +1.
> 
> The validation that 'load' and 'commit' trigger is path_valid() in
> fs_loader.c.

Thanks for the hint. I'm now running tests on a patch for this.
