On 1/23/2012 10:38 AM, Philip Martin wrote:
Garret Wilson<gar...@globalmentor.com> writes:
On 1/23/2012 9:55 AM, Philip Martin wrote:
I thought you were proposing to write the code?
I'm fine with that as well. Looks like I would have to add a few lines
to decode UTF-8 (surely such code already exists in the Subversion
codebase somewhere) and change a few if(...){} statements. I should be
able to handle that. I would imagine it will take more effort on my
part to get permission to change the code than to actually write the
code itself.
The function receives a string of bytes, I think it's already in UTF-8.
The problem is that while Subversion has functions to validate UTF-8 it
doesn't have a system for extracting individual UTF-8 code points. At
present it only ever needs to extract the ASCII subset which is trivial.
Ah. Well, like I said---I would be happy to write the UTF-8 extraction
code. It would be worth it to me to get this functionality in; it would
be a fun exercise for me; it would be a good introduction to the
codebase for me; it's a small (very small), low-risk task; and the
Subversion codebase would be better off in the end. (I'm sure it can be
used elsewhere.) It's a win-win for everyone! :D
This is really a small thing. Here's an example in just a few lines:
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
Or see DecodeUTF8BytesToChar at
tidy.sourceforge.net/cgi-bin/lxr/source/src/utf8.c .
I would even be happy to exclude code points from the supplementary
planes (i.e. those above U+FFFF), if anyone is worried about the code
being too complicated.
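
To give a rough idea of the size of the task, here is a minimal sketch of
such an extraction routine, restricted to the BMP as suggested above. The
name utf8_next_bmp and its signature are made up for illustration; nothing
here is taken from the Subversion codebase, and a real patch would need to
follow the project's own types and conventions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an existing Subversion function.
 * Decodes one code point from the UTF-8 bytes at S (LEN bytes available).
 * On success, stores the code point in *CP and returns the number of
 * bytes consumed (1 to 3).  Returns 0 and leaves *CP untouched for
 * truncated, malformed, overlong or surrogate input, and for 4-byte
 * sequences (code points above U+FFFF), which this sketch deliberately
 * rejects. */
static size_t
utf8_next_bmp(const unsigned char *s, size_t len, uint32_t *cp)
{
  uint32_t c;

  if (len == 0)
    return 0;

  if (s[0] < 0x80)                      /* 0xxxxxxx: plain ASCII */
    {
      *cp = s[0];
      return 1;
    }

  if ((s[0] & 0xE0) == 0xC0)            /* 110xxxxx 10xxxxxx */
    {
      if (len < 2 || (s[1] & 0xC0) != 0x80)
        return 0;
      c = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
      if (c < 0x80)                     /* overlong encoding */
        return 0;
      *cp = c;
      return 2;
    }

  if ((s[0] & 0xF0) == 0xE0)            /* 1110xxxx 10xxxxxx 10xxxxxx */
    {
      if (len < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
        return 0;
      c = ((uint32_t)(s[0] & 0x0F) << 12)
          | ((uint32_t)(s[1] & 0x3F) << 6)
          | (uint32_t)(s[2] & 0x3F);
      if (c < 0x800 || (c >= 0xD800 && c <= 0xDFFF))
        return 0;                       /* overlong or UTF-16 surrogate */
      *cp = c;
      return 3;
    }

  return 0;                             /* 4-byte lead or stray byte */
}
```

A caller would advance through the string by the returned byte count and
treat a return of 0 as "not decodable by this sketch". Extending it to the
supplementary planes would mean adding one more branch for the 4-byte form.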
G