On 1/23/2012 10:38 AM, Philip Martin wrote:
Garret Wilson<gar...@globalmentor.com>  writes:

On 1/23/2012 9:55 AM, Philip Martin wrote:
I thought you were proposing to write the code?
I'm fine with that as well. Looks like I would have to add a few lines
to decode UTF-8 (surely such code already exists in the Subversion
codebase somewhere) and change a few if(...){} statements. I should be
able to handle that. I would imagine it will take more effort on my
part to get permission to change the code than actually writing the
code itself.
The function receives a string of bytes, I think it's already in UTF-8.
The problem is that while Subversion has functions to validate UTF-8 it
doesn't have a system for extracting individual UTF-8 code points.  At
present it only ever needs to extract the ASCII subset which is trivial.

Ah. Well, like I said, I would be happy to write the UTF-8 extraction code. It would be worth it to me to get this functionality in; it would be a fun exercise and a good introduction to the codebase for me; it's a very small, low-risk task; and the Subversion codebase would be better off in the end. (I'm sure the code can be used elsewhere.) It's a win-win for everyone! :D

This is really a small thing. Here's an example in just a few lines: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

Or see DecodeUTF8BytesToChar at tidy.sourceforge.net/cgi-bin/lxr/source/src/utf8.c .

I would even be happy to exclude code points from the supplementary planes (i.e. those above U+FFFF), if anyone is worried about the code being too complicated.
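
To give a rough idea of the size of the task, here is a minimal sketch of a decoder restricted to the BMP. It is only an illustration: the name utf8_decode_bmp and its signature are hypothetical and not part of the Subversion API, and it assumes the caller passes a byte buffer and its length. It rejects truncated sequences, overlong forms, surrogates, and anything above U+FFFF.

```c
#include <stddef.h>

/* Hypothetical helper (not an existing Subversion function).
 * Decode one code point from the first bytes of S (LEN bytes available).
 * On success store it in *CP and return the number of bytes consumed (1-3).
 * Return 0 for truncated, malformed, overlong, or surrogate sequences, and
 * for 4-byte sequences (code points above U+FFFF), which this sketch
 * deliberately does not support. */
size_t
utf8_decode_bmp(const unsigned char *s, size_t len, unsigned long *cp)
{
  unsigned long c;

  if (len == 0)
    return 0;

  if (s[0] < 0x80)                 /* 0xxxxxxx: plain ASCII */
    {
      *cp = s[0];
      return 1;
    }

  if ((s[0] & 0xE0) == 0xC0)       /* 110xxxxx 10xxxxxx */
    {
      if (len < 2 || (s[1] & 0xC0) != 0x80)
        return 0;
      c = ((unsigned long)(s[0] & 0x1F) << 6) | (unsigned long)(s[1] & 0x3F);
      if (c < 0x80)                /* overlong encoding */
        return 0;
      *cp = c;
      return 2;
    }

  if ((s[0] & 0xF0) == 0xE0)       /* 1110xxxx 10xxxxxx 10xxxxxx */
    {
      if (len < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
        return 0;
      c = ((unsigned long)(s[0] & 0x0F) << 12)
          | ((unsigned long)(s[1] & 0x3F) << 6)
          | (unsigned long)(s[2] & 0x3F);
      if (c < 0x800)               /* overlong encoding */
        return 0;
      if (c >= 0xD800 && c <= 0xDFFF)  /* UTF-16 surrogates are not valid */
        return 0;
      *cp = c;
      return 3;
    }

  /* Stray continuation byte, invalid lead byte, or a 4-byte sequence. */
  return 0;
}
```

A caller would walk a string by calling this repeatedly and advancing by the returned byte count, treating a return value of 0 as "not handled here". Supporting the supplementary planes would mean adding one more branch for the 11110xxx lead byte.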

G
