Kenneth Porter wrote:
On Wednesday, April 12, 2006 2:26 AM +0200 Dirk <[EMAIL PROTECTED]> wrote:
In order to solve this issue, I'm thinking of embedding my own
windows-1252 to UTF-8 conversion table into the ssphys program. Then I
would output UTF-8 encoded strings and everybody would be happy.
Is there any code around that I could copy and paste?
There's a 1252-to-UTF8 conversion routine here:
<http://discuss.joelonsoftware.com/default.asp?joel.3.325282.13>
The author, Ben Bryant, initially says it's for 8859-1 but then
corrects himself in a subsequent post.
The trick is to pass anything outside the problem band (0x7f to 0x9f)
since it's already the same as UTF-8, and only do the lookup and
conversion for those 17 characters using a small table of equivalent
Unicode points and the appropriate UTF escape mechanism.
Actually, it is not only the problem band, but it is more or less what I
had in mind.
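
A minimal sketch of such a conversion, assuming C++11. The names kCp1252High, appendUtf8 and cp1252ToUtf8 are hypothetical and are not taken from ssphys or from the routine linked above. It also shows why the lookup table alone is not quite enough: bytes 0xA0 to 0xFF keep their code point, but each still has to be written as a two-byte UTF-8 sequence. The five unassigned bytes are replaced with an underscore here, which is only one possible policy.

```cpp
#include <cstdint>
#include <string>

// Unicode code points for windows-1252 bytes 0x80..0x9F.
// The five unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are stored as 0.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,  // 0x80..0x87
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,  // 0x88..0x8F
    0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,  // 0x90..0x97
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178   // 0x98..0x9F
};

// Append the UTF-8 encoding of a code point below U+10000.
static void appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a windows-1252 byte string to UTF-8.
std::string cp1252ToUtf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char b : in) {
        std::uint32_t cp = b;  // 0x00..0x7F and 0xA0..0xFF keep their value
        if (b >= 0x80 && b <= 0x9F) {
            cp = kCp1252High[b - 0x80];
            if (cp == 0) cp = '_';  // assumption: unassigned byte becomes '_'
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

With this sketch the euro sign (byte 0x80) becomes the three bytes E2 82 AC, and an e with acute accent (byte 0xE9) becomes C3 A9.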
I think the CXMLFormatter wants to use UTF-8 for the output encoding.
I don't think you need to set the locale. Digging further, it looks
like the place to fix this is in sanitizeForXML in XML.cpp. Instead of
converting invalid characters to underscores, use the table lookup and
regenerate the string.
The formatter itself isn't bound to UTF-8. When I started, I didn't
think that this would cause so many problems, so I simply used
setlocale to detect printable characters and the XML encoding to declare
the XML output to be in this locale. So the two lines
TiXmlDeclaration decl ("1.0", "windows-1252", "");
if (NULL == setlocale (LC_ALL, ".1252"))
must come as a pair.
Later I realized that setting the encoding is not enough in XML to
force the correct encoding, since outputting certain characters is
still discouraged.
I wrote the sanitizeForXML function before using the TinyXML library,
because even the 1252 code page contains unassigned code points and we
found these bytes within our comments. So the idea was to simply
translate them to an underscore. Later I kept translating the control
characters as well, since these still caused problems during the later
conversion step, even if they are correctly encoded into XML.
How should we treat a bell character in a comment? Translate it? What
about all the other control characters: LF, FF, ...? The comments in
particular seem to be very prone to corruption in VSS. I found a lot of
them that contained binary junk. Should we convert the junk?
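
One possible policy, sketched under assumptions: XML 1.0 allows only TAB, LF and CR below 0x20, and a character reference must also refer to an allowed character, so a bell or form feed cannot appear in an XML 1.0 document even in escaped form. That would explain why correctly encoded control characters still caused trouble later. The helper names below are hypothetical and this is not the actual sanitizeForXML code. It keeps LF and replaces the forbidden controls with an underscore.

```cpp
#include <string>

// True if the byte is a C0 control character that XML 1.0 does not allow.
// Only TAB (0x09), LF (0x0A) and CR (0x0D) are permitted below 0x20.
static bool isForbiddenXmlControl(unsigned char c) {
    return c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
}

// Hypothetical helper: return a copy with forbidden control bytes replaced by '_'.
std::string replaceForbiddenControls(const std::string& in) {
    std::string out(in);
    for (char& ch : out) {
        if (isForbiddenXmlControl(static_cast<unsigned char>(ch))) {
            ch = '_';
        }
    }
    return out;
}
```

Whether binary junk should be replaced, dropped, or preserved in some other encoding remains a policy question. The sketch only covers the part that XML 1.0 itself decides.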
I haven't yet dug far enough to see what consumes the resulting XML.
Does that understand the multi-byte UTF-8 characters that would result
from this conversion?
The other possibility is to ignore the conversion here and suppress
the sanitizer, and declare the encoding as Windows-1252, on the
assumption that the XML consumer knows how to read that. Again, don't
setlocale(), as we're not using the locale features of iostream to
render the output (I don't think). Or maybe setlocale(LC_ALL, "C") to suppress
any locale processing. The setlocale can happen around the call to
TiXmlDocument::Print in ~CXMLFormatter to isolate its effect to the
file output.
That's what I did, but it didn't work. According to
http://www.w3.org/TR/REC-xml/#charsets some characters that are still
allowed in the windows-1252 code page are discouraged in XML,
especially most of the characters in the band [x80-x9f].
It looks like tinyxml uses fprintf for everything. If it really wanted
to be tiny, it probably should have used fwrite. I don't think that
pays attention to the locale and treats everything as binary.
Nope, it uses its own PutString function to convert the characters, since
during output a few characters are encoded as hexadecimal character
references. This function uses the isspace, iscntrl, ... functions, which
rely heavily on the locale. In the end it only uses fprintf to write the
generated string to the file. It could have used fwrite as well.
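
To illustrate the mechanism, here is a sketch that is not TinyXML's PutString. A hypothetical helper, escapeControlBytes, writes bytes below 0x20 as hexadecimal character references, using a plain range comparison instead of iscntrl so that the result does not change with setlocale. Note that XML 1.0 parsers reject references to most control characters, so this only demonstrates the escaping step and does not by itself produce valid XML for arbitrary input.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: encode bytes below 0x20 as "&#xNN;" and copy all
// other bytes unchanged. The range check does not depend on the C locale.
std::string escapeControlBytes(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x20) {
            char buf[16];
            std::snprintf(buf, sizeof buf, "&#x%02X;", static_cast<unsigned>(c));
            out += buf;
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

Bytes from 0x80 upwards pass through untouched here, so whatever encoding the input is in is preserved.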
Dirk