Kenneth Porter wrote:
On Wednesday, April 12, 2006 2:26 AM +0200 Dirk <[EMAIL PROTECTED]> wrote:

In order to solve this issue, I'm thinking of embedding my own
windows-1252 to UTF-8 conversion table into the ssphys program. Then I
would output UTF-8 encoded strings and everybody would be happy.

Is there any code around that I could copy and paste?

There's a 1252-to-UTF8 conversion routine here:

<http://discuss.joelonsoftware.com/default.asp?joel.3.325282.13>

The author, Ben Bryant, initially says it's for 8859-1 but then corrects himself in a subsequent post.

The trick is to pass anything outside the problem band (0x7f to 0x9f) since it's already the same as UTF-8, and only do the lookup and conversion for those 17 characters using a small table of equivalent Unicode points and the appropriate UTF escape mechanism.

Actually, it is not only the problem band that needs converting, but this is more or less what I had in mind.
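To make the table idea concrete, here is a minimal sketch of such a conversion. It is not Ben Bryant's routine and not ssphys code; the names kCp1252High and cp1252ToUtf8 are made up for illustration, and mapping the five bytes that windows-1252 leaves unassigned to U+FFFD is an assumption. Note that bytes 0xA0 to 0xFF keep their code point but still need a two-byte UTF-8 sequence, which is why the problem band is not the whole story.

```cpp
#include <cstdint>
#include <string>

// Unicode code points for the windows-1252 bytes 0x80..0x9F.
// The unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to
// U+FFFD here; that choice is an assumption, not part of the code page.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

// Hypothetical helper: converts a windows-1252 byte string to UTF-8.
std::string cp1252ToUtf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char b : in) {
        // Outside 0x80..0x9F the byte value equals the Unicode code point.
        std::uint32_t cp = b;
        if (b >= 0x80 && b <= 0x9F) {
            cp = kCp1252High[b - 0x80];
        }
        if (cp < 0x80) {                      // 1-byte sequence (ASCII)
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {              // 2-byte sequence
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                              // 3-byte sequence
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Control characters below 0x20 pass through unchanged in this sketch, so they would still need separate handling before the string goes into XML.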
I think the CXMLFormatter wants to use UTF-8 for the output encoding. I don't think you need to set the locale. Digging further, it looks like the place to fix this is in sanitizeForXML in XML.cpp. Instead of converting invalid characters to underscores, use the table lookup and regenerate the string.
The formatter itself isn't bound to UTF-8. When I started, I didn't think this would cause so many problems, so I simply used setlocale to detect printable characters and the XML encoding declaration to declare the XML output to be in this locale. So the two lines

   TiXmlDeclaration decl ("1.0", "windows-1252", "");
   if (NULL == setlocale (LC_ALL, ".1252"))

must come as a pair.

Later I realized that the encoding declaration alone is not enough in XML to get correct output, since writing certain characters is still discouraged.

I wrote the sanitizeForXML function before using the TinyXML library, because even the 1252 codepage contains unassigned code points and we found these bytes in our comments. So the idea was to simply translate them to an underscore. Later I kept translating the control characters as well, since they still caused problems during the later conversion step, even when they were correctly encoded into XML.

How should we treat a bell character in a comment? Translate it? What about all the other control characters, LF, FF, ...? The comments especially seem to be very prone to corruption in VSS. I found a lot of them that contained binary junk. Should we convert the junk?

I haven't yet dug far enough to see what consumes the resulting XML. Does that understand the multi-byte UTF-8 characters that would result from this conversion?

The other possibility is to ignore the conversion here and suppress the sanitizer, and declare the encoding as Windows-1252, on the assumption that the XML consumer knows how to read that. Again, don't setlocale(), as we're not using the locale features of iostream to render the output (I don't think). Or maybe setlocale(LC_ALL, "C") to suppress any locale processing. The setlocale can happen around the call to TiXmlDocument::Print in ~CXMLFormatter to isolate its effect to the file output.

That's what I did, but it didn't work. According to http://www.w3.org/TR/REC-xml/#charsets, some characters that are still allowed in the windows-1252 codepage are discouraged in XML, especially most of the characters in the band [x80-x9f].
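The distinction in that section of the recommendation can be sketched as a small classifier. This reflects my reading of the Char production and the compatibility note in XML 1.0; the names XmlCharClass and classifyXmlChar are hypothetical and not part of ssphys or TinyXML. The ranges are Unicode code points, not windows-1252 byte values.

```cpp
#include <cstdint>

// Hypothetical three-way classification of a Unicode code point for XML 1.0.
enum class XmlCharClass { Invalid, Discouraged, Allowed };

XmlCharClass classifyXmlChar(std::uint32_t cp) {
    // Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
    //        | [#x10000-#x10FFFF]
    const bool isChar =
        cp == 0x9 || cp == 0xA || cp == 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF);
    if (!isChar) {
        return XmlCharClass::Invalid;
    }
    // Legal but discouraged: C1 controls (except U+0085), DEL,
    // and the Unicode noncharacters.
    const bool discouraged =
        (cp >= 0x7F && cp <= 0x84) ||
        (cp >= 0x86 && cp <= 0x9F) ||
        (cp >= 0xFDD0 && cp <= 0xFDEF) ||
        (cp >= 0x10000 && (cp & 0xFFFE) == 0xFFFE);
    return discouraged ? XmlCharClass::Discouraged : XmlCharClass::Allowed;
}
```

Under this reading, a bell character (0x07) is not a legal XML 1.0 character at all, while the byte 0x80 is only a problem if it is written as code point U+0080; once mapped to U+20AC it falls in the allowed range.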


It looks like tinyxml uses fprintf for everything. If it really wanted to be tiny, it probably should have used fwrite. I don't think that pays attention to the locale and treats everything as binary.
Nope, it uses its own PutString function to convert the characters, since during output a few characters are encoded as hexadecimal character references. This function uses the isspace, iscntrl, ... functions, which rely heavily on the locale. In the end it only uses fprintf to write the generated string to the file. It could have used fwrite as well.
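Because isspace and iscntrl consult the current C locale, one way to take setlocale out of the picture is to test byte values directly. The following is only a sketch of that idea with hypothetical helper names; it is not TinyXML's PutString and not the existing sanitizeForXML. It keeps tab, LF and CR and replaces the remaining ASCII control bytes with an underscore, as the sanitizer described above does.

```cpp
#include <string>

// Locale-independent checks on raw byte values (hypothetical helpers).
inline bool isAsciiControl(unsigned char b) {
    return b < 0x20 || b == 0x7F;
}

inline bool isXmlWhitespace(unsigned char b) {
    return b == 0x20 || b == 0x09 || b == 0x0A || b == 0x0D;
}

// Replaces ASCII control bytes that are not XML whitespace with `fill`.
// Bytes >= 0x80 are left alone; they depend on the chosen encoding.
std::string replaceControlBytes(const std::string& in, char fill = '_') {
    std::string out;
    out.reserve(in.size());
    for (unsigned char b : in) {
        if (isAsciiControl(b) && !isXmlWhitespace(b)) {
            out += fill;
        } else {
            out += static_cast<char>(b);
        }
    }
    return out;
}
```

The result does not change with the locale, so the same input gives the same output whether or not setlocale has been called.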

Dirk

_______________________________________________
vss2svn-users mailing list
Project homepage:
http://www.pumacode.org/projects/vss2svn/
Subscribe/Unsubscribe/Admin:
http://lists.pumacode.org/mailman/listinfo/vss2svn-users-lists.pumacode.org