--On Wednesday, April 12, 2006 10:28 AM +0200 Dirk <[EMAIL PROTECTED]> wrote:
The formatter itself isn't bound to UTF-8. When I started, I didn't
think that this would cause so many problems, so I simply used setlocale
to detect printable characters and the XML encoding to declare the XML
output to be in this locale. So the two lines
TiXmlDeclaration decl ("1.0", "windows-1252", "");
if (NULL == setlocale (LC_ALL, ".1252"))
must come as a pair.
Later I realized that the encoding setting alone is not enough in XML to
force the correct encoding, since outputting certain characters is still
discouraged.
I suspect that this is an input versus output issue. Doesn't Subversion do
everything with UTF-8 internally? So if the XML is ultimately intended to
migrate to svn input, I'd suggest converting to UTF-8 early in the process,
and making all the XML UTF-8.
How should we treat a bell character in a comment? Translate it? What
about all the other control characters? LF, FF, ...? Especially the
comments seem to be very sensitive to corruption in VSS. I found a lot of
them that contained binary junk. Should we convert the junk?
This is more a policy question than a technical question. Anything below
0x20 that's not white space can probably be dropped as noise. Stuff between
0x80 and 0x9f has multi-byte UTF-8 equivalents.
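
A minimal sketch of the "drop the noise" policy, assuming that "white space"
means the three control characters XML 1.0 permits (tab, LF, CR). Under that
assumption FF and the bell character are dropped. The function name
strip_control_chars is made up for illustration and is not part of vss2svn.

```cpp
#include <string>

// Hypothetical helper: remove control characters below 0x20 that XML 1.0
// does not allow. Tab (0x09), LF (0x0A) and CR (0x0D) are kept.
std::string strip_control_chars(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        const bool is_xml_whitespace = (c == 0x09 || c == 0x0A || c == 0x0D);
        if (c >= 0x20 || is_xml_whitespace)
            out.push_back(static_cast<char>(c));
    }
    return out;
}
```

Bytes at 0x80 and above pass through untouched here; they still need the
code page conversion before they are valid UTF-8.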
That's what I did. But it didn't work. According to
http://www.w3.org/TR/REC-xml/#charsets some characters that are still
allowed in the windows-1252 codepage are discouraged in XML, especially
most of the characters in the range [x80-x9f].
They're not discouraged. They just have a different encoding. That's where
you need the suggested table lookup to generate the multi-byte equivalent.
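
The table lookup could look roughly like the sketch below. It assumes the
input is windows-1252 and maps the five bytes that code page leaves undefined
(0x81, 0x8d, 0x8f, 0x90, 0x9d) to U+FFFD; that choice, and the names
kCp1252High, append_utf8 and cp1252_to_utf8, are illustrative assumptions and
not existing vss2svn code.

```cpp
#include <string>

// Unicode code points for windows-1252 bytes 0x80..0x9F.
// 0 marks the five bytes that windows-1252 leaves undefined.
static const unsigned short kCp1252High[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

// Append one code point (never above U+FFFF in this sketch) as UTF-8.
static void append_utf8(std::string& out, unsigned int cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Hypothetical converter: windows-1252 bytes in, UTF-8 bytes out.
std::string cp1252_to_utf8(const std::string& in)
{
    std::string out;
    for (unsigned char c : in) {
        unsigned int cp = c;  // 0x00..0x7F and 0xA0..0xFF equal their code points
        if (c >= 0x80 && c <= 0x9F) {
            cp = kCp1252High[c - 0x80];
            if (cp == 0)
                cp = 0xFFFD;  // assumption: undefined bytes become U+FFFD
        }
        append_utf8(out, cp);
    }
    return out;
}
```

With this, the byte 0x80 (the euro sign in windows-1252) becomes the three
bytes E2 82 AC, so the XML declaration can say UTF-8 and none of the
code points in U+0080..U+009F ever reach the output.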
_______________________________________________
vss2svn-users mailing list
Project homepage:
http://www.pumacode.org/projects/vss2svn/
Subscribe/Unsubscribe/Admin:
http://lists.pumacode.org/mailman/listinfo/vss2svn-users-lists.pumacode.org