--On Wednesday, April 12, 2006 10:28 AM +0200 Dirk <[EMAIL PROTECTED]> wrote:

The formatter itself isn't bound to UTF-8. When I started, I didn't
think this would cause so many problems, so I simply used setlocale
to detect printable characters and the XML encoding declaration to define
the XML output to be in this locale. So the two lines

    TiXmlDeclaration decl ("1.0", "windows-1252", "");
    if (NULL == setlocale (LC_ALL, ".1252"))

must come as a pair.

Later I realized that the encoding declaration alone is not enough in XML
to force the correct encoding, since outputting certain characters is
still discouraged.

I suspect that this is an input versus output issue. Doesn't Subversion do everything with UTF-8 internally? So if the XML is ultimately intended to migrate to svn input, I'd suggest converting to UTF-8 early in the process, and making all the XML UTF-8.

How should we treat a bell character in a comment? Translate it? What
about all other control characters? LF, FF, ... ? The comments especially
seem to be very prone to corruption in VSS. I found a lot of them that
contained binary junk. Should we convert the junk?

This is more a policy question than a technical question. Anything below 0x20 that's not white space can probably be dropped as noise. Bytes from 0x80 to 0x9f have multi-byte UTF-8 equivalents.
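
A minimal sketch of that "drop the noise" policy, in case it helps. It assumes the comment is held in a std::string, and it keeps only TAB, LF and CR from the range below 0x20, since those are the only control characters XML 1.0 allows (so BEL and FF go away). The function name dropControlNoise is made up for illustration and is not part of the converter.

```cpp
#include <string>

// Hypothetical helper: remove control characters that XML 1.0 cannot carry.
// Below 0x20 only TAB (0x09), LF (0x0A) and CR (0x0D) are legal XML characters;
// everything else in that range (BEL, FF, binary junk) is dropped.
std::string dropControlNoise(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c >= 0x20 || c == 0x09 || c == 0x0A || c == 0x0D) {
            out += in[i];
        }
    }
    return out;
}
```

Whether to drop the junk silently or replace it with a visible marker is still the policy decision; this sketch only shows the drop variant.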

That's what I did, but it didn't work. According to
http://www.w3.org/TR/REC-xml/#charsets some characters that are still
allowed in the windows-1252 codepage are discouraged in XML, especially
most of the characters in the range [x80-x9f].

They're not discouraged. They just have a different encoding. That's where you need the suggested table lookup to generate the multi-byte equivalent.
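
Roughly what I mean by the table lookup, as a sketch rather than a drop-in patch. It assumes the input string holds windows-1252 bytes and needs C++11 for the range-based for loop. Bytes 0x80-0x9f are mapped through a table to their Unicode code points, everything else maps to the code point of the same value, and the result is written out as UTF-8. The names kCp1252High, appendUtf8 and cp1252ToUtf8 are invented here, and turning the five bytes that windows-1252 leaves undefined into U+FFFD is my own assumption.

```cpp
#include <cstdint>
#include <string>

// Unicode code points for windows-1252 bytes 0x80..0x9F.
// 0 marks the five bytes that windows-1252 leaves undefined.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

// Append one code point from the Basic Multilingual Plane as UTF-8.
static void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a windows-1252 byte string to UTF-8.
std::string cp1252ToUtf8(const std::string& in)
{
    std::string out;
    for (unsigned char c : in) {
        std::uint32_t cp = c;               // 0x00-0x7F and 0xA0-0xFF map 1:1
        if (c >= 0x80 && c <= 0x9F) {
            cp = kCp1252High[c - 0x80];     // table lookup for the odd band
            if (cp == 0) cp = 0xFFFD;       // assumption: undefined -> U+FFFD
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

With this, the byte 0x80 (the euro sign in windows-1252) comes out as the three bytes E2 82 AC, which is U+20AC and a perfectly legal XML character, so the declaration can simply say UTF-8.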

