Later I realized that setting the encoding in the XML declaration is
not enough to get correct output, since writing certain characters is
still discouraged.

I suspect that this is an input versus output issue. Doesn't Subversion do everything with UTF-8 internally? So if the XML is ultimately meant to become svn input, I'd suggest converting to UTF-8 early in the process and making all the XML UTF-8.

Yes, but I hoped that Perl already came with a decent XML parser, so that I could leave this task to Perl.

This is more a policy question than a technical question. Anything below 0x20 that's not white space can probably be dropped as noise. Stuff between 0x7f and 0x9f has multi-byte UTF-8 equivalents.
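
As a rough illustration of that policy, here is a minimal Python sketch. The helper name is made up for this example and is not part of ssphys or vss2svn; it assumes that "white space" means tab, line feed and carriage return, the only characters below 0x20 that XML 1.0 allows.

```python
# Hypothetical helper, not part of ssphys or vss2svn.
# Assumption: "white space" means TAB, LF and CR, which are the only
# characters below 0x20 that XML 1.0 permits.
XML_WHITESPACE = {0x09, 0x0A, 0x0D}


def strip_control_bytes(raw: bytes) -> bytes:
    """Drop every byte below 0x20 that is not XML white space."""
    return bytes(b for b in raw if b >= 0x20 or b in XML_WHITESPACE)


if __name__ == "__main__":
    sample = b"check\x00in\tcomment\x1f\n"
    print(strip_control_bytes(sample))
```

Bytes at 0x80 and above pass through untouched here; they still need a separate code page conversion.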

Currently I try to put everything printable into the XML. The only problem: Linux doesn't know CP1252, so it can't decide what is printable.

That's what I did. But it didn't work. According to
http://www.w3.org/TR/REC-xml/#charsets some characters that are still
allowed in the windows-1252 code page are discouraged in XML,
especially most of the characters in the range [x80-x9f].

They're not discouraged. They just have a different encoding. That's where you need the suggested table lookup to generate the multi-byte equivalent.

They explicitly state on the page mentioned above: "The characters defined in the following ranges are also discouraged"
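
For what it's worth, the ranges listed in the XML spec are Unicode code points, while 0x80-0x9f in windows-1252 are byte values that map to other code points. A small Python sketch of the table lookup, using Python's built-in cp1252 codec as the table, may make the difference concrete. The function name is hypothetical, and replacing the five bytes that windows-1252 leaves undefined with U+FFFD is just one possible policy, not something decided in this thread.

```python
# Hypothetical converter sketch; Python's built-in "cp1252" codec
# serves as the lookup table.
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Decode windows-1252 bytes and re-encode them as UTF-8.

    Assumption: the five bytes windows-1252 leaves undefined
    (0x81, 0x8d, 0x8f, 0x90, 0x9d) are replaced with U+FFFD.
    """
    return raw.decode("cp1252", errors="replace").encode("utf-8")


if __name__ == "__main__":
    # Byte 0x80 is the euro sign, U+20AC, which lies outside the
    # U+0080-U+009F code point range.
    print(hex(ord(b"\x80".decode("cp1252"))))
    print(cp1252_to_utf8(b"\x93quoted\x94"))
```

After such a conversion, the defined bytes from that band end up as ordinary multi-byte UTF-8 sequences rather than as code points in the discouraged range.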

To cut a long story short: sooner or later I will add a windows-1252 to UTF-8 converter to ssphys, so that the output XML is in UTF-8. But I can't do it in the near future (roughly the next 3 weeks). I have no objections if someone else does the job. I just want to keep ssphys lean and mean.

Dirk

