Later I realized that the encoding declaration alone is not enough in
XML to force the correct encoding, since outputting certain characters
is still discouraged.
I suspect that this is an input versus output issue. Doesn't
Subversion do everything with UTF-8 internally? So if the XML is
ultimately intended to migrate to svn input, I'd suggest converting to
UTF-8 early in the process and making all the XML UTF-8.
Yes, but I hoped that Perl already comes with a decent XML parser, so
I could leave this task to Perl.
This is more a policy question than a technical question. Anything
below 0x20 that's not white space can probably be dropped as noise.
Stuff between 0x7f and 0x9f has multi-byte UTF-8 equivalents.
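As a rough illustration of the "drop the noise" policy, here is a minimal
Python sketch. It assumes the input is a raw byte string and that tab, line
feed and carriage return are the only bytes below 0x20 worth keeping (these
are the ones XML 1.0 permits). The name drop_control_noise is hypothetical
and not part of ssphys or vss2svn.

```python
# Hypothetical helper: the name and the exact policy are assumptions.
XML_WHITESPACE = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return


def drop_control_noise(raw: bytes) -> bytes:
    """Remove bytes below 0x20 that are not XML white space."""
    return bytes(b for b in raw if b >= 0x20 or b in XML_WHITESPACE)


if __name__ == "__main__":
    print(drop_control_noise(b"a\x01b\tc\n\x1f"))
```

Whether silently dropping such bytes is acceptable is exactly the policy
question; replacing them with a visible marker would be an equally small
change.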
Currently I try to put everything printable into the XML. The only
problem: Linux doesn't know CP1252, so it can't decide what is printable.
That's what I did, but it didn't work. According to
http://www.w3.org/TR/REC-xml/#charsets, some characters that are still
allowed in the windows-1252 codepage are discouraged in XML, especially
most of the characters in the range [x80-x9f].
They're not discouraged. They just have a different encoding. That's
where you need the suggested table lookup to generate the multi-byte
equivalent.
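A table lookup of this kind could look roughly like the Python sketch below.
The table holds the standard windows-1252 assignments for bytes 0x80-0x9f.
The function name cp1252_to_utf8 and the choice to map the five unassigned
bytes (0x81, 0x8d, 0x8f, 0x90, 0x9d) to U+FFFD are assumptions made for
illustration, not what ssphys does.

```python
# windows-1252 bytes 0x80-0x9f and their Unicode code points.
# Bytes missing from the table are unassigned in windows-1252.
CP1252_HIGH = {
    0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E,
    0x85: 0x2026, 0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6,
    0x89: 0x2030, 0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152,
    0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019, 0x93: 0x201C,
    0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
    0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A,
    0x9C: 0x0153, 0x9E: 0x017E, 0x9F: 0x0178,
}

REPLACEMENT = 0xFFFD  # assumption: unassigned bytes become U+FFFD


def cp1252_to_utf8(raw: bytes) -> bytes:
    """Convert windows-1252 bytes to UTF-8 using a lookup table."""
    chars = []
    for b in raw:
        if 0x80 <= b <= 0x9F:
            cp = CP1252_HIGH.get(b, REPLACEMENT)
        else:
            # 0x00-0x7f and 0xa0-0xff share their code point with Latin-1.
            cp = b
        chars.append(chr(cp))
    return "".join(chars).encode("utf-8")


if __name__ == "__main__":
    # Euro sign: one byte in windows-1252, three bytes in UTF-8.
    print(cp1252_to_utf8(b"\x80"))
```

The same 27-entry table ports directly to a static array in C or C++, so a
converter along these lines should not add much weight to a tool.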
They explicitly state on the page mentioned above: "The characters
defined in the following ranges are also discouraged"
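One way to look at the disagreement: the ranges in the XML recommendation
are Unicode code points, not windows-1252 byte values. The Python sketch
below checks a code point against the discouraged ranges as I read them in
the XML 1.0 recommendation ([#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF] and
the last two code points of each supplementary plane). The function name
is_discouraged is made up for this example.

```python
def is_discouraged(cp: int) -> bool:
    """True if the code point is in a range XML 1.0 calls discouraged."""
    if 0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F:
        return True
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+1FFFE/U+1FFFF up to U+10FFFE/U+10FFFF
    return 0x10000 <= cp <= 0x10FFFF and (cp & 0xFFFF) >= 0xFFFE


if __name__ == "__main__":
    byte = b"\x93"  # left double quotation mark in windows-1252
    copied_verbatim = ord(byte.decode("latin-1"))  # U+0093
    decoded = ord(byte.decode("cp1252"))           # U+201C
    print(is_discouraged(copied_verbatim), is_discouraged(decoded))
```

Under that reading, a byte such as 0x93 only lands in a discouraged range
when it is copied through unchanged as U+0093. Decoded as windows-1252 it
becomes U+201C, which is outside those ranges.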
To cut a long story short: sooner or later I will add a windows-1252 to
UTF-8 converter to ssphys, so that the output XML is in UTF-8. But I
can't do it in the near future (roughly the next 3 weeks). I have no
objections if someone else does the job. I just want to keep ssphys lean
and mean.
Dirk