Hi,
I just found a comment saying that Windows UNICODE is UCS-2. What do you
think about the following Windows-specific code to convert the
decoded ANSI input to UTF-8:
// Convert file ANSI to Windows UNICODE (AKA UCS-2)
MultiByteToWideChar(CP_ACP,0,....);
// now convert from Windows UNICODE (AKA UCS-2) to UTF-8
WideCharToMultiByte(CP_UTF8,0,....);
On Linux we could use iconv, or whatever.
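A minimal sketch of the iconv route, assuming a POSIX-style iconv is available and that it knows the encoding name "WINDOWS-1252" (the accepted names depend on the iconv implementation). The function name convert_encoding is made up for illustration and is not part of ssphys:

```cpp
#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

// Convert a byte string from one encoding to another with iconv.
// Hypothetical helper: the encoding names are passed straight to iconv_open.
std::string convert_encoding(const std::string& input,
                             const char* from_code,
                             const char* to_code)
{
    iconv_t cd = iconv_open(to_code, from_code);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: unsupported conversion");

    std::string output;
    char buffer[256];
    char* in_ptr = const_cast<char*>(input.data());
    size_t in_left = input.size();

    while (in_left > 0) {
        char* out_ptr = buffer;
        size_t out_left = sizeof(buffer);
        size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        output.append(buffer, sizeof(buffer) - out_left);
        // E2BIG only means the output buffer was full; loop and continue.
        if (rc == (size_t)-1 && errno != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }

    // Flush any shift state (a no-op for UTF-8, needed for stateful targets).
    char* out_ptr = buffer;
    size_t out_left = sizeof(buffer);
    iconv(cd, nullptr, nullptr, &out_ptr, &out_left);
    output.append(buffer, sizeof(buffer) - out_left);

    iconv_close(cd);
    return output;
}
```

Called as convert_encoding(bytes, "WINDOWS-1252", "UTF-8"), this goes directly from the source code page to UTF-8 without the intermediate wide-character step that the Windows calls need.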
Dirk
Dirk wrote:
And also look here for TinyXML's support for UTF-8:
http://www.grinninglizard.com/tinyxmldocs/index.html
We're using TinyXML just for writing, so it doesn't need recognition
code. The converter may run on a platform other than that used to
create the VSS DB, so the VSS locale may not be available. Hence my
latest patch to force TinyXML to "trust" the specified locale, declare
it in the output XML, and pass through all characters from the source
physical files unmodified. It's left to the Perl XML parser to
convert the encoding to Unicode internally. Currently the XML
encoding is hardcoded to Windows-1252 and needs to be patched in the
C++ source by users of other VSS locales. It might be desirable to
pass this in as an argument to ssphys.
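One way the encoding could be passed in rather than hardcoded, as a rough sketch only. The "--encoding=" option and both function names are hypothetical; ssphys does not have them today, and the real option parsing would look different:

```cpp
#include <string>

// Hypothetical: pick the XML encoding name from the command line,
// falling back to the currently hardcoded Windows-1252.
std::string encoding_from_args(int argc, const char* const argv[])
{
    const std::string prefix = "--encoding=";
    for (int i = 1; i < argc; ++i) {
        std::string arg = argv[i];
        if (arg.compare(0, prefix.size(), prefix) == 0)
            return arg.substr(prefix.size());
    }
    return "Windows-1252";
}

// Build the XML declaration for the chosen encoding. The name is written
// as-is, so it must match the encoding of the bytes that follow.
std::string xml_declaration(const std::string& encoding)
{
    return "<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>";
}
```

The declaration only labels the bytes; nothing is converted here, which matches the "trust the specified locale" approach.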
Yes, that's what I have always written in my mails. Let ssphys pass all
characters unmodified and use Perl's XML parser for the real
conversion. After reading the links mentioned in my first mail of this
thread, I am starting to think a little differently:
1.) We still have the problem of transporting "discouraged" characters,
even if they are encoded in the "&#" form
2.) We need to specify the correct codepage (this should be easy) in
the header
3.) We should bypass the console and directly write into a file.
While I'm searching for more information, do you have any idea about
the encoding of what Windows thinks is UNICODE? Is it UTF-16 or UCS-2?
If I understand everything correctly, UTF-16 is again a variable-length
encoding, since code points from 0x10000 upward are mapped to two 16-bit
code units.
UCS-2 seems to clip the possible range of Unicode scalar values to
those below 0x10000.
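To make the difference concrete, here is a small sketch of how UTF-16 encodes one scalar value. The function names are made up for illustration; the arithmetic is the standard surrogate-pair scheme:

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode scalar value as UTF-16 code units.
// Values below 0x10000 take one unit; values from 0x10000 to 0x10FFFF
// take a surrogate pair. Returns an empty vector for invalid input.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
    std::vector<std::uint16_t> units;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return units;  // not a Unicode scalar value
    if (cp < 0x10000) {
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        std::uint32_t v = cp - 0x10000;  // 20-bit value
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return units;
}

// UCS-2 has no surrogate mechanism, so only values below 0x10000 fit.
bool fits_in_ucs2(std::uint32_t cp)
{
    return cp < 0x10000 && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

For example, U+00E4 stays the single unit 0x00E4, while U+1D11E becomes the pair 0xD834 0xDD1E. Everything a single-byte ANSI code page can produce lies below 0x10000, so for that input the two encodings give the same 16-bit values.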
I have a better understanding of the "discouraged" characters now:
(http://skew.org/xml/tutorial/)
Note that the XML 1.0 Recommendation refers to UCS characters by
their Unicode scalar values, using a notation of |#x| followed by
only as many hex digits as needed. So |#x9| in the EBNF productions
means the abstract character that would be represented in Unicode
3.1's "U+" notation as |U+0009|. It does not necessarily mean a byte
with hex value 9.
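Read that way, the character rules can be checked on scalar values after decoding, not on raw bytes. A sketch, assuming the Char production of XML 1.0 and, for the second function, only the two control-character ranges that the Recommendation's note lists as discouraged (the note lists further ranges that are left out here). Both function names are made up:

```cpp
#include <cstdint>

// True if cp matches the Char production of XML 1.0:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
bool is_xml10_char(std::uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// True for the legal but discouraged control ranges
// [#x7F-#x84] and [#x86-#x9F]. Partial check: other discouraged
// ranges (noncharacters) are not covered by this sketch.
bool is_discouraged_control(std::uint32_t cp)
{
    return (cp >= 0x7F && cp <= 0x84) || (cp >= 0x86 && cp <= 0x9F);
}
```

This is where the code page matters: the Windows-1252 byte 0x80 decodes to U+20AC, which is fine, but the same byte read as ISO-8859-1 becomes U+0080, which falls into the discouraged range.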
I always interpreted this UCS mapping and Microsoft's UNICODE mapping
as being equivalent. I'm not so sure anymore.
So what is the correct way?
1.) We know that VSS is encoded in the current user's ANSI locale
2.) We can automatically determine this locale, or we can specify it
on the command line
3.) We can output the XML file in the current locale, even if we get
discouraged characters
4.) We can try to convert to UTF-8 using the WideCharToMultiByte
functions
Grrrrr
Any further ideas?
Dirk
_______________________________________________
vss2svn-users mailing list
Project homepage:
http://www.pumacode.org/projects/vss2svn/
Subscribe/Unsubscribe/Admin:
http://lists.pumacode.org/mailman/listinfo/vss2svn-users-lists.pumacode.org
Mailing list web interface (with searchable archives):
http://dir.gmane.org/gmane.comp.version-control.subversion.vss2svn.user