Re: Unicode again

Dirk Wed, 16 Aug 2006 15:35:37 -0700

Hi,

I just found a comment saying that Windows UNICODE is UCS-2. What do you think about the following Windows-specific code to convert from the decoded ANSI input to UTF-8:


  // Convert file ANSI to Windows UNICODE (AKA UCS-2)
  MultiByteToWideChar(CP_ACP,0,....);

  // now convert from Windows UNICODE (AKA UCS-2) to UTF-8
  WideCharToMultiByte(CP_UTF8,0,....);


On Linux we could use iconv or something similar.
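
To illustrate the iconv route, here is a minimal sketch. It assumes a POSIX iconv implementation (such as the one in glibc) that knows the charset name "CP1252". The helper name ansiToUtf8 and the default charset are assumptions made for illustration, not ssphys code. On Windows, the MultiByteToWideChar / WideCharToMultiByte pair would play the same role.

```cpp
#include <iconv.h>

#include <cassert>
#include <cerrno>
#include <stdexcept>
#include <string>

// Convert a byte string from `fromCharset` to UTF-8 using POSIX iconv.
// `ansiToUtf8` is a hypothetical helper name, not part of ssphys.
std::string ansiToUtf8(const std::string& input,
                       const char* fromCharset = "CP1252")
{
    iconv_t cd = iconv_open("UTF-8", fromCharset);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: unsupported charset");

    std::string output;
    char buffer[256];
    // iconv takes a non-const pointer but does not modify the input bytes.
    char* in = const_cast<char*>(input.data());
    size_t inLeft = input.size();

    while (inLeft > 0) {
        char* out = buffer;
        size_t outLeft = sizeof(buffer);
        size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
        int err = errno;  // save before any other call can change it
        output.append(buffer, sizeof(buffer) - outLeft);
        // E2BIG only means the chunk buffer is full, so keep looping.
        if (rc == static_cast<size_t>(-1) && err != E2BIG) {
            // EILSEQ or EINVAL: a byte with no mapping in the source charset.
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid input byte");
        }
    }
    assert(inLeft == 0);  // all input bytes were consumed
    iconv_close(cd);
    return output;
}
```

With a working CP1252 table, ansiToUtf8("\xE4") should yield the two bytes C3 A4, the UTF-8 form of U+00E4. A stateful source encoding would need an extra flush call at the end, which this sketch leaves out because single-byte code pages do not need it.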

Dirk


Dirk wrote:

And also look here for TinyXML's support for UTF-8:
http://www.grinninglizard.com/tinyxmldocs/index.html
We're using TinyXML just for writing, so it doesn't need recognition code. The converter may run on a platform other than that used to create the VSS DB, so the VSS locale may not be available. Hence my latest patch to force TinyXML to "trust" the specified locale, declare it in the output XML, and pass through all characters from the source physical files unmodified. It's left to the Perl XML parser to convert the encoding to Unicode internally. Currently the XML encoding is hardcoded to Windows-1252 and needs to be patched in the C++ source by users of other VSS locales. It might be desirable to pass this in as an argument to ssphys.
Yes, that's what I always wrote in my mails. Let ssphys pass all characters unmodified and use Perl's XML parser for the real conversion. After reading the links mentioned in my first mail of this thread, I am starting to think a little differently:
1.) We still have the problem of transporting "discouraged" characters, even if they are encoded in the "&#" form
2.) We need to specify the correct codepage (this should be easy) in the header
3.) We should bypass the console and write directly into a file.
While I'm searching for more information, do you have an idea about the encoding of what Windows thinks is UNICODE? Is it UTF-16 or UCS-2? If I understand everything correctly, UTF-16 is again a variable-length encoding, since code points from 0x10000 upward are mapped to two 16-bit code values.
UCS-2 seems to clip the possible range of all Unicode scalar values.
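
The difference between the two encodings can be shown in a few lines. This sketch assumes nothing beyond the standard UTF-16 rules, and encodeUtf16 is a hypothetical helper name. UCS-2 can only store the single-unit case, while UTF-16 splits values from 0x10000 upward into a surrogate pair. To my knowledge, Windows has treated its wide strings as UTF-16 since Windows 2000, so the "UCS-2" label in older comments is probably best read as historical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode scalar value as UTF-16 code units.
// Returns an empty vector for values that are not scalar values
// (surrogates U+D800..U+DFFF, or anything above U+10FFFF).
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp)
{
    std::vector<std::uint16_t> units;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return units;

    if (cp < 0x10000) {
        // Basic Multilingual Plane: one unit, identical in UCS-2 and UTF-16.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Supplementary planes: not representable in UCS-2.
        std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
        assert(v <= 0xFFFFF);
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return units;
}
```

For example, U+20AC stays a single unit 0x20AC, while U+1D11E becomes the pair 0xD834 0xDD1E.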
I have a better understanding of the "discouraged" characters now (http://skew.org/xml/tutorial/):
Note that the XML 1.0 Recommendation refers to UCS characters by their Unicode scalar values, using a notation of |#x| followed by only as many hex digits as needed. So |#x9| in the EBNF productions means the abstract character that would be represented in Unicode 3.1's "U+" notation as |U+0009|. It does not necessarily mean a byte with hex value 9.
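
The character ranges behind this can be written down as a small check. This is a sketch based on the Char production of XML 1.0 and on the DEL/C1 ranges that the Recommendation lists as discouraged. The function names are hypothetical, and the discouraged noncharacter ranges are left out for brevity.

```cpp
#include <cstdint>

// True if `cp` matches the Char production of XML 1.0:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
bool isXmlChar(std::uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// True if `cp` is a legal Char that falls into the DEL/C1 control ranges
// XML 1.0 marks as discouraged: [#x7F-#x84] and [#x86-#x9F].
// Note that #x85 (NEL) is deliberately not part of these ranges.
bool isDiscouragedXmlChar(std::uint32_t cp)
{
    return (cp >= 0x7F && cp <= 0x84) || (cp >= 0x86 && cp <= 0x9F);
}
```

The values are code points, not bytes. Windows-1252 bytes 0x80..0x9F only land in the discouraged range if they are passed through as if they were Latin-1. After a real Windows-1252 conversion they become characters such as U+20AC, which are ordinary legal characters.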
I always interpreted this UCS mapping and Microsoft's UNICODE mapping as being equivalent. I'm not so sure anymore.
So what is the correct way?

1.) We know that VSS is encoded in the current user's ANSI locale
2.) We can automatically determine this locale, or we can specify it on the command line
3.) We can output the XML file in the current locale, even if we get discouraged characters
4.) We can try to convert to UTF-8 using the WideCharToMultiByte functions
Grrrrr

Any further ideas?

Dirk
_______________________________________________
vss2svn-users mailing list
Project homepage:
http://www.pumacode.org/projects/vss2svn/
Subscribe/Unsubscribe/Admin:
http://lists.pumacode.org/mailman/listinfo/vss2svn-users-lists.pumacode.org
Mailing list web interface (with searchable archives):
http://dir.gmane.org/gmane.comp.version-control.subversion.vss2svn.user
