RE: Win32::OLE - ? encoding of returned strings
When Win32::OLE uses CP_UTF8, then it will convert the UCS2 as returned from COM (the BSTR datatype) into UTF-8. There should be no loss of information in this step, as everything in UCS2 can also be represented in UTF8. The conversion is done by a Windows API and not by Perl internals.

Win32::OLE then asks Perl to downgrade to a non-UTF-8 string if possible, because a lot of modules don't handle UTF-8 strings correctly. This step is error-prone, as Perl always assumes that byte strings are Latin-1 encoded and does not take the Windows code page into account. If *all* characters in the returned string are in Latin-1 and your Windows codepage is *not* 1252, then you may get distorted results.

I guess I should put some support into Win32::OLE to force it to leave the string in UTF8. But I think this should be done in the wider context of teaching Perl to deal with Windows code pages better.

Cheers,
-Jan

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Mike Trotman
Sent: April 29, 2007 2:36 PM
To: perl-win32-users@listserv.ActiveState.com
Subject: Re: Win32::OLE - ? encoding of returned strings

Thanks - that's very useful. And I had missed the Unicode mode in the Win32::OLE documentation.

To clarify: part of my problem is that the MSXML XSLT transformations do NOT output the character encoding in the XML declaration. So

    $xmldoc->transformNode($xsltsheet)

may output Windows-1252 encoded data - but does not contain encoding='Windows-1252' in the <?xml version='1.0'?> declaration, and so other parts of the process expect it to contain UTF-8.

If I turn on the Win32::OLE UTF8 code page for a Win32::OLE call to MSXML that returns a UTF-16 string (or Windows-1252) - is the automatic conversion to UTF8 likely to be correct - or incorrect? I.e. - does the automatic conversion correctly detect the CP of the OLE output?

Thanks

Jan Dubois wrote:
> On Fri, 27 Apr 2007, Mike Trotman wrote:
>> I am writing a CGI application that uses Win32::OLE to interface to Microsoft ADO, MS Access, SQL Server and MSXML for XML documents and XSLT transformations.
>>
>> I suspect that something in the way I am passing data around (or in the ADO implementation of 'savetoxml') is not dealing correctly with XML document encoding declarations. The data is originally in an MS Access database - but has been entered using copy and paste from MS Word documents from around the world - so contains many weird and wonderful bytes.
>>
>> To help in my debugging process can anyone tell me how Win32::OLE deals with 'strings' returned from method calls? i.e.
>> - are they pure byte data as output by the method (and maybe in UTF-16)?
>> - or are they converted to Perl's internal format (using any current Perl encoding settings)?
>> - or does something else happen?
>>
>> The problems I am having are primarily when outputting XML documents (or HTML) to send to the browser. e.g.
>>
>>     my $OUTPUT="";
>>     $OUTPUT=$xmldoc->transformNode($xsltsheet);
>>     print $OUTPUT;
>
> All string data is converted to the current system codepage by Win32::OLE before being passed back to Perl _unless_ you switch Win32::OLE to Unicode mode first:
>
>     Win32::OLE->Option(CP => Win32::OLE::CP_UTF8());
>
> After this call all strings are converted to UTF8 and marked as such in the Perl internal flags.
>
> Cheers,
> -Jan

Message Scanned by ClamAV on datalucid.com

--
Datalucid Limited
8 Eileen Road
South Norwood
London SE25 5EJ
tel :+44-0208-239-6810
email: [EMAIL PROTECTED]
web: http://www.datalucid.com

___
Perl-Win32-Users mailing list
Perl-Win32-Users@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
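The "downgrade if possible" step described in this message can be illustrated without Win32::OLE at all. The following is only a sketch using core Perl: try_downgrade is a hypothetical helper name, cp1251 is an arbitrary example of a non-1252 code page, and the character string is built by hand rather than returned from COM.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: mimic the "downgrade if possible" step on a copy.
# Returns the (possibly downgraded) copy and a 1/0 success flag.
sub try_downgrade {
    my ($string) = @_;
    my $copy = $string;
    # Second argument true: return false instead of dying when the
    # string contains characters above U+00FF.
    my $ok = utf8::downgrade($copy, 1);
    return ($copy, $ok ? 1 : 0);
}

# "cafe" with e-acute, as a character string with the UTF-8 flag on.
my $cafe = decode('UTF-8', "caf\xC3\xA9");
my ($bytes, $ok) = try_downgrade($cafe);

# After a successful downgrade the e-acute is one Latin-1 byte.
printf "downgraded=%d last byte=0x%02X\n", $ok, ord(substr($bytes, -1));

# Anything that reads that byte under another code page sees a
# different character; cp1251 is used here purely as an example.
my $misread = decode('cp1251', $bytes);
printf "read as cp1251: U+%04X\n", ord(substr($misread, -1));

# A character outside Latin-1 (the euro sign) cannot be downgraded,
# so such a string keeps its UTF-8 flag.
my (undef, $euro_ok) = try_downgrade("\x{20AC}");
printf "euro downgraded=%d\n", $euro_ok;
```

Inside Perl the downgraded and original strings still compare equal; the distortion only shows up once the bytes leave Perl and are interpreted under a code page other than Latin-1/1252.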
Re: Win32::OLE - ? encoding of returned strings
Thanks Jan - that's very clear and exactly what I needed to know.

I've got most of my conversions working - but there are a few exceptions, and I think your explanation has given me enough information to track them down and handle them. (I'm not using the Win32::OLE UTF8 forcing - and am getting some spurious character conversions.)

Mike
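For the missing encoding declaration discussed in this thread, one possible approach is to stop relying on whatever the transformation emits and to encode explicitly on output. The sketch below assumes the XML is already held as a Perl character string (for example, obtained with Win32::OLE in CP_UTF8 mode); xml_to_utf8_bytes is a hypothetical helper, and the sample document is a hand-written stand-in for the result of $xmldoc->transformNode($xsltsheet).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: take XML held as a Perl character string, make the
# declaration say UTF-8, and return UTF-8 encoded bytes ready for output.
sub xml_to_utf8_bytes {
    my ($xml) = @_;
    # Drop any existing XML declaration, whatever encoding it names.
    $xml =~ s/\A\s*<\?xml\b[^?]*\?>\s*//;
    # Prepend a declaration that matches the bytes produced below.
    $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n} . $xml;
    return encode('UTF-8', $xml);
}

# Stand-in for the transformation result: no encoding in the declaration,
# one Latin-1 character (e-acute) and one non-Latin-1 character (euro sign).
my $output = qq{<?xml version='1.0'?><p>caf\x{E9} \x{20AC}5</p>};

binmode(STDOUT);    # write the bytes as-is, with no further translation
print xml_to_utf8_bytes($output), "\n";
```

Because encode() works on characters, it gives the same bytes whether or not the string was downgraded beforehand. It would not repair a string that had already been converted to the system code page, so this only applies when the Unicode mode is switched on first.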