RE: Win32::OLE - ? encoding of returned strings

2007-05-01 Thread Jan Dubois
When Win32::OLE uses CP_UTF8, it converts the UCS-2 data returned from COM
(the BSTR datatype) into UTF-8.  There should be no loss of information in
this step, as everything in UCS-2 can also be represented in UTF-8.  The
conversion is done by a Windows API and not by Perl internals.
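
As a rough illustration of why this step is lossless, here is a minimal
sketch in Perl. It uses the core Encode module as a stand-in for the
Windows API (Win32::OLE itself does not go through Encode), and the
sample string and variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for a BSTR: the UTF-16LE bytes of a short string
# containing e-acute (U+00E9) and the euro sign (U+20AC).
my $original   = "caf\x{e9} \x{20ac}";
my $bstr_bytes = encode('UTF-16LE', $original);

# Step 1: read the 16-bit code units back as characters.
my $chars = decode('UTF-16LE', $bstr_bytes);

# Step 2: serialise the same characters as UTF-8.
my $utf8_bytes = encode('UTF-8', $chars);

# Decoding the UTF-8 again yields the original characters, so nothing
# was lost on the way.
my $round_trip = decode('UTF-8', $utf8_bytes);

printf "UTF-16LE: %d bytes, UTF-8: %d bytes, lossless: %s\n",
    length($bstr_bytes), length($utf8_bytes),
    ($round_trip eq $original ? 'yes' : 'no');
```

Only the byte count changes between the two encodings; the characters
are the same.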

 

Win32::OLE then asks Perl to downgrade to a non-UTF-8 string if possible,
because a lot of modules don't handle UTF-8 strings correctly.  This step is
error-prone, as Perl always assumes that byte strings are Latin-1 encoded and
does not take the Windows code page into account.  If *all* characters in the
returned string are in Latin-1 and your Windows codepage is *not* 1252, then
you may get distorted results.
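
The effect can be reproduced without Win32::OLE. The sketch below is an
illustration under assumptions, not the module's actual code: it
downgrades a Latin-1-only string with utf8::downgrade and then shows
how the resulting byte reads under two different code pages. cp1251 is
picked only as an example of a non-1252 code page.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Every character here fits in Latin-1 ("\x{e9}" is e-acute, U+00E9).
my $str = "r\x{e9}sum\x{e9}";
utf8::upgrade($str);    # internal UTF-8 form, standing in for the converted string

# Ask Perl to downgrade. The second argument (FAIL_OK) makes the call
# return false instead of dying when a character needs more than one byte.
my $downgraded = utf8::downgrade($str, 1);

# The e-acute is now the single byte 0xE9. How that byte is read
# depends on the code page the consumer assumes.
my $bytes     = $str;
my $as_cp1252 = decode('cp1252', $bytes);
my $as_cp1251 = decode('cp1251', $bytes);

printf "downgraded: %s\n", $downgraded ? 'yes' : 'no';
printf "cp1252 reads byte 0xE9 as U+%04X\n", ord(substr($as_cp1252, 1, 1));
printf "cp1251 reads byte 0xE9 as U+%04X\n", ord(substr($as_cp1251, 1, 1));

# A character outside Latin-1 cannot be downgraded, so such a string
# keeps its UTF-8 form.
my $euro            = "\x{20ac}";
my $euro_downgraded = utf8::downgrade($euro, 1);
printf "euro downgraded: %s\n", $euro_downgraded ? 'yes' : 'no';
```

This is why only strings made up entirely of Latin-1 characters are
affected: anything with a character above U+00FF stays in UTF-8.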

 

I guess I should put some support into Win32::OLE to force it to leave the
string in UTF-8.  But I think this should be done in the wider context of
teaching Perl to deal with Windows code pages better.

 

Cheers,

-Jan

 

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Mike Trotman
Sent: April 29, 2007 2:36 PM
To: perl-win32-users@listserv.ActiveState.com
Subject: Re: Win32::OLE - ? encoding of returned strings

 

Thanks - that's very useful.

And I had missed the Unicode mode in the Win32::OLE documentation.

To clarify:

Part of my problem is that the MSXML XSLT transformations do NOT output the
character encoding in the XML declaration.
So $xmldoc->transformNode($xsltsheet) may output Windows-1252 encoded data,
but the result does not contain encoding='Windows-1252' in the
<?xml version='1.0'?>
declaration, and so other parts of the process expect it to contain UTF-8.

If I turn on the Win32::OLE UTF8 code page for a Win32::OLE call to MSXML that
returns a UTF-16 string (or Windows-1252)
- is the automatic conversion to UTF8 likely to be correct - or incorrect?
i.e. does the automatic conversion correctly detect the CP of the OLE output?

Thanks

Jan Dubois wrote: 

On Fri, 27 Apr 2007, Mike Trotman wrote:
  

I am writing a CGI application that uses WIN32::OLE to interface to
Microsoft ADO, MS Access, SQL Server and MSXML for XML documents and
XSLT transformations.
 
I suspect that something in the way I am passing data around (or in
the ADO implementation of 'savetoxml') is not dealing correctly with
XML document encoding declarations. The data is originally in an MS
Access database - but has been entered using copy and paste from MS
Word documents from around the world - so contains many weird and
wonderful bytes.
 
To help in my debugging process can anyone tell me how WIN32::OLE
deals with 'strings' returned from method calls? i.e.
- are they pure byte data as output by the method (and maybe in UTF-16)?
- or are they converted to Perl's internal format (using any current
  Perl encoding settings)?
- or does something else happen?
 
The problems I am having are primarily when outputting XML documents
(or HTML) to send to the browser.
e.g. my $OUTPUT=""; $OUTPUT=$xmldoc->transformNode($xsltsheet);
 
print $OUTPUT;


 
All string data is converted to the current system codepage by
Win32::OLE before being passed back to Perl _unless_ you switch
Win32::OLE to Unicode mode first:
 
  Win32::OLE->Option(CP => Win32::OLE::CP_UTF8());
 
After this call all strings are converted to UTF8 and marked as
such in the Perl internal flags.
 
Cheers,
-Jan
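
For the CGI case discussed in this thread, one possible way to use the
Option call quoted above is sketched here. This is an assumption-laden
illustration, not a tested recipe: the Win32::OLE lines are left as
comments so the sketch runs anywhere, $output is a made-up stand-in for
the transformNode result, and declare_utf8 is a hypothetical helper
that is not part of Win32::OLE or MSXML.

```perl
use strict;
use warnings;
use Encode qw(encode);

# On Windows the string would come from MSXML, roughly like this
# (calls taken from the messages in this thread):
#   use Win32::OLE;
#   Win32::OLE->Option(CP => Win32::OLE::CP_UTF8());
#   my $output = $xmldoc->transformNode($xsltsheet);
# Stand-in value so the sketch runs without Windows:
my $output = "<?xml version='1.0'?><p>caf\x{e9} \x{2013} \x{20ac}5</p>";

# Hypothetical helper: make the XML declaration state the encoding that
# will actually be sent. Simplification: any existing declaration is
# replaced wholesale, so other attributes in it are not preserved.
sub declare_utf8 {
    my ($xml) = @_;
    $xml =~ s/\A<\?xml[^>]*\?>\s*//;
    return qq{<?xml version="1.0" encoding="UTF-8"?>\n} . $xml;
}

# Encode explicitly, so the bytes sent match the declared encoding
# whether or not Perl kept the string in its internal UTF-8 form.
my $bytes = encode('UTF-8', declare_utf8($output));

binmode STDOUT;    # send the bytes unchanged
print "Content-Type: text/xml; charset=utf-8\r\n\r\n";
print $bytes;
```

The point of the explicit encode step is that the declaration, the HTTP
header and the bytes all agree on UTF-8.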
 
 
Message Scanned by ClamAV on datalucid.com
 
  





-- 
Datalucid Limited
8 Eileen Road
South Norwood
London SE25 5EJ
 
tel :+44-0208-239-6810
 
email: [EMAIL PROTECTED]
web: http://www.datalucid.com

___
Perl-Win32-Users mailing list
Perl-Win32-Users@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs


Re: Win32::OLE - ? encoding of returned strings

2007-05-01 Thread Mike Trotman




Thanks Jan - that's very clear and exactly what I needed to know.

I've got most of my conversions working - but there are a few
exceptions and I think your explanation has given me enough information
to track them down and handle them.
(I'm not using the Win32::OLE UTF8 forcing - and am getting some
spurious character conversions.)

Mike


