Re: Can from_to($s, SRC, TGT) leave chars missing in TGT unchanged?

Nick Ing-Simmons Fri, 11 Oct 2002 01:11:58 -0700

Anton Shcherbinin <[EMAIL PROTECTED]> writes:
>Encode::from_to($string, SOURCE, TARGET) changes all characters which
>are missing in TARGET into '?' chars (ok, to be exact <subchar>s). This
>is probably the most reasonable *default* behavior. But I could give
>a couple of arguments why other behavior (not to change those chars
>missing in target encoding) is also reasonable and sometimes much more
>reasonable.
>
>My native language, Russian, suffers from having FIVE one-byte encodings
>(windows-1251, koi8-r, iso-8859-5, cp866, "MacCyrillic") which are used
>everywhere interchangeably, some more often than others. Conversions
>from one encoding to another are very frequent, and sometimes we just
>have to make the reverse conversion.


We get the same "problem" in English with Windows "smart quotes"
and other MSWord-isms being sent out as supposedly iso-8859-1 when
they really meant windows-1252 or whatever.

The problem with just retaining the original is that, unless one encoding
is a strict superset of the other and code points are the same for
the same characters, the meaning may be corrupted. You are in general
better off leaving the text in the superset encoding or in UTF-8! If you
don't like the '?' there are fallback schemes to put \x{uuuu} or HTML
escapes, which at least give the reader a hint as to what was there.
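A minimal sketch of those fallback schemes, assuming the FB_PERLQQ,
FB_HTMLCREF and FB_XMLCREF constants as described in the Encode
documentation. The helper name encode_with and the sample text are made up
for illustration; the sample uses a "smart" apostrophe (U+2019), which
iso-8859-1 cannot represent.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+2019 RIGHT SINGLE QUOTATION MARK has no mapping in iso-8859-1.
my $text = "don\x{2019}t";

# Hypothetical helper: encode a copy, because encode() with a CHECK value
# may modify its source argument in place.
sub encode_with {
    my ($check) = @_;
    my $copy = $text;
    return encode('iso-8859-1', $copy, $check);
}

my $perlqq = encode_with(Encode::FB_PERLQQ());    # Perl-style \x{...} escape
my $html   = encode_with(Encode::FB_HTMLCREF());  # decimal character reference
my $xml    = encode_with(Encode::FB_XMLCREF());   # hexadecimal character reference

print "$perlqq\n$html\n$xml\n";
```

Each result is plain ASCII here, so the reader can still work out which
character was in the original text.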

>
>MY QUESTION IS: how can I convert text from one one-byte encoding to
>another, leaving characters missing in the target encoding unchanged
>instead of changing them into '?'?

There is no built-in way to do it directly. And from_to is particularly
problematic, as the options arg is applied to both the decode and the
re-encode steps, whereas you only want to special-case the re-encode.
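To illustrate the point about the two steps, here is a rough sketch that
calls decode and encode separately so that each gets its own CHECK value.
The sample octets and the choice of FB_CROAK for decoding and FB_PERLQQ for
encoding are assumptions made for the example, not a recommendation from
the thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# KOI8-R octets: CYRILLIC SMALL LETTER A (0xC1), then a box-drawing
# character (0x80) that iso-8859-5 does not contain.
my $octets = "\xC1\x80";

# Step 1: decode strictly. Work on a copy, since decode() with a CHECK
# value may consume its source argument.
my $in    = $octets;
my $chars = decode('koi8-r', $in, Encode::FB_CROAK());

# Step 2: re-encode leniently. Only this step gets the escape fallback.
my $out = encode('iso-8859-5', $chars, Encode::FB_PERLQQ());

# Show non-printable octets as <HH> so the result is readable on a terminal.
(my $shown = $out) =~ s/([^\x20-\x7E])/sprintf '<%02X>', ord $1/ge;
print "$shown\n";
```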

You can do it via internal form something like this:

use Encode qw(find_encoding);

sub sloppy_from_to
{
     my ($src,$SOURCE,$TARGET) = @_;
     my $from = find_encoding($SOURCE) or die "Unknown encoding $SOURCE";
     my $to   = find_encoding($TARGET) or die "Unknown encoding $TARGET";
     my $dest = '';
     # Assume all of $src is representable in internal form
     my $uni = $from->decode($src);
     while (length($uni))
      {
       # FB_QUIET (RETURN_ON_ERR) stops at the first unrepresentable char,
       # returns what was converted so far and leaves the rest in $uni
       $dest .= $to->encode($uni,Encode::FB_QUIET);
       if (length($uni))
        {
         # Not all converted...
         # some ad hoc scheme to "copy" the non-representable char
         # e.g. chop off 1 char, re-encode it in the source encoding
         # and append that
         $dest .= $from->encode(substr($uni,0,1,''));
        }
      }
     return $dest;
}
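As an alternative to the loop above, later Encode releases document that
CHECK may also be a code reference, called with the ordinal of each
unmappable character (the documentation mentions version 2.12). A sketch
under that assumption follows; the name keep_unmapped_from_to is
hypothetical and not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper, not part of Encode.
sub keep_unmapped_from_to {
    my ($octets, $source, $target) = @_;
    my $from = find_encoding($source) or die "Unknown encoding: $source";
    my $to   = find_encoding($target) or die "Unknown encoding: $target";

    # Assumes every source octet decodes to a real character.
    my $chars = $from->decode($octets);

    # The callback receives the ordinal of a character that $to cannot
    # represent and returns the octets to emit instead: here, that
    # character's octets in the source encoding.
    return $to->encode($chars, sub { $from->encode(chr $_[0]) });
}

# KOI8-R: small letter A, a box-drawing character, small letter A.
my $out = keep_unmapped_from_to("\xC1\x80\xC1", 'koi8-r', 'iso-8859-5');
print unpack('H*', $out), "\n";
```

The same caveat applies as for any "leave it unchanged" scheme: the copied
octet means something different in the target encoding, so the output is
only safe if the conversion will later be reversed.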

>
>I did try to find it out myself. At some point I thought that
>from_to($string, SRCenc, TGTenc, ENCODE_LEAVE_SRC)
>is just what I wanted, because it LEAVEs those chars in SRC that
>ENCODE_NOREP... but unfortunately no, it leaves all source string
>untouched unconditionally.
>
>Thanks in advance for any clues.
>
>If my English and/or my question is far from clear, please tell me and
>I'll do my best to rewrite it in other words.
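On the LEAVE_SRC point: as the Encode documentation describes it, that bit
only controls whether the source argument is overwritten with the
unprocessed remainder; it has no say over what is written for unmappable
characters. A small sketch, with made-up sample text and variable names:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my $ascii = find_encoding('ascii');

# Without LEAVE_SRC, a non-zero CHECK consumes the source in place:
# FB_QUIET stops at U+263A and leaves the unprocessed tail in $consumed.
my $consumed = "ab\x{263A}cd";
my $head1 = $ascii->encode($consumed, Encode::FB_QUIET());

# With LEAVE_SRC OR-ed in, the converted prefix is the same but the
# source string is kept as it was.
my $kept = "ab\x{263A}cd";
my $head2 = $ascii->encode($kept, Encode::FB_QUIET() | Encode::LEAVE_SRC());

printf "%s, %d chars left in source\n", $head1, length $consumed;
printf "%s, %d chars left in source\n", $head2, length $kept;
```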
-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/
