Anton Shcherbinin <[EMAIL PROTECTED]> writes:
>Encode::from_to($string, SOURCE, TARGET) changes all characters which
>are missing in TARGET into '?' chars (ok, to be exact <subchar>s). This
>is probably the most reasonable *default* behavior. But I could give
>a couple of arguments why other behavior (not to change those chars
>missing in target encoding) is also reasonable and sometimes much more
>reasonable.
>
>My native language, Russian, suffers from having FIVE one-byte encodings
>(windows-1251, koi8-r, iso-8859-5, cp866, "MacCyrillic") which are used
>everywhere alternately more or less often. Conversions from one
>encoding to another are very frequent, and sometimes we just have to make
>the reverse conversion.
We get the same "problem" in English with Windows "smart quotes"
and other MSWord-isms being sent out as supposedly iso-8859-1 when
they really meant windows-1252 or whatever.
The problem with just retaining the original is that, unless one encoding
is a strict superset of the other and the code points are the same for
the same characters, the meaning may be corrupted. You are in general
better off "leaving" it in a superset encoding or UTF-8! If you don't
like the '?', there are fallback schemes that put \x{uuuu} or HTML escapes
in its place, which at least give the reader a hint as to what was there.
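A minimal sketch of those fallback schemes, assuming a reasonably recent Encode in which the check constants are spelled Encode::FB_PERLQQ and Encode::FB_HTMLCREF. The sample string and the iso-8859-1 target are only illustrative.

```perl
use strict;
use warnings;
use Encode;   # exports encode(); the FB_* constants are reached as Encode::FB_*

# Internal-form string: e-acute exists in iso-8859-1, the curly quotes do not.
my $text = "caf\x{e9} \x{201c}hi\x{201d}";

# FB_PERLQQ writes each unmappable character as a \x{HHHH} escape.
my $perlqq = encode("iso-8859-1", $text, Encode::FB_PERLQQ);

# FB_HTMLCREF writes each unmappable character as a decimal HTML reference.
my $html = encode("iso-8859-1", $text, Encode::FB_HTMLCREF);

print "$perlqq\n";
print "$html\n";
```

Either way the reader sees which code point was lost instead of a bare '?'.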
>
>MY QUESTION IS: how can I convert text from one one-byte encoding to
>another while leaving unchanged (rather than turning into '?') the
>characters missing in the target encoding?
There is no built-in way to do it directly. And from_to is particularly
problematic, as the options arg is applied to both the decode and the
re-encode steps, whereas you only want to special-case the re-encode.
You can do it via the internal form, something like this:
use Encode;   # exports find_encoding

sub sloppy_from_to
{
    my ($src,$SOURCE,$TARGET) = @_;
    my $from = find_encoding($SOURCE);
    my $to   = find_encoding($TARGET);
    my $dest = '';
    # Assume all of $src is representable in internal form
    my $uni = $from->decode($src);
    while (length($uni))
    {
        # FB_QUIET stops at the first unmappable character, returns what
        # was converted so far and leaves the unconverted rest in $uni
        $dest .= $to->encode($uni, Encode::FB_QUIET);
        if (length($uni))
        {
            # Not all converted...
            # some ad hoc scheme to "copy" the non-representable char,
            # e.g. chop off 1 char, re-encode it in the SOURCE encoding
            # and append that
            $dest .= $from->encode(substr($uni,0,1,''));
        }
    }
    return $dest;
}
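To see what the routine does on real bytes, here is a self-contained sketch. It repeats the loop in compact form so that it runs on its own, and assumes Encode::FB_QUIET as the "return on error" check value. The three-byte koi8-r sample is illustrative: 0xC1 and 0xC2 are Cyrillic letters that iso-8859-5 also has, while 0x80 is a box-drawing character that iso-8859-5 lacks.

```perl
use strict;
use warnings;
use Encode;

sub sloppy_from_to {
    my ($src, $SOURCE, $TARGET) = @_;
    my $from = find_encoding($SOURCE) or die "unknown encoding $SOURCE";
    my $to   = find_encoding($TARGET) or die "unknown encoding $TARGET";
    my $dest = '';
    my $uni  = $from->decode($src);
    while (length $uni) {
        # Convert as much as possible; the unconverted rest stays in $uni.
        $dest .= $to->encode($uni, Encode::FB_QUIET);
        # Copy the offending character through as its original SOURCE byte.
        $dest .= $from->encode(substr($uni, 0, 1, '')) if length $uni;
    }
    return $dest;
}

# koi8-r bytes: Cyrillic a, box-drawing light horizontal, Cyrillic be
my $in  = "\xC1\x80\xC2";
my $out = sloppy_from_to($in, "koi8-r", "iso-8859-5");

# The letters are remapped, the box-drawing byte passes through untouched.
print unpack("H*", $out), "\n";   # prints d080d1
```

Note that the copied byte 0x80 now means something else in iso-8859-5 (a C1 control), which is exactly the corruption risk described earlier.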
>
>I did try to find it out myself. At some point I thought that
>from_to($string, SRCenc, TGTenc, ENCODE_LEAVE_SRC)
>is just what I wanted, because it LEAVEs those chars in SRC that
>ENCODE_NOREP... but unfortunately no, it leaves all source string
>untouched unconditionally.
>
>Thanks in advance for any clues.
>
>If my English and/or my question is far from clear, please tell me and
>I'll do my best to rewrite it in other words.
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/