Dan Kogai <[EMAIL PROTECTED]> writes:
>On Monday, April 15, 2002, at 07:29 , Nick Ing-Simmons wrote:
>> I tracked down the "problem" tkmail was/is having with iso-2022-jp.
>> The snag is I am using the API the way I designed it, not the way
>> it is reliably implemented.
>>
>> When called thus:
>>
>> my $decoded = $enc->decode($encoded,1);
>>
>> decode is supposed to return the portion it can decode, and set $encoded
>> to what remains.
>
>Ah, I see.  But it is a pain in the arse for "doubly-encoded" encodings
>like ISO-2022-JP.
>
>Here is the problem.  As you see, to decode ISO-2022-JP, we first have
>to decode it into EUC-JP.  And ISO-2022-JP -> EUC-JP is treated (and
>should be treated) purely as a CES so there is no chance for error
>(unless there is a bogus escape sequence).  However, errors may arise
>when you try to convert the resulting EUC-JP stream to UTF-8.
>
>The problem is that not all of the possible code points in JIS X 0208
>and JIS X 0212 (94x94 = 8836 each) are actually used.  Only 6884 are
>used in 0208 and 6072 are used in 0212, so the remainder won't map to
>Unicode.
>
>It was possible to use jis02*-raw instead of EUC-JP, but that
>implementation was too slow because you have to invoke encode() chunk by
>chunk.  In fact I tried it and it was 3 times as slow.
>
>And the sense of "what remains" becomes moot when it comes to
>ISO-2022.  Suppose you got a string like this:
>
>abcd<ESC-to-jis0208>cdefghijklmn<ESC-to-ascii>opqrstu....
>                         ^^error occurs here.
>
>What's the remaining stream?
>
>ghijklmn<ESC-to-ascii>opqrstu....

It does not matter for that case.
"Does not map" is a fatal error with $chk true (and would have
become a replacement char if $chk were false).
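
To make the two behaviours concrete, here is a minimal sketch. It assumes
a present-day Encode, where the check argument is one of the FB_*
constants rather than a plain true/false $chk, and it uses 'ascii' with a
stray high byte as a simple stand-in for a JIS code point that does not
map. The variable names are mine, for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "ab\xE9cd";    # 0xE9 has no meaning in US-ASCII

# No check argument: the offending byte is replaced with U+FFFD.
my $lenient = decode('ascii', $bytes);

# FB_CROAK: the same input is a fatal error instead.
# decode() may modify its argument when a check is given, so use a copy.
my $copy   = $bytes;
my $strict = eval { decode('ascii', $copy, Encode::FB_CROAK()) };
my $error  = $@;

printf "lenient: %vX\n", $lenient;
print  "strict failed: $error" if $error;
```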

What matters is being able to tell the complete case from the partial case.

 A. When you have converted the whole thing, set "remains" to ''.
 B. When you have a partial encoding, consume as much as you can
    and leave "remains" holding the partial part.

e.g.

abcd<ESC-to-jis0208>cdefghijklmn<ESC-to  -ascii>opqrstu....
                                       ^- buffer boundary

Then you return the translation of
"abcd<ESC-to-jis0208>cdefghijklmn"
and set "remains" to "<ESC-to"
so that :encoding can append "-ascii>opqrstu...." to it.

If you cannot do that, then don't return or consume anything,
so :encoding can keep appending until you have the whole file, but that
is going to be very memory hungry.
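
As a rough sketch of the caller's side of that contract: the loop below
feeds fixed-size chunks to decode() and carries the unconsumed tail over
to the next call. It assumes a present-day Encode with Encode::FB_QUIET
as the check value, and it uses UTF-8 as a stand-in, since whether the
ISO-2022-JP encoder handles a split escape sequence this way is exactly
the open question here. decode_in_chunks is a hypothetical helper name,
not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode $bytes in chunks of $chunk_size, as a buffered reader would.
# Whatever decode() cannot consume stays in $pending and is prepended
# to the next chunk.
sub decode_in_chunks {
    my ($enc_name, $bytes, $chunk_size) = @_;
    my ($pending, $out) = ('', '');
    for (my $pos = 0; $pos < length $bytes; $pos += $chunk_size) {
        $pending .= substr($bytes, $pos, $chunk_size);
        # FB_QUIET: return the part that decoded, leave the rest in $pending.
        $out .= decode($enc_name, $pending, Encode::FB_QUIET());
    }
    return ($out, $pending);    # $pending is '' in the complete case
}

my $text  = "abcd\x{65e5}\x{672c}\x{8a9e}opq";
my $bytes = encode('UTF-8', $text);

# A chunk size of 5 splits the first three-byte character across two chunks.
my ($decoded, $remains) = decode_in_chunks('UTF-8', $bytes, 5);

print "round trip ok\n" if $decoded eq $text && $remains eq '';
```

Note that FB_QUIET also stops at genuinely malformed input, so a real
caller would still need to tell "partial" from "broken" once the input
is exhausted and "remains" is not empty.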

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/


