Re: iso-2022-jp problem

Dan Kogai Mon, 15 Apr 2002 08:07:27 -0700

On Tuesday, April 16, 2002, at 12:00 , Nick Ing-Simmons wrote:
>> abcd<ESC-to-jis0208>cdefghijklmn<ESC-to-ascii>opqrstu....
>>                         ^^error occurs here.
>>
>> What's the remaining stream?
>>
>> ghijklmn<ESC-to-ascii>opqrstu....
>
> Does not matter for that case.
> "does not map" is a fatal error with $chk true (and would have
> become a replacement char if $chk was false).
>
> What matters is being able to tell the complete case, from partial case.
>
>  A. When you have converted whole thing set remains to ''.
>  B. When you have a partial encoding consume as much as you can
>     and leave "string" with what is partial.
>
> e.g.
>
> abcd<ESC-to-jis0208>cdefghijklmn<ESC-to  -ascii>opqrstu....
>                                        ^- buffer boundary
>
> Then you return translation of
> "abcd<ESC-to-jis0208>cdefghijklmn"
> and set "remains" to "<Esc-to"
> so that :encoding can append "-ascii>opqrstu....


One of many reasons that programmers dislike 7bit ISO-2022 is exactly 
how to handle case B -- how to split the buffer in the middle.  When 
handling 7bit ISO-2022, YOU ARE NOT SUPPOSED TO SPLIT THE BUFFER BY 
LENGTH.  Of course that causes a problem for large files and, even 
worse, network streams.  But fortunately, 7bit ISO-2022 has one safety 
net for that problem:  IT ALWAYS REVERTS TO ASCII BEFORE CONTROL 
CHARACTERS, including CRLF.  So if you need to, you can safely split 
the buffer line by line.  A script

use Encode;
binmode(STDOUT, ":utf8");
while(<>){
        # each line starts and ends in ASCII, so it decodes on its own
        print Encode::decode("iso-2022-jp", $_);
}

is completely safe because $_ is guaranteed to start in ASCII and end in 
ASCII.
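
If the input arrives in fixed-size chunks (a socket read, say) rather 
than as lines, the same idea still applies: decode only up to the last 
newline and carry the rest over to the next read.  The following is a 
rough sketch, not part of Encode; make_line_decoder is a hypothetical 
helper name, and it assumes the stream really does return to ASCII 
before every newline as described above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): returns a closure that
# accepts raw octet chunks of any size and decodes only whole lines.
# Assumption: the encoding is back in ASCII at every newline.
sub make_line_decoder {
    my ($encoding) = @_;
    my $pending = '';    # octets after the last newline seen so far

    return sub {
        my ($chunk) = @_;            # pass undef to signal end of input
        my $final = !defined $chunk;
        $pending .= $chunk unless $final;

        # Cut just after the last newline; at end of input take it all.
        my $cut = $final ? length($pending) : rindex($pending, "\n") + 1;
        return '' if $cut == 0;      # no complete line yet

        # 4-arg substr removes the complete lines from $pending.
        my $complete = substr($pending, 0, $cut, '');
        return decode($encoding, $complete);
    };
}
```

Call the closure once per chunk and once more with undef at end of 
input, so that a last line without a trailing newline is not lost.  
Memory use is bounded by the longest line, not by the file size.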

Check RFC 1468 (http://www.ietf.org/rfc/rfc1468.txt) and others.  It is 
not as complicated as it sounds.

> If you cannot do that then don't return or consume anything
> so :encoding can keep appending till you have whole file but that
> is going to be very memory hungry.

As I said, "if you are worried about memory, just use a line buffer" is 
the answer.

Other encodings are subject to this boundary problem -- and solution.  
Arabic and Hebrew (BIDI boundary), Thai (word boundary), Hangul (for 
decomposed form), you name it.  But very fortunately, the legacy 
encodings for all of these are designed so that you can rely on CRLF 
to split the stream.

Dan the Encode Maintainer.
