Re: iso-2022-jp problem

Nick Ing-Simmons Mon, 15 Apr 2002 08:50:00 -0700

Dan Kogai <[EMAIL PROTECTED]> writes:
>>  B. When you have a partial encoding consume as much as you can
>>     and leave "string" with what is partial.
>>
>> e.g.
>>
>> abcd<ESC-to-jis0208>cdefghijklmn<ESC-to  -ascii>opqrstu....
>>                                        ^- buffer boundary
>>
>> Then you return translation of
>> "abcd<ESC-to-jis0208>cdefghijklmn"
>> and set "remains" to "<ESC-to"
>> so that :encoding can append "-ascii>opqrstu...."
>
>One of many reasons that programmers dislike 7bit ISO-2022 is exactly
>how to handle case B -- how to split the buffer in the middle.  When
>handling 7bit ISO-2022, YOU ARE NOT SUPPOSED TO SPLIT THE BUFFER BY
>LENGTH.


But perlio.c, and so :encoding, does buffer by size right now.
So the question is how to proceed from where we are.
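
As a rough illustration of case B quoted above, a size-buffered layer could
hold back a trailing, incomplete escape sequence before handing the chunk to
the decoder. This is only a sketch: split_partial_escape is a hypothetical
helper, not part of Encode or perlio.c, and it assumes the three-byte escapes
listed in RFC 1468. It does not track which character set is active, so the
decoder would still have to carry that state from one chunk to the next.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Encode or PerlIO.
# Returns (complete, remains): "remains" is a trailing escape sequence
# that has been cut off by the buffer boundary, or '' if there is none.
sub split_partial_escape {
    my ($buf) = @_;
    # RFC 1468 escapes are three bytes: ESC ( B, ESC ( J, ESC $ @, ESC $ B.
    # A cut-off one is therefore ESC alone, or ESC followed by "(" or "$".
    if ($buf =~ /(\e[\$(]?)\z/) {
        my $remains = $1;
        my $complete = substr($buf, 0, length($buf) - length($remains));
        return ($complete, $remains);
    }
    return ($buf, '');
}

# The chunk ends in the middle of the escape back to ASCII.
my ($complete, $remains) = split_partial_escape("abcd\e\$Bxxyy\e(");
# $complete is "abcd\e\$Bxxyy" and $remains is "\e(", so the next
# chunk can be appended to $remains before decoding continues.
```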

>Of course that causes the problem for large files and even
>worse, network streams.  

Which is what caused the problem to come to light.

>But fortunately, 7bit ISO-2022 has one safety
>net for that solution;  IT ALWAYS REVERTS TO ASCII BEFORE CONTROL
>CHARACTERS, including CRLF.  So if you need it you can safely split
>buffer line by line.  A script
>
>use Encode;
>binmode(STDOUT, ":utf8");
>while(<>){
>       # each line starts and ends in ASCII, so it decodes on its own
>       print Encode::decode("iso-2022-jp", $_);
>}
>
>is completely safe because $_ is guaranteed to start in ASCII and end in
>ASCII.

So what needs to happen is that :encoding(iso-2022-jp) needs to push 
a layer underneath itself which makes sure a whole "line" is available.
You still need to clear the string to let me know that all is well.
Or we can mess with the decoder to "fake" it as you suggested.
(Tcl manages it ...)
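
A minimal sketch of what such a line-assembling step could look like, written
in plain Perl rather than as a real PerlIO layer. take_whole_lines is a
hypothetical helper; it only releases complete lines to the decoder and keeps
any trailing partial line for the next chunk. The sample bytes assume the
JIS X 0208 codes 0x2422 and 0x2424.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: append a chunk to the pending buffer and return
# only the complete lines, leaving any partial line in $$pending_ref.
sub take_whole_lines {
    my ($pending_ref, $chunk) = @_;
    $$pending_ref .= $chunk;
    my @lines;
    while ($$pending_ref =~ s/\A([^\n]*\n)//) {
        push @lines, $1;
    }
    return @lines;
}

my $pending = '';
my @decoded;
# Two size-limited chunks; the escape back to ASCII straddles the boundary.
for my $chunk ("abcd\e\$B\x24\x22\x24\x24\e(", "Bopq\nrest") {
    push @decoded,
        map { decode('iso-2022-jp', $_) } take_whole_lines(\$pending, $chunk);
}
# One whole line has been decoded; $pending still holds "rest",
# which has not been terminated by a newline yet.
```

Because each released line starts and ends in ASCII, the decoder never sees a
split escape sequence, at the cost of buffering up to one full line.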

>
>Check RFC 1468  (http://www.ietf.org/rfc/rfc1468.txt and others).  It is
>not as complicated as it sounds.

I read that back before someone else implemented our first escape encoding.

>
>> If you cannot do that then don't return or consume anything
>> so :encoding can keep appending till you have whole file but that
>> is going to be very memory hungry.
>
>As I said, "if you are worried about memory, just use line buffer" is
>the answer.

So we need some way of telling from an encoding object (e.g.
an attribute or a method call) that it needs line buffering
so that :encoding layer can take the appropriate steps.
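
One way this could look, sketched with stand-in classes. The My::Enc::*
packages and buffering_mode_for are hypothetical and are not the real Encode
classes or the :encoding layer code; the needs_lines method name matches the
one described in the Encode::Encoding documentation of later Encode releases,
but everything else here is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for encoding objects.
package My::Enc::Base;
sub new         { my ($class) = @_; return bless {}, $class }
sub needs_lines { 0 }    # default: splitting the buffer by size is fine

package My::Enc::Escape7bit;
our @ISA = ('My::Enc::Base');
sub needs_lines { 1 }    # stateful escape encoding: wants whole lines

package main;

# What a layer such as :encoding might ask when it is pushed.
# An object without the method is treated as not needing lines.
sub buffering_mode_for {
    my ($enc) = @_;
    return ($enc->can('needs_lines') && $enc->needs_lines) ? 'line' : 'size';
}

my $mode_plain  = buffering_mode_for(My::Enc::Base->new);
my $mode_escape = buffering_mode_for(My::Enc::Escape7bit->new);
```

A method call keeps the decision with the encoding itself, so the layer does
not need a hard-coded list of encodings that require line buffering.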

>
>Other encodings are subject to this boundary problem -- and solution.
>Arabic and Hebrew (BIDI boundary), Thai (word boundary), Hangul (for
>decomposed form), you name it.  But very fortunately for all these,
>legacy encodings for those are all designed so that you can rely on CRLF
>to split the stream.
>
>Dan the Encode Maintainer.
-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/
