Re: iso-2022-jp problem

2002-04-15 Thread Dan Kogai

On Monday, April 15, 2002, at 07:29 , Nick Ing-Simmons wrote:
 I tracked down the problem tkmail was/is having with iso-2022-jp.
 The snag is I am using the API the way I designed it, not the way
 it is reliably implemented.

 When called thus:

 my $decoded = $enc->decode($encoded,1);

 decode is supposed to return the portion it can decode, and set $encoded
 to what remains.
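
A minimal sketch of that contract, for readers using a current Encode release. Two assumptions are worth flagging: it uses UTF-8 rather than ISO-2022-JP to keep the partial input easy to see, and it passes Encode::FB_QUIET as the check argument, which is the documented value for "return what was decoded, leave the rest in the source" today (a literal 1 is now FB_CROAK and dies on error).

```perl
use strict;
use warnings;
use Encode ();

# "caf\xC3" is "caf" plus only the first byte of a two-byte UTF-8 character.
my $encoded = "caf\xC3";
my $decoded = Encode::decode('UTF-8', $encoded, Encode::FB_QUIET);

# decode() returned the portion it could handle ("caf") and overwrote
# $encoded with the unprocessed remainder.
my $leftover = $encoded;

# Append the next chunk to the remainder and decode again.
$encoded .= "\xA9!";
$decoded .= Encode::decode('UTF-8', $encoded, Encode::FB_QUIET);

# An empty $encoded now signals that everything was consumed.
printf "decoded %d characters, %d bytes left over\n",
    length($decoded), length($encoded);
```

The caller can tell the complete case from the partial case purely by checking whether the source buffer is empty after the call.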

Ah, I see.  But it is a pain in the arse for doubly-encoded encodings 
like ISO-2022-JP.

Here is the problem.  As you see, to decode ISO-2022-JP, we first have 
to decode it into EUC-JP.  And ISO-2022-JP -> EUC-JP is treated (and 
should be treated) purely as a CES, so there is no chance of error 
(unless there is a bogus escape sequence).  However, errors may arise 
when you try to convert the resulting EUC-JP stream to UTF-8.

The problem is that not all of the possible code points in JIS X 0208 
and JIS X 0212 (94x94 = 8836 each) are actually used: only 6884 are 
used in 0208 and 6072 are used in 0212.  So the remainder won't map to 
Unicode.
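
To put numbers on the gap, here is a small arithmetic sketch. The counts are taken as stated in the message above and are not independently checked against the mapping tables.

```perl
use strict;
use warnings;

# Each 94x94 coded character set has this many cells.
my $cells = 94 * 94;

# Used code points, as quoted in the message (assumed, not re-counted).
my %used = ('JIS X 0208' => 6884, 'JIS X 0212' => 6072);

for my $set (sort keys %used) {
    printf "%s: %d of %d cells used, %d with no Unicode mapping\n",
        $set, $used{$set}, $cells, $cells - $used{$set};
}
```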

It was possible to use jis02*-raw instead of EUC-JP, but that 
implementation was too slow because you have to invoke encode() chunk by 
chunk.  In fact I tried it and it was 3 times slower.

And the very notion of "what remains" becomes moot when it comes to 
ISO-2022.  Suppose you got a string like this:

abcdESC-to-jis0208cdefghijklmnESC-to-asciiopqrstu
                    ^^ error occurs here.

What's the remaining stream?

ghijklmnESC-to-asciiopqrstu


is WRONG because we are now in a jis0208 chunk and the escape sequence 
is already stripped.  Do we have to go like this?

ESC-to-jis0208ghijklmnESC-to-asciiopqrstu

But that slows down the encoder too much.  I just woke up.  Let me 
think about this a little bit more.
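
For illustration only, this is roughly what "put the escape sequence back on the remainder" would involve. remainder_with_state is a hypothetical helper, not part of Encode; it assumes the offset falls on a character boundary and only knows the four escape sequences listed in RFC 1468. It rescans everything before the offset, which hints at where the cost comes from.

```perl
use strict;
use warnings;

# The four escape sequences of RFC 1468: ASCII, JIS X 0201 Roman,
# JIS C 6226-1978 and JIS X 0208-1983.
my $ESC_RE = qr/\e(?:\(B|\(J|\$\@|\$B)/;

# Hypothetical helper: return the bytes from $offset onward, prefixed
# with whichever escape sequence was in effect at that point.
sub remainder_with_state {
    my ($bytes, $offset) = @_;
    my $active = "\e(B";                  # a stream starts in ASCII
    my $head   = substr($bytes, 0, $offset);
    while ($head =~ /($ESC_RE)/g) {       # last escape seen wins
        $active = $1;
    }
    my $rest = substr($bytes, $offset);
    return $active eq "\e(B" ? $rest : $active . $rest;
}

# "abcd", then three JIS X 0208 characters, then back to ASCII.
my $stream = "abcd\e\$B0!0#0%\e(Bopq";

# Offset 9 is just after the first double-byte character.
my $rest = remainder_with_state($stream, 9);
printf "remainder is %d bytes and starts in %s mode\n",
    length($rest), ($rest =~ /\A\e\$B/ ? "JIS X 0208" : "ASCII");
```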

Dan the Encode Maintainer




Re: iso-2022-jp problem

2002-04-15 Thread Dan Kogai

On Tuesday, April 16, 2002, at 12:00 , Nick Ing-Simmons wrote:
 abcdESC-to-jis0208cdefghijklmnESC-to-asciiopqrstu
                     ^^ error occurs here.

 What's the remaining stream?

 ghijklmnESC-to-asciiopqrstu

 Does not matter for that case.
 "does not map" is a fatal error with $chk true (and would have
 become a replacement char if $chk was false).

 What matters is being able to tell the complete case, from partial case.

  A. When you have converted whole thing set remains to ''.
  B. When you have a partial encoding consume as much as you can
 and leave string with what is partial.

 e.g.

 abcdESC-to-jis0208cdefghijklmnESC-to  -asciiopqrstu
                                     ^- buffer boundary

 Then you return translation of
 abcdESC-to-jis0208cdefghijklmn
 and set remains to Esc-to
 so that :encoding can append -asciiopqrstu
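
Case B can be sketched as a pure string operation. hold_back_partial_escape is a hypothetical helper name; it only deals with an escape sequence cut in two by the buffer boundary, and assumes the RFC 1468 sequences, all of which are ESC followed by "(" or "$" and one final byte. A real decoder would also have to remember the current character set between calls and avoid splitting a double-byte character.

```perl
use strict;
use warnings;

# Hypothetical helper: split a buffer into the part that is safe to
# translate now and a trailing, incomplete escape sequence to keep.
sub hold_back_partial_escape {
    my ($buf) = @_;
    # A bare ESC, or ESC plus its intermediate byte, at the very end
    # of the buffer is an escape sequence that has not fully arrived.
    if ($buf =~ s/(\e[\(\$]?)\z//) {
        return ($buf, $1);
    }
    return ($buf, '');                    # case A: nothing held back
}

# The buffer ends in the middle of the "back to ASCII" sequence ESC ( B.
my ($complete, $remains) = hold_back_partial_escape("abcd\e\$B0!0#\e(");

# The caller translates $complete and prepends $remains to the next chunk.
my $next_chunk = $remains . "Bopq";
printf "translate %d bytes now, hold back %d\n",
    length($complete), length($remains);
```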

One of many reasons that programmers dislike 7bit ISO-2022 is exactly 
how to handle case B -- how to split the buffer in the middle.  When 
handling 7bit ISO-2022, YOU ARE NOT SUPPOSED TO SPLIT THE BUFFER BY 
LENGTH.  Of course that causes problems for large files and, even 
worse, network streams.  But fortunately, 7bit ISO-2022 has one safety 
net for that problem:  IT ALWAYS REVERTS TO ASCII BEFORE CONTROL 
CHARACTERS, including CRLF.  So if you need to, you can safely split 
the buffer line by line.  A script

use Encode;
binmode(STDOUT, ":utf8");
while(<>){
print Encode::decode("iso-2022-jp", $_);
}

is completely safe because $_ is guaranteed to start in ASCII and end in 
ASCII.
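
A small self-check of that claim, as a sketch: it round-trips a two-line sample through Encode's own iso-2022-jp encoder, splits the bytes after each newline, and decodes the pieces separately. The sample text is arbitrary, and the check only covers input that follows the "back to ASCII before end of line" rule.

```perl
use strict;
use warnings;
use Encode ();

# Two lines mixing ASCII and kanji/hiragana.
my $text  = "abc\x{65E5}\x{672C}\x{8A9E}\nxyz\x{3042}\x{3044}\n";
my $bytes = Encode::encode('iso-2022-jp', $text);

# Split after every newline, keeping the newline with its line.
my @lines = split /(?<=\n)/, $bytes;

# Decode each line on its own, then compare with decoding in one go.
my $by_line = join '', map { Encode::decode('iso-2022-jp', $_) } @lines;
my $at_once = Encode::decode('iso-2022-jp', $bytes);

print $by_line eq $at_once
    ? "line-by-line and whole-buffer decoding agree\n"
    : "mismatch\n";
```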

Check RFC 1468  (http://www.ietf.org/rfc/rfc1468.txt and others).  It is 
not as complicated as it sounds.

 If you cannot do that then don't return or consume anything
 so :encoding can keep appending till you have whole file but that
 is going to be very memory hungry.

As I said, if you are worried about memory, line buffering is 
the answer.

Other encodings are subject to this boundary problem -- and solution.  
Arabic and Hebrew (BIDI boundary), Thai (word boundary), Hangul (for 
decomposed form), you name it.  But very fortunately for all these, 
legacy encodings for those are all designed so that you can rely on CRLF 
to split the stream.

Dan the Encode Maintainer.




Re: iso-2022-jp problem

2002-04-15 Thread Nick Ing-Simmons

Dan Kogai writes:
  B. When you have a partial encoding consume as much as you can
 and leave string with what is partial.

 e.g.

 abcdESC-to-jis0208cdefghijklmnESC-to  -asciiopqrstu
                                     ^- buffer boundary

 Then you return translation of
 abcdESC-to-jis0208cdefghijklmn
 and set remains to Esc-to
 so that :encoding can append -asciiopqrstu

One of many reasons that programmers dislike 7bit ISO-2022 is exactly
how to handle case B -- how to split the buffer in the middle.  When
handling 7bit ISO-2022, YOU ARE NOT SUPPOSED TO SPLIT THE BUFFER BY
LENGTH.  

But perlio.c, and so :encoding, does buffer by size right now. 
So the question is how to proceed from where we are.

Of course that causes the problem for large files and even
worse, network streams.  

Which is what caused the problem to come to light.

But fortunately, 7bit ISO-2022 has one safety
net for that solution;  IT ALWAYS REVERTS TO ASCII BEFORE CONTROL
CHARACTERS, including CRLF.  So if you need it you can safely split
buffer line by line.  A script

use Encode;
binmode(STDOUT, ":utf8");
while(<>){
   print Encode::decode("iso-2022-jp", $_);
}

is completely safe because $_ is guaranteed to start in ASCII and end in
ASCII.

So what needs to happen is that :encoding(iso-2022-jp) needs to push 
a layer underneath itself which makes sure a whole line is available.
You still need to clear the string to let me know that all is well.
Or we can mess with the decoder to fake it as you suggested.
(Tcl manages it ...)


Check RFC 1468  (http://www.ietf.org/rfc/rfc1468.txt and others).  It is
not as complicated as it sounds.

I read that back before someone else implemented our first escape encoding.


 If you cannot do that then don't return or consume anything
 so :encoding can keep appending till you have whole file but that
 is going to be very memory hungry.

As I said, if you are worried about memory, line buffering is
the answer.

So we need some way of telling from an encoding object (e.g.
an attribute or a method call) that it needs line buffering
so that :encoding layer can take the appropriate steps.
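
For readers on a current Encode release: encoding objects there document a needs_lines method that answers this question. The sketch below only queries it; whether that method is the direct outcome of this thread is not something these messages establish.

```perl
use strict;
use warnings;
use Encode ();

# Ask each encoding object whether it wants line-buffered input.
for my $name ('iso-2022-jp', 'euc-jp', 'UTF-8') {
    my $enc = Encode::find_encoding($name)
        or die "unknown encoding: $name";
    printf "%-12s needs_lines: %s\n",
        $enc->name, ($enc->needs_lines ? 'yes' : 'no');
}
```

A layer such as :encoding could consult the method once when the encoding is attached and switch to line-at-a-time reads when it returns true.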


Other encodings are subject to this boundary problem -- and solution.
Arabic and Hebrew (BIDI boundary), Thai (word boundary), Hangul (for
decomposed form), you name it.  But very fortunately for all these,
legacy encodings for those are all designed so that you can rely on CRLF
to split the stream.

Dan the Encode Maintainer.
-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/






Re: iso-2022-jp problem

2002-04-15 Thread Dan Kogai

On Tuesday, April 16, 2002, at 01:06 , Nick Ing-Simmons wrote:
 So we need some way of telling from an encoding object (e.g.
 an attribute or a method call) that it needs line buffering
 so that :encoding layer can take the appropriate steps.

Okay, which way do you like, attribute or method?  I think a method is 
more elegant but an attribute seems easier to fetch.  Since this is more 
for PerlIO than Encode itself, I would appreciate it if you gave me the 
API (just the name would be enough) and I will add it to the ISO-2022 
stuff (not just JP but KR has one, too).

Dan




Re: README.jp, README.tw, README.cn, README.kr

2002-04-15 Thread Jungshik Shin


 Hi,

  Attached is README.ko (per Jarkko's suggestion, I used
'ko' instead of 'kr') in EUC-KR encoding. North Korea has its own 94
x 94 coded character set (KPS 9566-97: ISO-IR 202), but a few web pages
set up for/by North Korean companies (and possibly the government?),
whose URLs I happen to know, use EUC-KR.

  I also added what Autrijus added to README.tw.

  Cheers,

  Jungshik 



README.ko
Description: README  in Korean in EUC-KR 


Re: README.jp (or README.jp?)

2002-04-15 Thread Dan Kogai

On Tuesday, April 16, 2002, at 08:14 , Jarkko Hietaniemi wrote:
 Could I ask for the Japanese translation?  (Check out Autrijus' latest
 message about the subject, they had a useful additional section.)

Sorry.  I was too preoccupied w/ the module itself.  Will be submitted 
before I go to bed.

Dan