Re: iso-2022-jp problem
On Monday, April 15, 2002, at 07:29 , Nick Ing-Simmons wrote:
> I tracked down the problem tkmail was/is having with iso-2022-jp.
> The snag is I am using the API the way I designed it, not the way it is reliably implemented. When called thus:
>
>   my $decoded = $enc->decode($encoded,1);
>
> decode is supposed to return the portion it can decode, and set $encoded to what remains.

Ah, I see. But it is a pain in the arse for doubly-encoded encodings like ISO-2022-JP.

Here is the problem. As you see, to decode ISO-2022-JP, we first have to decode it into EUC-JP. And ISO-2022-JP -> EUC-JP is treated (and should be treated) purely as a CES, so there is no chance for error (unless there is a bogus escape sequence). However, errors may arise when you try to convert the resulting EUC-JP stream to UTF-8. The problem is that not all of the possible code points in JIS X 0208 and JIS X 0212 are actually used: of the 94x94 = 8836 available in each, only 6884 are used in 0208 and 6072 are used in 0212. So the remainder won't map to Unicode.

It was possible to use jis02*-raw instead of EUC-JP, but that implementation was too slow because you have to invoke encode() chunk by chunk. In fact I tried, and it got 3 times as slow.

And the notion of "what remains" gets moot when it comes to ISO-2022. Suppose you got a string like this:

  abcdESC-to-jis0208cdefghijklmnESC-to-asciiopqrstu
                      ^^error occurs here.

What's the remaining stream?

  ghijklmnESC-to-asciiopqrstu

is WRONG because we are now in a jis0208 chunk and the escape sequence is already stripped. Do we have to go like

  ESC-to-jis0208ghijklmnESC-to-asciiopqrstu

? But that slows down the encoder too much.

I just woke up. Let me think about this a little bit more.

Dan the Encode Maintainer
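The "what remains" problem described above can be reproduced with Encode itself. What follows is only a sketch, assuming a Perl with the Encode::JP encodings installed. The sample text and the cut position are made up for illustration, and the cut offset assumes the encoder emits the three-byte ESC $ B sequence for JIS X 0208.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Illustrative sample: ASCII, three kanji, ASCII.
my $text  = "abcd\x{65e5}\x{672c}\x{8a9e}opqrstu";
my $bytes = encode('iso-2022-jp', $text);

# Assumed layout: "abcd" ESC $ B <3 x 2 bytes> ESC ( B "opqrstu".
# Cut just after the first two-byte character, inside the JIS X 0208 run.
my $cut  = 4 + 3 + 2;
my $tail = substr($bytes, $cut);

# Without its escape sequence the tail is read as if it started in ASCII.
my $bare = decode('iso-2022-jp', $tail);

# Re-prepending ESC-to-jis0208 restores the shift state.
my $restored = decode('iso-2022-jp', "\e\$B" . $tail);
my $expected = "\x{672c}\x{8a9e}opqrstu";

print $bare     eq $expected ? "bare tail: ok\n"     : "bare tail: garbled\n";
print $restored eq $expected ? "restored tail: ok\n" : "restored tail: garbled\n";
```

The bare tail decodes without raising an error, which is why a wrong "remains" value is easy to miss.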
Re: iso-2022-jp problem
On Tuesday, April 16, 2002, at 12:00 , Nick Ing-Simmons wrote:
>> abcdESC-to-jis0208cdefghijklmnESC-to-asciiopqrstu
>>                     ^^error occurs here.
>> What's the remaining stream?
>> ghijklmnESC-to-asciiopqrstu
>
> Does not matter for that case. "Does not map" is a fatal error with $chk true (and would have become a replacement char if $chk was false).
>
> What matters is being able to tell the complete case from the partial case.
> A. When you have converted the whole thing, set remains to ''.
> B. When you have a partial encoding, consume as much as you can and leave the string with what is partial, e.g.
>
> abcdESC-to-jis0208cdefghijklmnESC-to -asciiopqrstu
>                                     ^- buffer boundary
>
> Then you return the translation of abcdESC-to-jis0208cdefghijklmn and set remains to ESC-to so that :encoding can append -asciiopqrstu

One of the many reasons that programmers dislike 7bit ISO-2022 is exactly how to handle case B -- how to split the buffer in the middle. When handling 7bit ISO-2022, YOU ARE NOT SUPPOSED TO SPLIT THE BUFFER BY LENGTH.

Of course that causes a problem for large files and, even worse, network streams. But fortunately, 7bit ISO-2022 has one safety net for that: IT ALWAYS REVERTS TO ASCII BEFORE CONTROL CHARACTERS, including CRLF. So if you need to, you can safely split the buffer line by line. A script

  binmode(STDOUT, ":utf8");
  while(<>){
      print Encode::decode("iso-2022-jp", $_);
  }

is completely safe because $_ is guaranteed to start in ASCII and end in ASCII. Check RFC 1468 (http://www.ietf.org/rfc/rfc1468.txt and others). It is not as complicated as it sounds.

> If you cannot do that then don't return or consume anything so :encoding can keep appending till you have the whole file, but that is going to be very memory hungry.

As I said, if you are worried about memory, a line buffer is the answer.

Other encodings are subject to this boundary problem -- and solution. Arabic and Hebrew (BIDI boundary), Thai (word boundary), Hangul (for decomposed form), you name it. But very fortunately, the legacy encodings for all of these are designed so that you can rely on CRLF to split the stream.

Dan the Encode Maintainer.
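The line-by-line claim above can be checked with a small round trip. This is a minimal sketch, assuming Encode with the iso-2022-jp encoding is installed; the sample text is invented, and it relies on the encoder returning to ASCII before each line feed, as described in the message.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Two illustrative lines that each switch into JIS X 0208 and back.
my $text  = "line one \x{65e5}\x{672c}\x{8a9e}\nline two \x{6f22}\x{5b57} end\n";
my $bytes = encode('iso-2022-jp', $text);

# Split after each LF. The byte 0x0A never occurs inside an escape
# sequence or inside a two-byte character (those use 0x21-0x7E).
my @lines = split /(?<=\n)/, $bytes;

# Each line is decoded on its own, with no state carried over.
my $decoded = join '', map { decode('iso-2022-jp', $_) } @lines;

print $decoded eq $text ? "round trip ok\n" : "round trip failed\n";
```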
Re: iso-2022-jp problem
Dan Kogai [EMAIL PROTECTED] writes:
>> B. When you have a partial encoding, consume as much as you can and leave the string with what is partial, e.g.
>>
>> abcdESC-to-jis0208cdefghijklmnESC-to -asciiopqrstu
>>                                     ^- buffer boundary
>>
>> Then you return the translation of abcdESC-to-jis0208cdefghijklmn and set remains to ESC-to so that :encoding can append -asciiopqrstu
>
> One of the many reasons that programmers dislike 7bit ISO-2022 is exactly how to handle case B -- how to split the buffer in the middle. When handling 7bit ISO-2022, YOU ARE NOT SUPPOSED TO SPLIT THE BUFFER BY LENGTH.

But perlio.c, and so :encoding, does buffer by size right now. So the question is how to proceed from where we are.

> Of course that causes a problem for large files and, even worse, network streams.

Which is what caused the problem to come to light.

> But fortunately, 7bit ISO-2022 has one safety net for that: IT ALWAYS REVERTS TO ASCII BEFORE CONTROL CHARACTERS, including CRLF. So if you need to, you can safely split the buffer line by line. A script
>
>   binmode(STDOUT, ":utf8");
>   while(<>){
>       print Encode::decode("iso-2022-jp", $_);
>   }
>
> is completely safe because $_ is guaranteed to start in ASCII and end in ASCII.

So what needs to happen is that :encoding(iso-2022-jp) needs to push a layer underneath itself which makes sure a whole line is available. You still need to clear the string to let me know that all is well. Or we can mess with the decoder to fake it as you suggested. (Tcl manages it ...)

> Check RFC 1468 (http://www.ietf.org/rfc/rfc1468.txt and others). It is not as complicated as it sounds.

I read that back before someone else implemented our first escape encoding.

>> If you cannot do that then don't return or consume anything so :encoding can keep appending till you have the whole file, but that is going to be very memory hungry.
>
> As I said, if you are worried about memory, a line buffer is the answer.

So we need some way of telling from an encoding object (e.g. an attribute or a method call) that it needs line buffering so that the :encoding layer can take the appropriate steps.

> Other encodings are subject to this boundary problem -- and solution. Arabic and Hebrew (BIDI boundary), Thai (word boundary), Hangul (for decomposed form), you name it. But very fortunately, the legacy encodings for all of these are designed so that you can rely on CRLF to split the stream.
>
> Dan the Encode Maintainer.

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/
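The idea of a layer that only hands whole lines to the decoder can be sketched in plain Perl. This is not how perlio.c or :encoding is implemented; `decode_whole_lines` is a hypothetical helper, and the 7-byte chunk size merely simulates size-based buffering.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: decode everything up to the last LF in the buffer
# and leave the partial line behind in the caller's buffer.
sub decode_whole_lines {
    my ($enc_name, $buf_ref) = @_;
    my $cut   = rindex($$buf_ref, "\n") + 1;     # 0 when there is no LF yet
    my $ready = substr($$buf_ref, 0, $cut, '');  # 4-arg substr removes it from the buffer
    return '' unless length $ready;
    return decode($enc_name, $ready);
}

my $text  = "abcd\x{65e5}\x{672c}\x{8a9e}efgh\nsecond \x{6f22}\x{5b57} line\n";
my $bytes = encode('iso-2022-jp', $text);

# Simulate reads that split the stream by length, not by line.
my ($buf, $out) = ('', '');
for (my $pos = 0; $pos < length $bytes; $pos += 7) {
    $buf .= substr($bytes, $pos, 7);
    $out .= decode_whole_lines('iso-2022-jp', \$buf);
}
# At end of input, whatever is left is the final (unterminated) line.
$out .= decode('iso-2022-jp', $buf) if length $buf;

print $out eq $text ? "buffered decode ok\n" : "buffered decode failed\n";
```

The trade-off is the one raised in the thread: memory use is bounded by the longest line instead of the read size.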
Re: iso-2022-jp problem
On Tuesday, April 16, 2002, at 01:06 , Nick Ing-Simmons wrote:
> So we need some way of telling from an encoding object (e.g. an attribute or a method call) that it needs line buffering so that the :encoding layer can take the appropriate steps.

Okay, which way do you like, attribute or method? I think a method is more elegant, but an attribute seems easier to fetch. Since this is more for PerlIO than for Encode itself, I would appreciate it if you gave me the API (just the name would be enough), and I will add it to the ISO-2022 stuff (not just JP; KR has one, too).

Dan
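For readers following up on this question: later Encode releases document a `needs_lines` method on Encode::Encoding for this purpose. The sketch below assumes such a release is installed and guards with `can()` in case it is not; `wants_line_buffer` is a hypothetical helper name, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: true if the named encoding asks for line buffering.
sub wants_line_buffer {
    my ($name) = @_;
    my $enc = find_encoding($name) or return 0;
    # Guard with can() in case the installed Encode predates the method.
    my $method = $enc->can('needs_lines') or return 0;
    return $enc->$method() ? 1 : 0;
}

for my $name (qw(iso-2022-jp euc-jp)) {
    printf "%-12s needs line buffering: %s\n",
        $name, wants_line_buffer($name) ? 'yes' : 'no';
}
```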
Re: README.jp, README.tw, README.cn, README.kr
Hi,

Attached is README.ko (per Jarkko's suggestion, I used 'ko' instead of 'kr') in EUC-KR encoding. North Korea has its own 94 x 94 coded character set (KPS 9566-97: ISO-IR 202), but a few web pages set up for/by North Korean companies (and possibly the government?) whose URLs I happen to know use EUC-KR. I also added what Autrijus added to README.tw.

Cheers,
Jungshik

README.ko
Description: README in Korean in EUC-KR
Re: README.jp (or README.jp?)
On Tuesday, April 16, 2002, at 08:14 , Jarkko Hietaniemi wrote:
> Could I ask for the Japanese translation? (Check out Autrijus' latest message about the subject, they had a useful additional section.)

Sorry. I was too preoccupied w/ the module itself. Will be submitted before I go to bed.

Dan