RE: extracting words
> - line break (wrapping lines on the screen)
> - word break (for selection)
> - word/root extraction (for search)
>
> I recognize that the second and third case are really difficult to handle.

Root extraction is decidedly non-trivial and a highly language-specific problem, even more so than word breaking: it's a messy linguistic problem rather than a clean algorithmic one. To start with, the choice of the term "extraction" shows that one has not understood the problem in all its g(l)ory: a more appropriate term would be "finding", or maybe "reducing", the root.

Also, I would add "syllablization" (is that a word?) as a third problem (for breaking words more nicely into lines); it would rank in difficulty somewhere between word breaking and root extraction.

> But for word wrapping I assume line breaking is sufficient. But when I don't have spaces to use for wrapping and/or don't know whether the actual text uses spaces at all (what about exotic languages like Ogham or Anglo-Saxon?), then how can I implement word wrapping? Simply do it character by character?
RE: extracting words
> - line break (wrapping lines on the screen)
> - word break (for selection)
> - word/root extraction (for search)
>
> I recognize that the second and third case are really difficult to handle.

Jarkko Root extraction is decidedly non-trivial and a highly
Jarkko language-specific problem, even more so than word breaking, it's a
Jarkko messy linguistic problem instead of a clean algorithmic problem.
Jarkko To start with, the choice of the term "extraction" shows that one
Jarkko has not understood the problem in all its g(l)ory: a more
Jarkko appropriate term would be "finding", or maybe, "reducing" the
Jarkko root.

The words we use in computational linguistics are "stemming" and, less frequently, "lemmatization." This is often the step in morphological analysis that precedes determining the part of speech. Jarkko is right that it is a messy problem for many languages.

Jarkko - "syllablization" (is that a word?) as a third problem (for
Jarkko breaking words more nicely into lines), it would rank in
Jarkko difficulty somewhere between word breaking and root extraction.

I believe "syllabization" or perhaps "syllabification" might be the term.

> But for word wrapping I assume line breaking is sufficient. But when I don't have spaces to use for wrapping and/or don't know whether the actual text part uses spaces at all (what about exotic languages like Ogham or Anglo-Saxon?) then how can I go to implement word wrapping? Simply do it character by character?

Spaces and other punctuation come in handy for line breaking. Segmentation is used with scripts that don't use this sort of intra-sentence term separation (e.g. Chinese, Japanese, Thai). There are whole conferences devoted to segmentation approaches. Another messy area of computational linguistics :-)

If segmentation is not available, then lines are often wrapped between characters.

-
Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, Dept. 3CRL
Las Cruces, NM 88003

    But there is no doubt but money is to the fore now. It is the romance, the poetry of our age. It's the thing that chiefly strikes our imagination.
        -- The Rise of Silas Lapham, W. D. Howells
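The fallback described above (break at spaces where the text has them, otherwise wrap between characters) can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the function name `wrap` is made up for the example, width is counted in code points rather than display columns, and no attempt is made to keep combining sequences together or to apply the UAX #14 rules.

```python
def wrap(text: str, width: int) -> list[str]:
    """Greedy wrap: break at spaces where possible, else between characters."""
    if width < 1:
        raise ValueError("width must be at least 1")
    lines = []
    current = ""
    for word in text.split(" "):
        # A run longer than the line offers no space to break at,
        # so fall back to breaking between characters (code points).
        while len(word) > width:
            if current:
                lines.append(current)
                current = ""
            lines.append(word[:width])
            word = word[width:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= width:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


print(wrap("the quick brown fox", 10))  # spaces available
print(wrap("abcdefghij", 4))            # no spaces: character-by-character
```

A real implementation would replace the character fallback with segmentation for scripts such as Thai, where breaking between arbitrary characters is not acceptable.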
[OT?] Re: extracting words
In a message dated 2001-02-12 8:54:10 Pacific Standard Time, [EMAIL PROTECTED] writes:

> Also, I would add - "syllablization" (is that a word?) as a third problem (for breaking words more nicely into lines), it would rank in difficulty somewhere between word breaking and root extraction.

I think the canonical word is "syllabification," but from a word-inventing perspective, I agree with Jarkko's first instinct. The suffix "-ize" seems more appropriate to the process being discussed than "-fy".

-Doug Ewell
Fullerton, California
FW: Doubt about XML
-----Original Message-----
From: Miguel Angel Lopez [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 12, 2001 2:36 AM
To: [EMAIL PROTECTED]
Subject: Doubt

Good morning. I am writing from Spain. I have a question, and I wonder if you can help me.

I want the attributes in my XML file to be required, so I put the following declaration in my DTD file:

    <!ATTLIST Identificacion
        IdCliente NMTOKEN  #REQUIRED
        Nombre    NMTOKENS #REQUIRED
        Apellido1 NMTOKENS #REQUIRED
        Apellido2 CDATA    #IMPLIED>

The problems are:

1. If the IdCliente attribute in the corresponding XML file equals "", no error is produced!
2. If the Nombre attribute in the corresponding XML file equals "JohnO", it is OK, but if it equals "John'O" (nfpl !!!), it gives an error!

How can I get an error in the first case and no error in the second?

Thanks, and excuse my poor English.
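For readers following the question: in XML 1.0 an attribute declared NMTOKEN must contain one or more name characters, so an empty value is invalid, but only a validating parser checks this. The apostrophe is not a name character, which is why "John'O" is rejected; declaring the attribute as CDATA would allow it. The sketch below only approximates that rule in Python. The helper names are invented for illustration, and `\w` is somewhat broader than the NameChar production in the specification.

```python
import re

# Approximation of the XML 1.0 Nmtoken production: one or more name characters.
# Assumption: \w (letters, digits, "_") stands in for the exact NameChar set.
NMTOKEN = re.compile(r"[\w.\-:]+")


def is_nmtoken(value: str) -> bool:
    """True if the whole value looks like a single name token."""
    return NMTOKEN.fullmatch(value) is not None


def is_nmtokens(value: str) -> bool:
    """True if the value is a non-empty, whitespace-separated list of tokens."""
    tokens = value.split()
    return bool(tokens) and all(is_nmtoken(t) for t in tokens)


print(is_nmtoken(""))         # the empty IdCliente case
print(is_nmtokens("JohnO"))   # accepted
print(is_nmtokens("John'O"))  # the apostrophe is not a name character
```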
Re: Teletext mappings
About this topic, please note (for what it's worth) that I did such a mapping a while ago, when preparing the Canadian standards CAN/CSA Z243.4.1 (Ordering standard for French and English) and CAN/CSA Z243.230 (Localization parameters for French and English as used in Canada).

It is possible that I goofed on some characters, though, in the absence of any clue, particularly for non-spacing characters and particularly because I went beyond Telelex, including NAPLPS CS (the North American videotex character set, still in use). Dr Umamaheswaran revised this data at IBM, but I don't know whether that company had better data than I had. I must admit that I had to make some bold decisions (decisions not challenged for years)... If somebody is guilty of any mistake in those standards, I am...

In those standards I mapped all characters using U notation...

Alain LaBonté, Québec
Personal page: http://www.iquebec.com/cyberiel
Re: extracting words
From: "Kenneth Whistler" [EMAIL PROTECTED]

> the tsek (U+0F0B) that roughly occurs between syllables. Yes, Tibetanists, I know that the term "syllable" is not technically correct here, so please don't nitpick me to death on this one. ;-)

Ironically enough, there are a number of native speakers who struggle with the fact that "syllable" is apparently the best available word for them, if all of the usual connotations could be dispensed with.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/
Korean linebreaking and UTR14 (was Re: extracting words)
On Sun, 11 Feb 2001, Mark Davis wrote:

MD Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
MD recommended in my last message. The Unicode standard is online, as is the
MD TR. Both can be found by going to www.unicode.org, and selecting the right
MD topic. The TR in particular discusses the recommended approach to line break
MD in great detail.

As I wrote when TUS 3.0 came out, I cannot help wondering where the idea behind the following passage in the TR on line breaking (and what is written about it in Chapter 5 of TUS 3.0) came from.

UTR14 Korean may alternately use a space-based (style 1) instead of the
UTR14 style 2 context analysis.

UTR14 1. Korean uses either implicit breaking around
UTR14 Hangul and ideographs or uses spaces. Reference [1] shows
UTR14 how this can be elegantly handled by the second or third
UTR14 method. Only the intersection of ID/ID, AL/ID and ID/AL
UTR14 are affected. For alphabetic style line breaking, breaks
UTR14 for these four cases require space, for ideographic style
UTR14 line breaking, these four cases don't require spaces.

where style 1 and style 2 are defined as

UTR14 1. Western (spaces and hyphens are used to determine breaks)
UTR14 2. East Asian (lines can break anywhere, unless prohibited)

Let me make it clear that virtually NO books published in Korean use the space-based (style 1) line breaking rule. The style 2 line breaking rule is used *exclusively* for modern Korean text, no matter what some broken word processors for Korean offer as an alternative to style 2 and no matter what some web browsers do (e.g. Netscape 4.x; Mozilla fixed this problem).

I'm very alarmed to find that this 'misinformation' has crept into TUS and UTR14 (now UAX #14). It would be nice if somebody in charge could get this straightened out.

Regards,
Jungshik Shin
international characters in email subject line
Greetings! I would like to send email in international charsets. I am able to send the body using the desired charset but not the subject line. Any help would be appreciated. Thanks.
Re: international characters in email subject line
What mail program are you using? Many of them (Exchange, Outlook, etc.) do not support this. Some do not even support international text in the body.

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/

----- Original Message -----
From: "Raghu Kolluru" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Monday, February 12, 2001 2:37 PM
Subject: international characters in email subject line

> Greetings! I would like to send email in international charsets. I am able to send the body using the desired charset but not the subject line. Any help would be appreciated. Thanks.
Re: international characters in email subject line
Well, like I said, Outlook does not support this -- it will only use the default system code page (b.k.a. CP_ACP) for subject lines and any other part of the header.

michka

----- Original Message -----
From: "Raghu Kolluru" [EMAIL PROTECTED]
To: "'Michael (michka) Kaplan'" [EMAIL PROTECTED]; "Unicode List" [EMAIL PROTECTED]
Sent: Monday, February 12, 2001 3:29 PM
Subject: RE: international characters in email subject line

> I wrote a Java application which sends emails to a relay server (Postfix). My email client is Outlook, which does support international character sets. I can send/receive a non-ASCII encoded body but not the subject line.
>
> Probably this is a question for an SMTP newsgroup. Does anyone know the public email address of such a group?
>
> Thanks.
RE: international characters in email subject line
Michael,

Do you know of any email client which CAN do this and also display the From alias of the email in the desired charset?

Thanks.

-----Original Message-----
From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 12, 2001 3:31 PM
To: Raghu Kolluru; Unicode List
Subject: Re: international characters in email subject line

> Well, like I said Outlook does not support this -- it will only use the default system code page (b.k.a. CP_ACP) for subject lines and any other part of the header.
>
> michka
Re: international characters in email subject line
The email program I am using, mutt, can do this.

Kind regards
Keld Simonsen

On Mon, Feb 12, 2001 at 02:55:41PM -0800, Michael (michka) Kaplan wrote:

> What mail program are you using? Many of them (Exchange, Outlook, etc.) do not support this. Some do not even support international text in the body.
Re: international characters in email subject line
On Mon, 12 Feb 2001, Michael (michka) Kaplan wrote:

> From: "Raghu Kolluru" [EMAIL PROTECTED]
>
>> I would like to send email in international charsets. I am able to send the body using the desired charset but not the subject line.

The question is too vague. If you want help, you have to provide as much information as possible (which mail program, under which OS, for which character set). There are so many possibilities, and nobody would wish to go through all of them.

> What mail program are you using? Many of them (Exchange, Outlook, etc.) do not support this. Some do not even support international text in the body.

Mozilla and Netscape 6 support entering the Subject header in any script for which input methods are available/installed in the OS (MS-Windows, MacOS, Unix/X11). In this respect, the I18N of Mozilla/Netscape 6 is ahead of that of MS Outlook. The same is true of the display of subject headers in scripts that happen not to be supported by the default code page (to use MS terminology). BTW, one of the worst MUAs in terms of I18N (among the widely used ones) might be Eudora.

BTW, most modern Unix text-based mail programs (e.g. Pine, Mutt) work fine in this regard as long as you run them in a terminal that supports input/output of the charset you want to use (for UTF-8, the newest xterm works well for a pretty large range of the BMP).

Jungshik Shin
[OT]RE: international characters in email subject line
On Mon, 12 Feb 2001, Raghu Kolluru wrote:

> I wrote a Java application which sends emails to a relay server (Postfix).

When you write your Java application, note that 8-bit characters are explicitly prohibited in message headers (IETF STD 11/RFC 822). You need to encode them per IETF RFC 2047 (and RFC 2184, 2231). Some MTAs (mail transport agents) refuse to accept messages with 8-bit characters in the header, depending on their configuration. BTW, the header encoding is not just for working around those MTAs but also for identifying the MIME charset/encoding used and for allowing multiple MIME charsets/encodings to be mixed in the header (the latter might be moot when UTF-8 is used exclusively).

> My email client is Outlook, which does support international character sets. I can send/receive a non-ASCII encoded body but not the subject line.
>
> Probably this is a question for an SMTP newsgroup. Does anyone know the public email address of such a group?

The Usenet newsgroup comp.mail.mime is the best place to ask your question (it has a mail-submission address as well, but I don't know it). BTW, MS OE doesn't support it while Mozilla does support it.

Jungshik Shin

P.S. I'm afraid the Unicode mailing list server strips off too many header lines of messages. In this case and some others (e.g. when people talk about the safe 'transport' of UTF-8 messages), the 'X-Mailer:' header would be nice to have.
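To illustrate the RFC 2047 point, here is a small sketch of turning a non-ASCII subject into an ASCII-only "encoded word" and back. The original poster was working in Java; Python's standard `email.header` module is used here purely for illustration, and the helper names `encode_subject` and `decode_subject` as well as the sample subject are invented for the example.

```python
from email.header import Header, decode_header


def encode_subject(subject: str, charset: str = "utf-8") -> str:
    """Return an ASCII-only RFC 2047 encoded form of the subject."""
    return Header(subject, charset=charset).encode()


def decode_subject(raw: str) -> str:
    """Decode RFC 2047 encoded words back into text."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            parts.append(chunk.decode(charset or "ascii"))
        else:
            parts.append(chunk)
    return "".join(parts)


subject = "Grüße aus Zürich"
wire = encode_subject(subject)
print(wire)                  # safe to place after "Subject: " in the header
print(decode_subject(wire))  # what a MIME-aware mail client would display
```

The encoded form names its charset, which is what lets a receiving client pick the right decoding regardless of its own default code page.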
Re: Korean linebreaking and UTR14 (was Re: extracting words)
Asmus Freytag is the one to talk to; he can look into this.

Mark

----- Original Message -----
From: "Jungshik Shin" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Monday, February 12, 2001 13:33
Subject: Korean linebreaking and UTR14 (was Re: extracting words)

> I'm very alarmed to find that this 'misinformation' has crept into TUS and UTR14 (now UAX #14). It would be nice if somebody in charge could get this straightened out.
Re: international characters in email subject line
On 12 Feb 2001, at 15:06, Michael (michka) Kaplan wrote on the subject "Re: international characters in ema":

> Well, like I said Outlook does not support this -- it will only use the default system code page (b.k.a. CP_ACP) for subject lines and any other part of the header.

On 12 Feb 2001, at 15:46, Jungshik Shin wrote on the subject "[OT]RE: international characters in":

> On Mon, 12 Feb 2001, Raghu Kolluru wrote:
>
>> My email client is Outlook, which does support international character sets. I can send/receive a non-ASCII encoded body but not the subject line.
>
> [snip]
>
> BTW, MS OE doesn't support it while Mozilla does support it.

This is simply not true! I know we all like to bash MS from time to time, but people really get far too carried away. I don't know if the above is true about Outlook (as my installation is stuffed as far as e-mail goes), but it is NOT TRUE about Outlook Express. OE encodes the subject line with the same encoding as the body and often (?) the From header as well.

Whether or not this works for you would probably depend on what OS you are using and what language features are installed. It works for me with OE 5.50.4133.2400 on Windows NT 4.0 SP5.

Of course, since my preferred mail program is Pegasus Mail, which can only be configured for one character set, I can't usually read such headers anyway.

`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~
Seán Ó Séaghdha [EMAIL PROTECTED]
When the wine is in, the sense is out. (Irish proverb)
Re: Korean linebreaking and UTR14 (was Re: extracting words)
On Mon, 12 Feb 2001, Mark Davis wrote:

Thank you for your answer.

> Asmus Freytag is the one to talk to; he can look into this.

Do you think I should contact him directly off-line? I thought he was on this list now, as he was back in March 2000 when I wrote about TUS 3.0 p. 124.

On Mon, 12 Feb 2001, "Jungshik Shin" [EMAIL PROTECTED] wrote:

> On Sun, 11 Feb 2001, Mark Davis wrote:
>
> MD Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
> MD recommended in my last message. The Unicode standard is online, as is
>
> As I wrote when TUS 3.0 came out, I cannot help wondering where the idea that leads to the following in the TR on line breaking (and what's written about it in Chap. 5 of TUS 3.0) came from.
>
> UTR14 Korean may alternately use a space-based (style 1) instead of the
> UTR14 style 2 context analysis.

BTW, this clearly shows that what Rick McGowan wrote about 'either ... or' in response to what I wrote about the Korean line breaking rule (TUS 3.0 p. 124) in March 2000 is not right, as I argued then. I'm sure he's right about 'either ... or' in English grammar, but the intention of the author is on my side if the author of UTR 14 is the same as that of the part in question in TUS 3.0. I'm enclosing at the end of this message a part of my message in response to him.

> I'm very alarmed to find this 'misinformation' crept into the TUS and UTR14 (now UAX #14). It would be nice if somebody in charge could get this straightened out.

The correction didn't make it into Unicode 3.1, either. What would be the best way to get it addressed before the next revision comes out? I'm afraid just raising it on this list wouldn't be sufficient (of course, I should have followed up more vigorously last year).

Regards,
Jungshik Shin

Enc. 1. Two messages of mine (the first one: March 1, 2000; the second one: March 2, 2000)

From: Jungshik Shin [EMAIL PROTECTED]
Subject: Korean line breaking rules : Unicode 3.0 (p. 124)
Date: Wed, 1 Mar 2000 19:23:23 -0800 (PST)

On Sun, 13 Feb 2000, Kenneth Whistler wrote:

> Lest anyone feel unduly constrained, let me note that now that the editorial committee has closed the book, so to speak, on Unicode 3.0, all of you who are about to open the book for the first time should feel free to unleash your commentary on the text.

I've just received my copy of the Unicode 3.0 book; here goes my first commentary.

On page 124 (section 5.15, Locating Text Element Boundaries), the third paragraph has the following near the end:

U3.0 In particular, word, line, and sentence boundaries will need to
U3.0 be customized according to locale and user preference. In Korean,
U3.0 for example, lines may be broken either at spaces(as in Latin text) or
U3.0 on ideographic boundaries (as in Chinese).

First of all, it's a great mystery to me how on earth this strange notion of Korean having *two* different line breaking rules (as opposed to one) crept into the expertise of non-Korean experts on Korean and finally made it into the Unicode 3.0 book and the Unicode TR on line breaking. None of the dozens of Korean books on my bookshelves that I've just gone through breaks lines *exclusively* at spaces. All of them break lines freely at *syllables*. The only places I can think of where lines are broken *exclusively* at spaces (for Korean text) are web browsers like Netscape and MS IE, which are completely *broken* as far as Korean line breaking is concerned, and possibly earlier implementations of Korean LaTeX. One may add to the list Korean text formatted by a non-localized version of 'fmt' (in Unix) as another example. To work around the problem caused by these broken web browsers, some Korean web authors apply a simple filter to their HTML files that inserts <wbr> between every pair of Korean syllables. To see what I mean, you may want to take a look at http://photon.hgs.yale.edu/~jungshik/lb.html and http://photon.hgs.yale.edu/~jungshik/lbscreenshot.jpg

Let me emphasize that lines can be broken at any syllable boundary in Korean text (except for some obvious exceptions that also apply in English text, i.e. punctuation marks like '!' and '?' cannot begin a line).

Secondly, even in Latin scripts (well, at least in English) lines can be broken not only at spaces but also at syllables (syllabic boundaries) with a hyphen. The only difference between Korean line breaking and English line breaking is that Korean doesn't need a hyphen when lines are broken at syllables, because in Korean syllables form another visual unit a level higher than alphabetic/phonetic letters (consonants and vowels).

Thirdly, the expression 'ideographic boundaries' is not appropriate; 'syllabic boundaries' or 'syllables' should be used instead.

Given these, I'd like to suggest that the last sentence (the one that begins with 'In Korean, for example...') be removed in a future edition, because Korean is NOT a good example of a case where there can be multiple line breaking rules depending on user preference.

Jungshik Shin

From: Jungshik Shin [EMAIL PROTECTED]
Subject: RE: Korean
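The workaround mentioned in the enclosed message, a filter that inserts a break opportunity between every pair of Korean syllables, could look roughly like the sketch below. It is not the filter those web authors actually used. The name `insert_wbr` is invented, only precomposed syllables in the Hangul Syllables block (U+AC00 to U+D7A3) are handled, and the input is assumed to be plain text content rather than full HTML with tags and attributes.

```python
import re

# A precomposed Hangul syllable that is immediately followed by another one.
# The lookahead leaves the second syllable available for the next match.
HANGUL_PAIR = re.compile(r"([\uAC00-\uD7A3])(?=[\uAC00-\uD7A3])")


def insert_wbr(text: str) -> str:
    """Insert <wbr> between every pair of adjacent Hangul syllables."""
    return HANGUL_PAIR.sub(r"\1<wbr>", text)


print(insert_wbr("한국어 text"))
```

Spaces and non-Hangul text are left untouched, so a browser that only breaks at spaces gains a break opportunity at each syllable boundary.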
Re: international characters in email subject line
On Mon, 12 Feb 2001, Sean O Seaghdha wrote:

> On 12 Feb 2001, Michael (michka) Kaplan wrote:
>
>> Well, like I said Outlook does not support this -- it will only use the default system code page (b.k.a. CP_ACP) for subject lines and any other part of the header.
>
> On 12 Feb 2001, Jungshik Shin wrote:
>
>> On Mon, 12 Feb 2001, Raghu Kolluru wrote:
>>
>>> My email client is Outlook, which does support international character sets. I can send/receive a non-ASCII encoded body but not the subject line.
>>
>> [snip]
>>
>> BTW, MS OE doesn't support it while Mozilla does support it.
>
> This is simply not true! I know we all like to bash MS from time to time, but people really get far too carried away. I don't know if the above is true about Outlook (as my installation is stuffed as far as e-mail goes), but it is NOT TRUE about Outlook Express. OE encodes the subject line with the same encoding as the body and often (?) the From header as well.

I stand corrected (thank you for correcting me). It's possible to enter text in any script supported by the IMEs installed on your system in both the Subject (and other headers) and the body of the message. However, what I wrote about the display of headers in scripts NOT supported by the default system code page still stands. For instance, MS OE cannot display Korean, Japanese, Chinese or Russian headers under English/French/Spanish/Italian/German MS-Windows in _the message *list* display pane_, which Mozilla can. (MS OE can display those headers for individual messages, though.) Not having checked out MS OE for a while, I was a bit confused about what is possible and what is not.

Anyway, my comment and michka's have *nothing* to do with MS bashing. I was just giving what I believed to be facts, one of which turned out not to be true. Please note that Michael (michka) Kaplan is, I guess, one of the last persons on this list to say something untrue just to make MS look bad. Of course, by this I'm not implying by any means that there are people on this list who would do that.

Jungshik Shin
Re: extracting words
On Sun, 11 Feb 2001, Mark Davis wrote:

> BTW, someone on this thread made this topic out to be even more complex than it is: that Devanagari and Korean are written without spaces. While that may have been the case historically, I believe that the modern text does use spaces. Chinese, Japanese and Thai are the main languages written without spaces.

As I wrote earlier, and as you correctly believe, spaces are used to separate words in Korean text. That has been the case at least since the Korean Linguistic Society (KLS: Hangul Hakhoe) published the unified rules of Korean orthography in 1933. The practice of using spaces must have been predominant well before that, because otherwise the Korean Linguistic Society might not have come up with that rule. The orthographic standards of both North and South Korea agree on this point.

More details are available at http://www.hangeul.or.kr, in Korean only. The full texts of the various standards at the site (four orthographic standards: KLS 1933 and 1980, North Korea 1987, South Korea MOE 1988; transliteration of foreign words in Hangul, South Korea MOE 1985; transcription of Korean in the Roman alphabet) are only available in HWP format (HWP is one of the most popular word processors in Korea), which can be viewed with the Namo HWP viewer for MS-Windows at http://www.namo.co.kr/download/dwn_hwpv.html.

People in the US may find that the bottom of each page gets cropped if printed directly from the Namo HWP viewer, as the files are made for A4 paper. A way around this is to print to a file (using a PS printer driver) and use ghostscript to print (using PDFWriter may do the same trick). If interested, drop me a line off-line and I'll send a copy in either PDF or PS (resized to better fit US letter paper if necessary).

Jungshik Shin
Re: international characters in email subject line
On 12 Feb 2001, at 20:40, Jungshik Shin wrote on the subject "Re: international characters in ema":

> I stand corrected (thank you for correcting me). It's possible to enter whatever script is supported by the IMEs installed on your system in both the Subject (and other headers) and the body of the message. However, what I wrote about the display of headers in scripts NOT supported by the default system code page still stands. For instance, MS OE cannot display Korean, Japanese, Chinese, Russian headers under English/French/Spanish/Italian/German MS-Windows in _the message *list* display pane_, which Mozilla can. (MS OE can display those headers for individual messages, though.)

Thank you for your clarification. MS OE doesn't show any characters outside the system code page in the message list, only in the preview pane and message windows.

`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~
Seán Ó Séaghdha [EMAIL PROTECTED]
Strength does not last. (Irish proverb)
Re: international characters in email subject line
From: "Jungshik Shin" [EMAIL PROTECTED]

> Please, note that Michael (michka) Kaplan, I guess is, one of the last persons on this list to say something not true just to make MS look bad.

There are a few program managers in Office and Visual Studio who might disagree with this statement -- they seem to think I live to bash Microsoft. They are mistaken, sadly. But no company is above having its boneheaded decisions called out, something not everyone there understands. It's nice that you do, though. :-)

> Of course, by this I'm not implying by any means that there are some people who would do that on this list.

It's OK, we all know that such people exist; heck, we probably all know who they are, too. As long as we don't name names, no one can claim to be offended unless they have felon's guilt or something. :-)

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/
Re: international characters in email subject line
On 12 Feb 2001, at 20:28, Alain LaBonté wrote on the subject "Re: international characters in ema":

> At 19:53 01-02-12 -0800, Sean O Seaghdha wrote:
>
>> Of course, since my preferred mail program is Pegasus Mail, which can only be configured for one character set, I can't usually read such headers anyway.
>
> [Alain] Some years ago, I was also using Pegasus Mail and I was not satisfied with this. I then communicated with the author directly (he lives in southern New Zealand); we engaged in a series of exchanges and I got him to agree to carry the character set in use through without conversion [in my case the Windows character set]... You have to use a parameter for this; this is the compromise he made me accept, because he was really impressed by the SMTP 7-bit-only-headers dogma -- which does not impress me, since it works anyway with 8-bit-clean systems [predominant nowadays in the world since a serious security breach, I was told, was corrected with an 8-bit-clean-enabling SMTP patch].

I think there are a couple of different issues here. As far as storage on disk goes, I think this changed some time back, so that now you have to use the switch to get the old behaviour (converting messages to one code page on disk), which was retained for compatibility with the DOS version.

You can send 8-bit mail with Pegasus by changing a setting in Options, but when you switch it on you get a stern warning about it being "formally illegal", and a "Comments" header is added to each message.

I have suggested from time to time over the last few years that Pegasus be made Unicode aware, but I get the impression it's considered "too hard" or "too complicated", although I don't think I've actually got a reply on this from the author, David Harris. Since there will not be another 16-bit Windows version and the Macintosh version has not been updated in a long time, this leaves only the DOS and Win32 versions. Hopefully, this will mean that Unicode will become more of a viable option for him in the future. At the moment, though, he seems quite busy enough adding HTML mail composition to version 4.

`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~
Seán Ó Séaghdha [EMAIL PROTECTED]
Calumnies are answered best with silence. -- Ben Jonson.