RE: extracting words

2001-02-12 Thread jarkko . hietaniemi

  - line break (wrapping lines on the screen)
  - word break (for selection)
  - word/root extraction (for search)
 
 I recognize that the second and third case are really 
 difficult to handle.

Root extraction is decidedly non-trivial and a highly language-specific
problem, even more so than word breaking: it's a messy linguistic problem
rather than a clean algorithmic one.
To start with, the choice of the term "extraction" shows that one has not
understood the problem in all its g(l)ory: a more appropriate term would be
"finding", or maybe, "reducing" the root.

Also, I would add

- "syllablization" (is that a word?) as a third problem (for breaking words
more nicely into lines), it would rank in difficulty somewhere between word
breaking and root extraction.

 But for word wrapping I assume line 
 breaking is sufficient. But when I don't have spaces to use 
 for wrapping and/or don't know whether the actual text part 
 uses spaces at all (what about exotic languages like Ogham or 
 Anglo-saxon?) then how can I go to implement word wrapping? 
 Simply do it character by character?
 



RE: extracting words

2001-02-12 Thread Mark Leisher


  - line break (wrapping lines on the screen)
  - word break (for selection)
  - word/root extraction (for search)
 
 I recognize that the second and third case are really difficult to
 handle.

Jarkko Root extraction is decidedly non-trivial and a highly
Jarkko language-specific problem, even more so than word breaking: it's a
Jarkko messy linguistic problem rather than a clean algorithmic one.
Jarkko To start with, the choice of the term "extraction" shows that one
Jarkko has not understood the problem in all its g(l)ory: a more
Jarkko appropriate term would be "finding", or maybe, "reducing" the
Jarkko root.

The words we use in computational linguistics are "stemming" and less
frequently "lemmatization."  This is often the step in morphological analysis
that precedes determining the part-of-speech.  Jarkko is right that it is a
messy problem for many languages.
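
To give a feel for why stemming gets messy, here is a toy suffix stripper in Python. It is only a sketch: the suffix list and the minimum stem length are assumptions made up for illustration, not a published stemming algorithm, and it handles English only.

```python
# Toy suffix stripper for English. The suffix list and the three-character
# minimum are illustrative assumptions, not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


for w in ("walking", "walked", "walks", "running", "went"):
    print(w, "->", naive_stem(w))
```

The regular forms collapse to "walk", but "running" becomes "runn" and "went" is never related to "go", which is the kind of language-specific mess being described.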

Jarkko - "syllablization" (is that a word?) as a third problem (for
Jarkko breaking words more nicely into lines), it would rank in
Jarkko difficulty somewhere between word breaking and root extraction.

I believe "syllabization" or perhaps "syllabification" might be the term.

 But for word wrapping I assume line breaking is sufficient. But when I
 don't have spaces to use for wrapping and/or don't know whether the
 actual text part uses spaces at all (what about exotic languages like
 Ogham or Anglo-saxon?) then how can I go to implement word wrapping?
 Simply do it character by character?
 
Spaces and other punctuation come in handy for line breaking.  Segmentation is
used with scripts that don't use this sort of intra-sentence term separation
(e.g. Chinese, Japanese, Thai).  There are whole conferences devoted to
segmentation approaches.  Another messy area of computational linguistics :-)
If segmentation is not available, then lines are often wrapped between
characters.
-
Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, Dept. 3CRL
Las Cruces, NM  88003

But there is no doubt but money is to the fore now.  It is the romance,
the poetry of our age.  It's the thing that chiefly strikes our imagination.
  -- The Rise of Silas Lapham, W. D. Howells
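
The fallback described above (break at a space when one is available, otherwise wrap between characters) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `wrap` is made up, width is counted in code points rather than display columns, and no segmentation or combining-character handling is attempted.

```python
def wrap(text: str, width: int) -> list[str]:
    """Greedy wrap: break at the last space that fits, else between characters."""
    lines = []
    while len(text) > width:
        cut = text.rfind(" ", 0, width + 1)  # last space at index <= width
        if cut <= 0:
            # No usable space: fall back to breaking between characters.
            lines.append(text[:width])
            text = text[width:]
        else:
            lines.append(text[:cut])
            text = text[cut + 1:]  # drop the space we broke at
    lines.append(text)
    return lines


print(wrap("the quick brown fox", 10))
print(wrap("abcdefghijklm", 5))
```

A real implementation would replace the character fallback with a segmenter for scripts such as Chinese, Japanese, or Thai where one is available.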



[OT?] Re: extracting words

2001-02-12 Thread DougEwell2

In a message dated 2001-02-12 8:54:10 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

 Also, I would add
  
  - "syllablization" (is that a word?) as a third problem (for breaking words
  more nicely into lines), it would rank in difficulty somewhere between word
  breaking and root extraction.

I think the canonical word is "syllabification," but from a word-inventing 
perspective, I agree with Jarkko's first instinct.  The suffix "-ize" seems 
more appropriate to the process being discussed than "-fy".

-Doug Ewell
 Fullerton, California



FW: Doubt about XML

2001-02-12 Thread Magda Danish (Unicode)



-Original Message-
From: Miguel Angel Lopez [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 12, 2001 2:36 AM
To: [EMAIL PROTECTED]
Subject: Doubt


Good morning. I write from Spain
I have one doubt, and I wonder if you can help me.

I want my XML file to have required attributes, so I put the following
ATTLIST declaration in my DTD file:

.
<!ATTLIST Identificacion IdCliente NMTOKEN #REQUIRED
                         Nombre NMTOKENS #REQUIRED
                         Apellido1 NMTOKENS #REQUIRED
                         Apellido2 CDATA #IMPLIED>


The problem is:
1. If the IdCliente attribute in the corresponding XML file equals "", no
error is reported!
2. If the Nombre attribute in the corresponding XML file equals "JohnO",
it's OK, but if it equals "John'O" (nfpl !!!), it gives an error!

How can I get an error in the first case and no error in the second?

Thanks and excuse my poor English
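
For reference, both symptoms can be checked against the XML 1.0 Nmtoken production, which requires one or more name characters. An empty value is therefore invalid for NMTOKEN (a validating parser should report it, so a missing error may mean validation is switched off), and an apostrophe is not a name character, so "John'O" is invalid for NMTOKENS but acceptable for CDATA. The Python sketch below is an ASCII-only approximation of that production; the function names are made up for illustration, and the real production also admits many non-ASCII letters.

```python
import re

# ASCII-only approximation of the XML 1.0 Nmtoken production
# (one or more name characters). Non-ASCII name characters are
# deliberately left out of this sketch.
NMTOKEN = re.compile(r"[A-Za-z0-9._:\-]+")


def is_nmtoken(value: str) -> bool:
    """True if value is a single (ASCII) name token."""
    return NMTOKEN.fullmatch(value) is not None


def is_nmtokens(value: str) -> bool:
    """True if value is one or more whitespace-separated name tokens."""
    tokens = value.split()
    return bool(tokens) and all(is_nmtoken(t) for t in tokens)


print(is_nmtoken(""))          # empty IdCliente
print(is_nmtokens("JohnO"))
print(is_nmtokens("John'O"))   # apostrophe is not a name character
```

If names containing apostrophes must be accepted, declaring Nombre as CDATA instead of NMTOKENS is one option.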








Re: Teletext mappings

2001-02-12 Thread Alain LaBonté 

About this topic, please note (for what it's worth) that I did such a 
mapping a while ago, in the making of Canadian standard CAN/CSA Z243.4.1 
(Ordering standard for French and English) and CAN/CSA Z243.230 
(Localization parameters for French and English as used in Canada). It is 
possible that I goofed on some characters though, in the absence of any clue,
particularly for non-spacing characters and particularly because I went
beyond Teletext, including NAPLPS CS (North American videotex character set,
still in use). Dr Umamaheswaran revised this data at IBM, but I don't know
whether that company had better data than I had. I must admit I had to make
some bold decisions (decisions not challenged for years)... If
there is somebody guilty of any mistake in those standards, I am... In
those standards I mapped all characters using U notation...

Alain LaBonté, Québec
Page personnelle : http://www.iquebec.com/cyberiel





Re: extracting words

2001-02-12 Thread Michael (michka) Kaplan

From: "Kenneth Whistler" [EMAIL PROTECTED]

 the tsek (U+0F0B) that roughly occurs between syllables. Yes, Tibetanists, I
 know that the term "syllable" is not technically correct here, so please don't
 nitpick me to death on this one. ;-)

Ironically enough, there are a number of native speakers who struggle with
the fact that "syllable" is apparently the best available word for them, if
all of the usual connotations could be dispensed with.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/





Korean line breaking and UTR14 (was Re: extracting words)

2001-02-12 Thread Jungshik Shin




On Sun, 11 Feb 2001, Mark Davis wrote:

MD Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
MD recommended in my last message. The Unicode standard is online, as is the
MD TR. Both can be found by going to www.unicode.org, and selecting the right
MD topic. The TR in particular discusses the recommended approach to line break
MD in great detail.

As I wrote when TUS 3.0 came out, I cannot help wondering where the idea
behind the following passages in the TR on line breaking (and the related
text in Chapter 5 of TUS 3.0) came from.

UTR14   Korean may alternately use a space-based (style 1) instead of the
UTR14   style 2 context analysis.

UTR14 1.  Korean uses either implicit breaking around
UTR14 Hangul and ideographs or uses spaces. Reference [1] shows
UTR14 how this can be elegantly handled by the second or third
UTR14 method. Only the intersection of ID/ID, AL/ID and ID/AL
UTR14 are affected. For alphabetic style line breaking, breaks
UTR14 for these four cases require space, for ideographic style
UTR14 line breaking, these four cases don't require spaces.

where style 1 and style 2 are defined as

UTR14 1. Western (spaces and hyphens are used to determine breaks)
UTR14 2. East Asian (lines can break anywhere, unless prohibited)


Let me make it clear that virtually NO books published in Korean use the
space-based (style 1) line breaking rule. The style 2 line breaking rule
is *exclusively* used for modern Korean text, no matter what some broken
word processors for Korean offer as an alternative to style 2 and what
some web browsers (e.g. Netscape 4.x; Mozilla fixed this problem) do.

I'm very alarmed to find that this 'misinformation' has crept into TUS and
UTR14 (now UAX #14). It would be nice if somebody in charge could get
this straightened out.

Regards,

Jungshik Shin




international characters in email subject line

2001-02-12 Thread Raghu Kolluru

Greetings!

I would like to send email in international charsets. I am able to send the
body using the desired charset but not the subject line.
Any help would be appreciated.
Thanks.



Re: international characters in email subject line

2001-02-12 Thread Michael (michka) Kaplan

What mail program are you using?

Many of them (Exchange, Outlook, etc.) do not support this. Some do not even
support international text in the body.

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/






Re: international characters in email subject line

2001-02-12 Thread Michael (michka) Kaplan

Well, like I said Outlook does not support this -- it will only use the
default system code page (b.k.a. CP_ACP) for subject lines and any other
part of the header.

michka

- Original Message -
From: "Raghu Kolluru" [EMAIL PROTECTED]
To: "'Michael (michka) Kaplan'" [EMAIL PROTECTED]; "Unicode List"
[EMAIL PROTECTED]
Sent: Monday, February 12, 2001 3:29 PM
Subject: RE: international characters in email subject line


 I wrote a Java application which sends emails to a relay server (Postfix).
 My email client is Outlook, which does support international character sets.
 I can send/receive non-ASCII encoded body but not the subject line.
 Probably this is a question for SMTP newsgroup. Does anyone know public
 email address of such a group?
 Thanks.

 





RE: international characters in email subject line

2001-02-12 Thread Raghu Kolluru

Michael,
Do you know of any email client which CAN do this and also display the from
alias of the email in the desired charset?
Thanks.

 




Re: international characters in email subject line

2001-02-12 Thread Keld Jørn Simonsen

The email program I am using, mutt, can do this.

Kind regards
Keld Simonsen

 
 



Re: international characters in email subject line

2001-02-12 Thread Jungshik Shin




On Mon, 12 Feb 2001, Michael (michka) Kaplan wrote:


 From: "Raghu Kolluru" [EMAIL PROTECTED]

  I would like to send email in international charsets. I am able to send the
  body using the desired charset but not the subject line.

The question is too vague. If you need help, you've got to
provide as much information as possible (what mail program, under what OS,
for what character set).  There are so many possibilities, and nobody
would wish to go through all of them.

 What mail program are you using?

 Many of them (Exchange, Outlook, etc.) do not support this. Some do not even
 support international text in the body.

Mozilla and Netscape 6 support entering a subject header in whatever script
has input methods available/installed in the OS (MS-Windows,
MacOS, Unix/X11). In this respect, I18N of Mozilla/Netscape 6 is ahead
of that of MS Outlook. The same is true of the display of subject headers
in scripts which happen not to be supported by the default codepage
(to use MS terminology). BTW, one of the worst MUAs in terms of I18N
(among the widely used) might be Eudora.

BTW, most modern Unix text-based mail programs (e.g. Pine, Mutt)
work fine in this regard as long as you run them under a terminal
that supports input/output of the charset you want to use
(for UTF-8, the newest xterm works well for a pretty large range
of the BMP).

Jungshik Shin




[OT]RE: international characters in email subject line

2001-02-12 Thread Jungshik Shin




On Mon, 12 Feb 2001, Raghu Kolluru wrote:

 I wrote a java application which sends emails to a relay server (Postfix).

When you write your Java application, note that any 8-bit character in the
header is explicitly prohibited (IETF STD 11/RFC 822).  You need to encode
them per IETF RFC 2047 (and RFC 2184, 2231). Some MTAs (mail transport agents)
refuse to accept messages with 8-bit characters in the header, depending on
the configuration. BTW, the header encoding is not just for working around
those MTAs but also for identifying the MIME charset/encoding
used and for allowing multiple MIME charsets/encodings to be mixed
in the header (the latter might be moot when UTF-8 is exclusively used).
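
As a rough illustration of what RFC 2047 encoding produces, here is a short sketch using Python's standard `email.header` module (the application in question is in Java, so this only shows the wire format, not that application's code). The sample subject text is an assumption chosen for illustration.

```python
from email.header import Header, decode_header

subject = "Grüße aus Köln"  # sample text, chosen for illustration

# Encode the subject as an RFC 2047 encoded-word; the result is pure ASCII
# of the form =?utf-8?...?= and is safe to place in a Subject: header.
encoded = Header(subject, "utf-8").encode()
print(encoded)

# Round-trip to confirm nothing was lost.
raw, charset = decode_header(encoded)[0]
assert raw.decode(charset) == subject
```

The encoded-word carries the charset name alongside the data, which is what lets a receiving client pick the right decoding even when several charsets are mixed in one header.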



 My email client is Outlook, which does support international character sets.
 I can send/receive non-ASCII encoded body but not the subject line.
 Probably this is a question for SMTP newsgroup. Does anyone know public
 email address of such a group?

Usenet newsgroup comp.mail.mime  is the best place to ask your question.
(it has the mail-submission address as well, but I don't know it)
BTW, MS OE doesn't support it while Mozilla does support it.

Jungshik Shin

P.S. I'm afraid the Unicode mailing list server strips off too many header
lines from messages. In this case and some other cases (e.g. when people
talk about the safe 'transport' of UTF-8 messages), an 'X-Mailer:' header
would be nice to have.






Re: Korean line breaking and UTR14 (was Re: extracting words)

2001-02-12 Thread Mark Davis

Asmus Freytag is the one to talk to; he can look into this.

Mark






Re: international characters in email subject line

2001-02-12 Thread Sean O Seaghdha

On 12 Feb 2001, at 15:06, Michael (michka) Kaplan wrote
on the subject "Re: international characters in ema":

 Well, like I said Outlook does not support this -- it will only use the
 default system code page (b.k.a. CP_ACP) for subject lines and any other
 part of the header.

On 12 Feb 2001, at 15:46, Jungshik Shin wrote
on the subject "[OT]RE: international characters in":

 On Mon, 12 Feb 2001, Raghu Kolluru wrote:

  My email client is Outlook, which does support international character
  sets. I can send/receive non-ASCII encoded body but not the subject line.
[snip]
 BTW, MS OE doesn't support it while Mozilla does support it.

This is simply not true!  I know we all like to bash MS from time to time,
but people really get far too carried away.  I don't know if the above is
true about Outlook (as my installation is stuffed as far as e-mail goes), but
it is NOT TRUE about Outlook Express.  OE encodes the subject line with the
same encoding as the body and often (?) the From header as well.

Whether or not this works for you would probably depend on what OS you are
using and what language features are installed.  It works for me with OE
5.50.4133.2400 on Windows NT 4.0 SP5.

Of course, since my preferred mail program is Pegasus Mail, which can only be
configured for one character set, I can't usually read such headers anyway.

`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~
 S e á n  Ó  S é a g h d h a   [EMAIL PROTECTED]

Nuair a bhíonn an fíon istigh, bíonn an ciall amuigh.  Seanfhocal.
(When the wine is in, the sense is out.  Proverb.)




Re: Korean line breaking and UTR14 (was Re: extracting words)

2001-02-12 Thread Jungshik Shin



On Mon, 12 Feb 2001, Mark Davis wrote:

Thank you for your answer.

 Asmus Freytag is the one to talk to; he can look into this.

Do you think I should contact him directly off-line? I thought he's on
this list now as well as  back in March 2000 when I wrote about TUS 3.0
p. 124.

  On Mon, 12  Feb 2001, "Jungshik Shin" [EMAIL PROTECTED] wrote:
  On Sun, 11 Feb 2001, Mark Davis wrote:
 
  MD Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
  MD recommended in my last message. The Unicode standard is online, as is

  As I wrote when TUS 3.0 came out, I cannot help wondering where the idea
  that leads to the following in the TR on line breaking (and what's written
  about it in Chap. 5 of TUS 3.0) came from.
 
  UTR14   Korean may alternately use a space-based (style 1) instead of the
  UTR14   style 2 context analysis.

BTW, this clearly shows that what Rick McGowan wrote about 'either ... or'
in response to what I wrote about the Korean line breaking rule (TUS 3.0
p. 124) in March 2000 is not right, as I argued then.  I'm sure he's
right about 'either ... or' in English grammar, but the intention of the
author is on my side if the author of UTR 14 is the same as that of the
part in question in TUS 3.0. I'm enclosing at the end of this message
a part of my message in response to him.


  I'm very alarmed to find this 'misinformation' crept into the UTS and
  UTR14 (now UAX #14). It would be nice if  somebody in charge could get
  this straightened.

A fix didn't make it into Unicode 3.1, either. What would be the best way
to get it addressed before the next revision comes out? I'm afraid just
raising it on this list wouldn't be sufficient (of course, I should
have followed up more vigorously last year).

Regards,

Jungshik Shin


Enc.

1. Two messages of mine
   the first one : March 1, 2000
   the second one: March 2, 2000

From: Jungshik Shin [EMAIL PROTECTED]
Subject: Korean line breaking rules : Unicode 3.0 (p. 124)
Date: Wed, 1 Mar 2000 19:23:23 -0800 (PST)

On Sun, 13 Feb 2000, Kenneth Whistler wrote:

 Lest anyone feel unduly constrained, let me note that now that
 the editorial committee has closed the book, so to speak, on Unicode 3.0,
 all of you who are about to open the book for the first time should
 feel free to unleash your commentary on the text.

   I've just received my copy of Unicode 3.0 book, here goes
my first commentary.

   On page 124 (section 5.15, Locating Text Element Boundaries),
the third paragraph has the following near the end:

U3.0 In particular, word, line, and sentence boundaries will need to
U3.0 be customized according to locale and user preference. In Korean,
U3.0 for example, lines may be broken either at spaces(as in Latin text) or
U3.0 on ideographic boundaries (as in Chinese).

  First of all, it's a great mystery to me how on earth this
strange notion of Korean having *two* different line breaking rules(as
opposed to one)  crept into the expertise of non-Korean experts on Korean
and finally made it into Unicode 3.0 book and Unicode TR on line breaking.

  None of the dozens of Korean books on my bookshelves that
I've just gone through breaks lines *exclusively* at spaces. All of them
break lines freely at *syllables*. The only places I can think of where lines
are broken *exclusively* at spaces (for Korean text) are completely
*broken* (as far as Korean line breaking is concerned) web browsers like
Netscape and MS IE, and possibly earlier implementations of Korean LaTeX.
One may add to the list Korean text formatted by a non-localized version
of 'fmt' (in Unix) as another example. To work around the problem caused
by these broken web browsers, some Korean web authors apply a simple
filter to their HTML files that inserts <wbr> between every pair of Korean
syllables. To see what I mean, you may want to take a look at
http://photon.hgs.yale.edu/~jungshik/lb.html and
http://photon.hgs.yale.edu/~jungshik/lbscreenshot.jpg
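
A filter of the kind described above could look roughly like the following Python sketch. It is not the filter those web authors actually used: the function name is made up, only precomposed Hangul syllables (U+AC00 to U+D7A3) are considered, and the input is assumed to be plain text rather than markup, so a real filter would have to skip over tags and attribute values.

```python
import re

# Zero-width position between two precomposed Hangul syllables
# (U+AC00..U+D7A3): a syllable behind, a syllable ahead.
BETWEEN_SYLLABLES = re.compile(r"(?<=[\uAC00-\uD7A3])(?=[\uAC00-\uD7A3])")


def add_wbr(text: str) -> str:
    """Insert a <wbr> break opportunity between adjacent Hangul syllables."""
    return BETWEEN_SYLLABLES.sub("<wbr>", text)


print(add_wbr("\ud55c\uad6d\uc5b4 text"))
```

Latin text and spaces pass through unchanged, so existing space-based break opportunities are unaffected.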

  Let me emphasize that lines can be broken at any syllable boundary
in Korean text (with some obvious exceptions that also apply to English
text: i.e. punctuation marks like '!' and '?' cannot begin a line).
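
The rule just stated (break at any syllable boundary, but never start a line with certain punctuation) can be sketched in Python as below. This is a simplified illustration, not the UTR14 algorithm: the function name and the small set of prohibited line-start characters are assumptions, and width is counted in code points rather than display columns.

```python
# Characters that may not begin a line; an illustrative subset only.
NO_LINE_START = set("!?,.)")


def break_anywhere(text: str, width: int) -> list[str]:
    """Break after any character, except where the next line would
    start with a prohibited character."""
    lines = []
    while len(text) > width:
        cut = width
        # Move the break left while the next line would start with
        # a character that may not begin a line.
        while cut > 1 and text[cut] in NO_LINE_START:
            cut -= 1
        lines.append(text[:cut])
        text = text[cut:].lstrip(" ")  # do not start a line with a space
    lines.append(text)
    return lines


print(break_anywhere("\uac00\ub098\ub2e4!", 3))
```

In the sample, the exclamation mark would otherwise begin the second line, so the break moves one syllable to the left.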

  Secondly, even in Latin scripts (well, at least in English) lines can
be broken not only at spaces but also at syllables (syllabic boundaries)
with a hyphen.  The only difference between Korean line breaking and English
line breaking is that Korean doesn't need a hyphen when lines are broken at
syllables, because in Korean syllables form another visual unit a level
higher than alphabetic/phonetic letters (consonants and vowels).

  Thirdly, the expression 'ideographic boundaries' is not appropriate;
'syllabic boundaries' or 'syllables' would be better.

  Given these, I'd like to suggest that the last sentence (the one that begins
with 'In Korean, for example...') be removed in a future edition, because
Korean is NOT a good example of a case where there can be multiple line
breaking rules depending on user preference.

Jungshik Shin

From: Jungshik Shin [EMAIL PROTECTED]
Subject: RE: Korean 

Re: international characters in email subject line

2001-02-12 Thread Jungshik Shin




On Mon, 12 Feb 2001, Sean O Seaghdha wrote:

 On 12 Feb 2001,   Michael (michka) Kaplan wrote:

  Well, like I said Outlook does not support this -- it will only use the
  default system code page (b.k.a. CP_ACP) for subject lines and any other
  part of the header.

 On 12 Feb 2001,  Jungshik Shin wrote:

  On Mon, 12 Feb 2001, Raghu Kolluru wrote:

   My email client is Outlook, which does support international character
   sets. I can send/receive non-ASCII encoded body but not the subject line.
 [snip]
  BTW, MS OE doesn't support it while Mozilla does support it.

 This is simply not true!  I know we all like to bash MS from time to time,
 but people really get far too carried away.  I don't know if the above is
 true about Outlook (as my installation is stuffed as far as e-mail goes), but
 it is NOT TRUE about Outlook Express.  OE encodes the subject line with the
 same encoding as the body and often (?) the From header as well.

I stand corrected (thank you for correcting me). It's possible to enter
text in whatever script is supported by the IMEs installed on your system in
both the Subject (and other headers) and the body of the message. However, what
I wrote about the display of headers in scripts NOT supported
by the default system code page still stands.  For instance, MS
OE cannot display Korean, Japanese, Chinese, or Russian headers under
English/French/Spanish/Italian/German MS-Windows in _the message *list*
display pane_, which Mozilla can. (MS OE can display those headers for
individual messages, though.)

Not having checked out MS OE for a while, I was a bit confused about what is
possible and what is not. Anyway, my comment and michka's have *nothing*
to do with MS bashing. I was just giving what I believed to be facts,
one of which was not true, as it turned out.  Please note that Michael
(michka) Kaplan is, I guess, one of the last people on this list to say
something untrue just to make MS look bad. Of course, by this I'm not
implying by any means that there are some people who would do that on
this list.

Jungshik Shin




Re: extracting words

2001-02-12 Thread Jungshik Shin




On Sun, 11 Feb 2001, Mark Davis wrote:

 BTW, someone on this thread made this topic out to be even more complex than
 is: that Devanagari and Korean are written without spaces. While that may
 have been the case historically, I believe that the modern text does use
 spaces. Chinese, Japanese and Thai are the main languages written without
 spaces.

As I wrote earlier and you correctly believe, spaces are used to separate
words in Korean text. That has been the case at least since the Korean
Linguistic Society (KLS: Hangul Hakhoe) published the unified rules of
Korean orthography in 1933. The practice of using spaces must have been
predominant well before that, because otherwise the Korean Linguistic
Society might not have come up with that rule. The orthographic standards
of both North and South Korea agree on this point.  More details are
available at http://www.hangeul.or.kr, in Korean only. The full texts
of the various standards at the site (four orthographic standards: KLS
1933 and 1980, North Korea 1987, South Korea MOE 1988; transliteration of
foreign words in Hangul, South Korea MOE, 1985; and transcription of Korean
in the Roman alphabet) are only available in the format of HWP, one of the
most popular word processors in Korea. The files can be viewed with Namo HWP
viewer for MS-Windows at http://www.namo.co.kr/download/dwn_hwpv.html. People
in the US may find that the bottom of each page gets cropped if printed
directly from Namo HWP viewer, as the documents are made for A4 paper. A
workaround is to print to a file (using a PS printer driver) and use
ghostscript to print (using PDFWriter may do the same trick). If interested,
drop me a line off-line and I'll send a copy in either PDF or PS (resized to
better fit US letter paper if necessary).

Jungshik Shin




Re: international characters in email subject line

2001-02-12 Thread Sean O Seaghdha

On 12 Feb 2001, at 20:40, Jungshik Shin wrote
on the subject "Re: international characters in ema":

 I stand corrected(thank you for correcting me). It's possible to enter
 whatever script supported by IMEs installed on your system in both
 Subject(and other headers) and body of the message. However, what
 I wrote about the display of the headers in scripts NOT supported
 by the default system code page still stands.  For instance, MS
 OE cannot display  Korean, Japanese, Chinese, Russian headers under
 English/French/Spanish/Italian/German MS-Windows in _the message *list*
 display pane_, which Mozilla can. MS OE  can display those headers for
 individual messages.), though.

Thank you for your clarification.  MS OE doesn't show any chars outside the
system code page in the message list, only in the preview pane and message
windows.

`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~
 S e á n  Ó  S é a g h d h a   [EMAIL PROTECTED]

Ní bhíonn tréan buan.  Seanfhocal.
(The strong do not last.  Proverb.)




Re: international characters in email subject line

2001-02-12 Thread Michael (michka) Kaplan

From: "Jungshik Shin" [EMAIL PROTECTED]

 Please, note that  Michael (michka) Kaplan, I guess is, one of
 the last persons on this list to say something not true just to
 make MS look bad.

There are a few program managers in Office and Visual Studio who might
disagree with this statement -- they seem to think I live to bash Microsoft.
They are mistaken, sadly. But no company is above having their boneheaded
decisions called out, something not everyone there understands.

It's nice that you do, though. :-)

 Of course, by this I'm not implying by any means that there
 are some people who would do that on this list.

It's ok, we all know that such people exist; heck, we probably all know who
they are, too. As long as we don't name names, no one can claim to be offended
unless they have felon's guilt or something. :-)

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/




Re: international characters in email subject line

2001-02-12 Thread Sean O Seaghdha

On 12 Feb 2001, at 20:28, Alain LaBonté wrote
on the subject "Re: international characters in ema":

  At 19:53 01-02-12 -0800, Sean O Seaghdha wrote:
 
 Of course, since my preferred mail program is Pegasus Mail, which can only be
 configured for one character set, I can't usually read such headers anyway.

 [Alain]  Some years ago, I was also using Pegasus mail and I was not
 satisfied with this. I then communicated with the author directly (he lives
 in Southern New Zealand); we engaged in a series of exchanges and I made
 him accept to carry on the character set in use without conversion [in my
 case the Windows character set]... You have to use a parameter for this,
 this is the compromise he made me accept because he was really impressed by
 the SMTP 7-bit-only-headers dogma -- which does not impress me since it
 works any way with 8-bit-clean systems [predominant nowadays in the world
 since a serious security breach, I was told, was corrected with an
 8-bit-clean-enabling SMTP patch].

I think there are a couple of different issues here.  As far as storage on
disk goes, I think this changed some time back so that now you have to use
the switch to get the old behaviour (converting messages to one code page on
disk) which was retained for compatibility with the DOS version.

You can send 8-bit mail with Pegasus by changing a setting in Options, but
when you switch it on you get a stern warning about it being "formally
illegal" and a "Comments" header is added to each message.

I have suggested from time to time over the last few years for Pegasus to be
made Unicode aware, but I get the impression it's considered "too hard" or
"too complicated" although I don't think I've actually got a reply on this
from the author, David Harris.  Since there will not be another 16-bit
Windows version and the Macintosh version has not been updated in a long
time, this leaves only the DOS & Win32 versions.  Hopefully, this will mean
that Unicode will become more of a viable option for him in the future.  At
the moment, though, he seems quite busy enough adding HTML mail composition
to version 4.

`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~
 S e á n  Ó  S é a g h d h a   [EMAIL PROTECTED]

Calumnies are answered best with silence.  -- Ben Jonson.