i suppose this is a really simple minded question but is there any way of
telling if an incoming chunk of text (say from a browser form) is
traditional or simplified chinese?
thanks.
Paul Hastings [EMAIL PROTECTED]
Director
Zhang Weiwu from Xiamen China
- Original Message -
From: "Paul Hastings" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, February 13, 2003 7:35 PM
Subject: traditional vs simplified chinese
> i suppose this is a really simple minded question but
Paul Hastings wrote:
> i suppose this is a really simple minded question but is
> there any way of telling if an incoming chunk of text
> (say from a browser form) is traditional or simplified
> chinese?
Please notice that the classification you want is not always meaningful.
E.g., what if the in
- Original Message -
From: "Paul Hastings" <[EMAIL PROTECTED]>
To: "Zhang Weiwu" <[EMAIL PROTECTED]>
Sent: Thursday, February 13, 2003 9:16 PM
Subject: Re: traditional vs simplified chinese
> >meaning "for" (wei in Mandarin pinyin) is th
On Thursday, February 13, 2003, at 07:18 AM, Marco Cimarosti wrote:
3) All other characters listed in Unihan.txt are *both*
"Traditional" and "Simplified".
Actually, this is not quite true. Even though the current set of
traditional/simplified data is much better than it's ever been, we
logically correct documents" that could contain
both characters:
- a bibliography containing books published in Mainland China and in Taiwan;
- an article about the Chinese writing system;
- a table of traditional vs. simplified Chinese characters;
- discussions on a Chinese newsgroup or ma
Hi, Paul,
On Thu, 13 Feb 2003, Zhang Weiwu wrote:
> - Original Message -
> From: "Paul Hastings" <[EMAIL PROTECTED]>
> To: "Zhang Weiwu" <[EMAIL PROTECTED]>
> Sent: Thursday, February 13, 2003 9:16 PM
> Subject: Re: traditional vs simplifie
> So I think Zhang Weiwu is suggesting a heuristic algorithm for
> discriminating a unicode text which is already known, or assumed to be, in
> Chinese.
well the site will deliver chinese content w/doublechecking browser locale,
etc. so yes, most likely chinese users.
>to encounter at least o
> Please notice that the classification you want is not always meaningful.
> E.g., what if the incoming text is in Spanish? Would you classify it as
> traditional or simplified Chinese?...
as spanish i guess. the website will deliver chinese content & with some
browser locale checking should be ok
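
A rough sketch of the browser locale check mentioned above, for illustration only. It assumes the browser sends an Accept-Language header and that "zh-CN" hints at simplified and "zh-TW" at traditional characters. That mapping is a common convention, not something the tags guarantee. The names TAG_TO_SCRIPT and script_hint are made up for this example, and q-values are ignored (the first matching tag wins).

```python
# Conventional (not guaranteed) reading of two common language tags.
TAG_TO_SCRIPT = {
    "zh-cn": "simplified",
    "zh-tw": "traditional",
}

def script_hint(accept_language):
    """Return 'simplified', 'traditional', or None from an Accept-Language value.

    Simplification: q-values are ignored; tags are checked in listed order.
    """
    for item in accept_language.split(","):
        # Drop any ';q=...' parameter and normalise case and whitespace.
        tag = item.split(";", 1)[0].strip().lower()
        if tag in TAG_TO_SCRIPT:
            return TAG_TO_SCRIPT[tag]
    return None  # no usable hint; fall back to inspecting the text itself
```

A hint of None (for example from a Spanish-only browser) just means the locale says nothing useful, so the text itself would have to be examined.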
On Thu, 13 Feb 2003 09:48:45 -0800 (PST), "Zhang Weiwu" wrote:
> Take it easy, if you find one 500B (the measure word) it is usually enough to
> say it is traditional Chinese, one 4E2A (measure word) is in simplified
> Chinese. They never happen together in a logically correct document.
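
A minimal sketch of the marker-character check described in the quote above, assuming the input is already known to be Chinese text in Unicode. The function name and the "mixed" and "unknown" results are additions for illustration; the quoted suggestion only covers the case where one of the two characters appears.

```python
# The two forms of the measure word named in the thread.
GE_TRADITIONAL = "\u500b"  # U+500B
GE_SIMPLIFIED = "\u4e2a"   # U+4E2A

def classify_by_marker(text):
    """Guess the script of Chinese text from a single marker pair."""
    has_trad = GE_TRADITIONAL in text
    has_simp = GE_SIMPLIFIED in text
    if has_trad and has_simp:
        return "mixed"        # both forms present; the heuristic cannot decide
    if has_trad:
        return "traditional"
    if has_simp:
        return "simplified"
    return "unknown"          # marker absent, e.g. very short or non-Chinese input
```

Short inputs will often come back as "unknown", since even a very frequent character is not guaranteed to occur in a few words of text.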
Marco i
Edward H Trager wrote:
> [...]
> If I were going to write such an algorithm, I would:
>
> * First, ensure that the incoming text stream to be classified was
>   sufficiently long to be probabilistically classifiable. In other
>   words, what's the shortest stream of Hanzi characters needed, on
Subject: RE: traditional vs simplified chinese
Paul wrote:
> To: Edward H Trager
> > Marco Cimarosti has questioned, why do you need to classify
> > text as being simplified or traditional?
>
> if i understand their needs correctly, it's to implement a
> search system with search phrases of either "type" of
> chinese--content would be in bot
On Thu, 13 Feb 2003, Andrew C. West wrote:
> On Thu, 13 Feb 2003 09:48:45 -0800 (PST), "Zhang Weiwu" wrote:
>
> > Take it easy, if you find one 500B (the measure word) it is usually enough to
> > say it is traditional Chinese, one 4E2A (measure word) is in simplified
> > Chinese. They never ha
On Fri, 14 Feb 2003, Paul Hastings wrote:
> > So I think Zhang Weiwu is suggesting a heuristic algorithm for
> > discriminating a unicode text which is already known, or assumed to be, in
> > Chinese.
>
> well the site will deliver chinese content w/doublechecking browser locale,
> etc. so yes, m
On Thu, 13 Feb 2003, Rick Cameron wrote:
> The Win32 API includes a function that can do this folding, on Windows
> NT/2000/XP: LCMapString, with the option LCMAP_SIMPLIFIED_CHINESE or
> LCMAP_TRADITIONAL_CHINESE.
>
> I know little about Chinese, but I have the impression that it is much more
>
> -Original Message-
> From: Edward H Trager [mailto:[EMAIL PROTECTED]]
>
> On Thu, 13 Feb 2003, Rick Cameron wrote:
>
> > The Win32 API includes a function that can do this folding, on Windows
> > NT/2000/XP: LCMapString, with the option LCMAP_SIMPLIFIED_CHINESE or
> > LCMAP_TRAD
I say live with it.
This happens in Japanese as well, and it gets even worse when searching
in romazi, European letters, because there are so many different ways of
spelling things, and all the Chinese borrow words mean and sound exactly
the same.
But when the whole point of the system is to s
"Andrew C. West" <[EMAIL PROTECTED]>
wrote on Friday, February 14, 2003 2:29 AM
Subject: Re: traditional vs simplified chinese
> On Thu, 13 Feb 2003 09:48:45 -0800 (PST), "Zhang Weiwu" wrote:
>
> > Take it easy, if you find one 500B (the measure word) it i
> > I know little about Chinese, but I have the impression that it is much more
> > common for several traditional characters to correspond to one simplified
> > character than vice versa. If that's true, it seems to me that it would make
> > most sense to fold to simplified.
>
> Hmmm ... Suppose I
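
To make the "fold to simplified" idea quoted above concrete, here is a small sketch that folds both the query and the document before comparing them. The four-entry table is purely illustrative and the function names are hypothetical; a usable table would be far larger. Two of the entries (U+767C and U+9AEE, both folding to U+53D1) show the many-to-one case, which is also where folding can produce false matches.

```python
# Illustrative traditional -> simplified table (tiny on purpose).
TRAD_TO_SIMP = {
    "\u500b": "\u4e2a",  # measure word ge4
    "\u9019": "\u8fd9",  # zhe4 'this'
    "\u767c": "\u53d1",  # two distinct traditional characters ...
    "\u9aee": "\u53d1",  # ... share one simplified form
}
_FOLD_TABLE = str.maketrans(TRAD_TO_SIMP)

def fold_to_simplified(text):
    """Replace every traditional character in the table with its simplified form."""
    return text.translate(_FOLD_TABLE)

def matches(query, document):
    """Substring search that ignores the traditional/simplified distinction."""
    return fold_to_simplified(query) in fold_to_simplified(document)
```

With this approach a search phrase typed in either form finds content stored in either form, at the price of no longer telling U+767C and U+9AEE apart.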
On Fri, 14 Feb 2003 01:23:42 -0800 (PST), "Zhang Weiwu" wrote:
> I never saw 500B and 4E2A together in the same printed document in the 20
> years I lived in China. (Well, minus the years when I could not read :) Unless
> you have an obvious reason to do so, to print a book with Traditional characters is
>
Andrew C. West scripsit:
> Interestingly, the dictionary quotes Zheng Xuan, writing in the 2nd century
> A.D., as stating that U+4E2A (the modern "simplified" form) is the correct form
> of the character, and that U+500B (the modern "traditional" form) is a vulgar
> substitute !
IIRC this is true
On Thu, 13 Feb 2003, Zhang Weiwu wrote:
>Take it easy, if you find one 500B (the measure word) it is usually enough to
>say it is traditional Chinese, one 4E2A (measure word) is in simplified
>Chinese. They never happen together in a logically correct document.
Others have already given examples
On Fri, 14 Feb 2003 07:45:44 -0800 (PST), Thomas Chan wrote:
> I think zhe4 'this' (simp U+8FD9 / trad U+9019) might be better for a very
> simple heuristic for modern text, since it occupies position #11 in at
> least one frequency list (compared to #15 for the above-cited ge4), and as
> far as I
Andrew C. West scripsit:
> On a related matter, I was wondering about language tagging for Chinese. "zh-CN"
> and "zh-TW" are used quite frequently, but what do they imply ?
They are usually (mis)used to mean "Mandarin, simplified characters" and
"Mandarin, traditional characters" respectively.
Haha, I just realized I stole my new sig from your page. Haha, neat!
--
New Norwegian (Nynorsk) is essentially the speech of Norwegian peasants
as mutilated by a schoolteacher with a poor understanding of Icelandic.
--Halldór Laxness, via B. Philip Jonsson
Swedish, Norwegian and Danish are actual