php-i18n Digest 11 Aug 2008 06:11:58 -0000 Issue 406

Topics (messages 1276 through 1277):

Re: intl merged into core
        1276 by: David Zülke

Language detection
        1277 by: Darren Cook

Administrivia:

To subscribe to the digest, e-mail:
        [EMAIL PROTECTED]

To unsubscribe from the digest, e-mail:
        [EMAIL PROTECTED]

To post to the list, e-mail:
        [EMAIL PROTECTED]


----------------------------------------------------------------------
--- Begin Message ---
On 14.07.2008, at 08:50, Stanislav Malyshev wrote:

> Hi!
>
>> http://pecl.php.net/bugs/bug.php?id=14263 should be fixed asap, as well as http://pecl.php.net/bugs/bug.php?id=14265.
>
> These are fixed now.

Cool!


>> http://pecl.php.net/bugs/bug.php?id=14266 might also be worthwhile to simply change before it gets widespread public attention :)
>
> I'll add it soon.

Well I was wondering if it wouldn't be wise to remove the current options altogether and replace them with the array variants... hence the comment "widespread public attention", as doing it now would only break four or five people's code ;)


David


--- End Message ---
--- Begin Message ---
I need to write (PHP) code to detect the language of a given block of
text. (For my purposes I initially want to distinguish between English,
Japanese, German, Simplified Chinese, Traditional Chinese, Arabic,
Korean and French.) I want it to be reliable, so my plan is to have a
list of Unicode code points found only in each given language [1] and
use that to return a high-confidence answer. If none are found, fall
back to a list of high-frequency words for each language [2] and use
that to return a lower-confidence answer.
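
Roughly, the first stage I have in mind would look something like the
sketch below. (The range table holds only a few illustrative entries, the
function name is just a placeholder, and it assumes ext/mbstring is
available for the UTF-8 handling.)

<?php
// Illustrative only: a few Unicode ranges that are (near-)unique to one
// language/script; a real table would need to be far more complete.
$uniqueRanges = array(
    'ja' => array(array(0x3040, 0x309F),   // Hiragana
                  array(0x30A0, 0x30FF)),  // Katakana
    'ko' => array(array(0xAC00, 0xD7AF)),  // Hangul syllables
    'ar' => array(array(0x0600, 0x06FF)),  // Arabic
);

// Return a language code as soon as a code point unique to it is seen,
// or null if nothing conclusive was found.
function detect_by_codepoints($text, $uniqueRanges)
{
    $len = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i < $len; $i++) {
        $char = mb_substr($text, $i, 1, 'UTF-8');
        // UTF-8 character -> numeric code point, via fixed-width UCS-4.
        $cp = hexdec(bin2hex(mb_convert_encoding($char, 'UCS-4BE', 'UTF-8')));
        foreach ($uniqueRanges as $lang => $ranges) {
            foreach ($ranges as $range) {
                if ($cp >= $range[0] && $cp <= $range[1]) {
                    return $lang;  // high-confidence hit
                }
            }
        }
    }
    return null;  // nothing unique found; fall back to word frequencies
}

The word-frequency stage would only run when that returns null.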

Like most of my i18n-related PHP code, I'll release it as BSD-licensed
open source. But I wondered whether something already exists that I
could build on. (Or comprehensive lists of Unicode code points used only
in certain languages; I have some small ad hoc lists, but the more I
have, the more useful the algorithm is.)

(I'm aware of letter-frequency techniques,
http://en.wikipedia.org/wiki/Letter_frequencies, but I haven't worked out
where they would ever be more useful than word analysis.)

Darren

[1]: E.g. the scharfes S (ß) for German, katakana/hiragana for Japanese
(also, http://en.wiktionary.org/wiki/Category:Japanese-only_CJKV_Characters ).
Arabic and Korean also have unique alphabets. Accented letters for French.

[2]: E.g. for English "the", "be", "to", etc.
http://en.wikipedia.org/wiki/Most_common_words_in_English
Same list for German:
http://de.wikipedia.org/wiki/Liste_der_h%C3%A4ufigsten_W%C3%B6rter_der_deutschen_Sprache
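
A rough sketch of that word-frequency fallback, again with only a handful
of illustrative entries per language (real lists would be taken from pages
like the ones above, and the function name is just a placeholder):

<?php
// Illustrative only: a few high-frequency words per language.
$commonWords = array(
    'en' => array('the', 'be', 'to', 'of', 'and'),
    'de' => array('der', 'die', 'und', 'in', 'den'),
    'fr' => array('le', 'de', 'et', 'la', 'les'),
);

// Return the language whose word list matches the most tokens in the
// text, or null if nothing matched at all.
function detect_by_words($text, $commonWords)
{
    $tokens = preg_split('/[^\pL]+/u', mb_strtolower($text, 'UTF-8'),
                         -1, PREG_SPLIT_NO_EMPTY);
    $best      = null;
    $bestScore = 0;
    foreach ($commonWords as $lang => $words) {
        $score = count(array_intersect($tokens, $words));
        if ($score > $bestScore) {
            $bestScore = $score;
            $best      = $lang;
        }
    }
    return $best;  // lower-confidence answer
}

The caller would try detect_by_codepoints() first and only fall through to
this when it returns null.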


-- 
Darren Cook, Software Researcher/Developer
http://dcook.org/mlsn/ (English-Japanese-German-Chinese-Arabic
                        open source dictionary/semantic network)
http://dcook.org/work/ (About me and my work)
http://darrendev.blogspot.com/ (blog on php, flash, i18n, linux, ...)

--- End Message ---
