php-i18n Digest 13 Aug 2008 14:50:24 -0000 Issue 407
Topics (messages 1278 through 1282):
Re: Language detection
1278 by: Ed Batutis
1279 by: Jan Schneider
1280 by: Darren Cook
1281 by: Darren Cook
1282 by: Ed Batutis
----------------------------------------------------------------------
--- Begin Message ---
> I need to write (PHP) code to detect the language of a given block of
> text.
Your proposed approach is very simplistic and probably won't be extensible
if it works at all. Usually a statistical approach is taken using groups of
characters.
In any case, ICU has this. See
http://icu-project.org/userguide/charsetDetection.html
It has both charset and language detection. This is also available via Win32
and .NET APIs in case that helps at all.
If you roll your own you might want to be aware that there are a lot of
patents in this area.
=Ed
--- End Message ---
--- Begin Message ---
Quoting Darren Cook <[EMAIL PROTECTED]>:
> I need to write (PHP) code to detect the language of a given block of
> text. (For my purposes I want to initially distinguish between English,
> Japanese, German, Simplified Mandarin, Traditional Mandarin, Arabic,
> Korean, French) I want it to be reliable so my plan was to have a list
> of Unicode points only found in each given language [1], and use that to
> return a high confidence answer. If none found, then have a list of high
> frequency words for each language [2] and use that to return a lower
> confidence answer.
http://pear.php.net/package/Text_LanguageDetect
Jan.
--
Do you need professional PHP or Horde consulting?
http://horde.org/consulting/
--- End Message ---
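The two-tier plan quoted in the message above (script-specific code points first, common words second) can be sketched roughly as follows. This is only an illustration, written in Python for brevity; the same logic ports to PHP. The names SCRIPT_RANGES, COMMON_WORDS and detect_language are hypothetical, the code point ranges and word lists are assumed and far from complete, and the sketch makes no attempt to separate Simplified from Traditional Chinese, which share the same Unicode blocks.

```python
import re

# Assumed, illustrative ranges only; not an exhaustive or authoritative list.
SCRIPT_RANGES = {
    "ja": [(0x3040, 0x309F), (0x30A0, 0x30FF)],  # Hiragana, Katakana
    "ko": [(0xAC00, 0xD7AF), (0x1100, 0x11FF)],  # Hangul Syllables, Hangul Jamo
    "ar": [(0x0600, 0x06FF)],  # Arabic block (also used for Persian, Urdu, ...)
}

# Tiny hand-picked word lists, purely for illustration.
COMMON_WORDS = {
    "en": {"the", "and", "of", "is", "to", "in"},
    "de": {"der", "die", "und", "das", "ist", "nicht"},
    "fr": {"le", "la", "et", "les", "est", "des"},
}


def detect_language(text):
    """Return (language, confidence) where confidence is 'high', 'low' or 'none'."""
    # Tier 1: the first character that falls in a script-specific range decides.
    for ch in text:
        cp = ord(ch)
        for lang, ranges in SCRIPT_RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                return lang, "high"

    # Tier 2: count how many words appear in each language's common-word list.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for w in words if w in vocab)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None, "none"
    return best, "low"
```

Letting the first matching character decide keeps the sketch short, but mixed-script text would need a count per script instead.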
--- Begin Message ---
>> I need to write (PHP) code to detect the language of a given block of
>> text.
> In any case, ICU has this. See
>
> http://icu-project.org/userguide/charsetDetection.html
>
> It has both charset and language detection.
Thanks Ed. Unless I've misunderstood, this is just doing charset
detection, with language as a bonus when the charset implies it? If
someone is actually using this and can confirm it can tell the
difference between say English, French and German, all in UTF-8
encoding, please let me know.
Thanks,
Darren
--
Darren Cook, Software Researcher/Developer
http://dcook.org/mlsn/ (English-Japanese-German-Chinese-Arabic
open source dictionary/semantic network)
http://dcook.org/work/ (About me and my work)
--- End Message ---
--- Begin Message ---
> http://pear.php.net/package/Text_LanguageDetect
Thanks to both people who suggested this; it is just what I was hoping
to find (and what Google didn't turn up). I'll start by evaluating it,
and then roll my own on top of it if it isn't accurate enough.
Darren
--- End Message ---
--- Begin Message ---
> Thanks Ed. Unless I've misunderstood, this is just doing charset
> detection, with language as a bonus when the charset implies it?
That wouldn't be very useful. No, it uses recognizers for charset/language
combinations.
> difference between say English, French and German, all in UTF-8
> encoding, please let me know.
It does not have the data to do any UTF-8 language detection, but the
structure is in place.
You might want to consider adding data to their framework for what you want
to do. It isn't complicated. The most important thing you need is good
sample text in quantity so you can generate the n-gram probability table.
I believe the code was taken from Mozilla, so you might look there. Maybe
they've already done what you are looking for.
=Ed
--- End Message ---
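The last message mentions generating an n-gram probability table from sample text. A minimal sketch of that idea, in Python for brevity (it ports to PHP), might look like the following. It is not the data format or algorithm used by ICU or Mozilla; the function names, the trigram size, the floor value for unseen n-grams and the one-sentence SAMPLES are all assumptions made for illustration, and real use needs sample text in quantity.

```python
import math
from collections import Counter

# Toy training text (assumed); far too small for real-world accuracy.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is sleeping in the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft in dem haus",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et le chat dort dans la maison",
}


def ngrams(text, n=3):
    """Split text into overlapping character n-grams, padded with spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_table(sample_text, n=3):
    """Map each n-gram in the sample to its relative frequency."""
    counts = Counter(ngrams(sample_text, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def score(text, table, n=3, floor=1e-6):
    """Sum of log probabilities; unseen n-grams get the small floor value."""
    return sum(math.log(table.get(gram, floor)) for gram in ngrams(text, n))


def classify(text, tables, n=3):
    """Return the language whose table gives the text the highest score."""
    return max(tables, key=lambda lang: score(text, tables[lang], n))


tables = {lang: build_table(sample) for lang, sample in SAMPLES.items()}
```

Calling classify("the dog is in the house", tables) scores the text against each table and returns the best-scoring language key. The fixed floor is a crude stand-in for proper smoothing.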