Bug#777588: ITP: cld2 -- Compact Language Detector 2

Gianfranco Costamagna Tue, 10 Feb 2015 01:33:56 -0800

Package: wnpp
Severity: wishlist
Owner: Gianfranco Costamagna <costamagnagianfra...@yahoo.it>


* Package name    : cld2
  Version         : 0.0.0~svn193
  Upstream Author : Dick Sites dsi...@google.com
* URL             : https://code.google.com/p/cld2/
* License         : Apache-2.0
  Programming Lang: C++
  Description     : Compact Language Detector 2

CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text, either
plain text or HTML/XML. Legacy encodings must be converted to valid UTF-8 by
the caller. For mixed-language input, CLD2 returns the top three languages
found and their approximate percentages of the total text bytes (e.g. 80%
English and 20% French out of 1000 bytes of text means about 800 bytes of
English and 200 bytes of French). Optionally, it also returns a vector of text
spans with the language of each identified. This may be useful for applying
different spelling-correction dictionaries or different machine translation
requests to each span. The design target is web pages of at least 200
characters (about two sentences); CLD2 is not designed to do well on very
short text, lists of proper names, part numbers, etc.
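
To make the relation between the span vector and the reported percentages
concrete, here is a minimal sketch in Python. It is not CLD2's API: the
(language, byte_length) span representation and the function name top_three
are assumptions made for illustration only.

```python
from collections import Counter


def top_three(spans):
    """Summarise (language, byte_length) spans as the three most common
    languages with their approximate percentage of the total text bytes.

    The span representation is hypothetical, not CLD2's own data structure.
    """
    totals = Counter()
    for language, byte_length in spans:
        totals[language] += byte_length
    total_bytes = sum(totals.values())
    if total_bytes == 0:
        return []
    return [
        (language, round(100 * count / total_bytes))
        for language, count in totals.most_common(3)
    ]


# The example from the description: 800 bytes of English, 200 of French.
summary = top_three([("English", 500), ("French", 200), ("English", 300)])
```

With the 1000-byte example above, summary holds English at 80 and French at
20, matching the percentages quoted in the description.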

CLD2 is a Naïve Bayesian classifier, using one of three different token
algorithms. For Unicode scripts such as Greek and Thai that map one-to-one to
detected languages, the script defines the result. For the 80,000+ character
Han script and its CJK combination with Hiragana, Katakana, and Hangul
scripts, single letters (unigrams) are scored. For all other scripts,
sequences of four letters (quadgrams) are scored.
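
The three-way choice described above could be sketched roughly as follows.
This is an illustration only: it guesses the script from the first word of
the Unicode character name, which is a shortcut assumed here and not how CLD2
itself classifies scripts, and the function name token_algorithm is
hypothetical.

```python
import unicodedata

# Scripts that map one-to-one to a language (only the two named in the text).
SCRIPT_DECIDES = {"GREEK", "THAI"}
# Han and the scripts it combines with in CJK text.
UNIGRAM_SCRIPTS = {"CJK", "HIRAGANA", "KATAKANA", "HANGUL"}


def token_algorithm(letter):
    """Return which of the three token algorithms would apply to a letter.

    Assumption: the first word of the Unicode character name is used as a
    stand-in for the script. The input is expected to be a single letter.
    """
    script = unicodedata.name(letter, "").split(" ")[0]
    if script in SCRIPT_DECIDES:
        return "script"      # the script alone defines the result
    if script in UNIGRAM_SCRIPTS:
        return "unigram"     # single letters are scored
    return "quadgram"        # sequences of four letters are scored
```

For instance, a Greek alpha selects "script", a Han ideograph or Hangul
syllable selects "unigram", and Latin or Cyrillic letters select "quadgram".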

Scoring is done exclusively on lowercased Unicode letters and marks, after
expanding HTML entities &xyz; and after deleting digits, punctuation, and
<tags>. Quadgram word beginnings and endings (indicated here by underscore)
are explicitly used, so the word _look_ scores differently from the
word-beginning _look or the mid-word look. Quadgram single-letter "words" are
completely ignored. For each letter sequence, the scoring uses the 3-6 most
likely languages and their quantized log probabilities. The training corpus
is manually constructed from chosen web pages for each language, then
augmented by careful automated scraping of over 100M additional web pages.
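
A simplified sketch of the clean-up, quadgram extraction, and score summation
described above is given below. It is not CLD2's implementation: the function
names, the handling of two- and three-letter words, and the toy score table
(with made-up values) are all assumptions for illustration.

```python
import html
import re
import unicodedata
from collections import Counter

TAG_RE = re.compile(r"<[^>]*>")


def normalize(text):
    """Delete <tags>, expand HTML entities, lowercase, and keep only
    Unicode letters and marks. Returns the remaining words."""
    text = html.unescape(TAG_RE.sub(" ", text)).lower()
    kept = [
        ch if unicodedata.category(ch)[0] in "LM" else " "
        for ch in text
    ]
    return "".join(kept).split()


def quadgrams(word):
    """Four-letter windows, with '_' marking a word beginning or ending."""
    if len(word) < 2:
        return []                    # single-letter "words" are ignored
    if len(word) < 4:
        return ["_" + word + "_"]    # assumption: short words kept whole
    last = len(word) - 4
    grams = []
    for start in range(last + 1):
        gram = word[start:start + 4]
        if start == 0:
            gram = "_" + gram        # window begins the word
        if start == last:
            gram = gram + "_"        # window ends the word
        grams.append(gram)
    return grams


def score(grams, table):
    """Sum quantized log probabilities per language over all quadgrams."""
    totals = Counter()
    for gram in grams:
        for language, value in table.get(gram, {}).items():
            totals[language] += value
    return dict(totals)


# Toy table with invented values, only to show the shape of the lookup.
TOY_TABLE = {"_look_": {"en": 8, "nl": 3}, "_the_": {"en": 9}}
```

Here quadgrams("look") gives the whole-word token _look_, while
quadgrams("looking") starts with the word-beginning token _look, so the two
are looked up separately, as the description requires.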

Several embellishments improve the basic algorithm: additional scoring of
some sequences of two CJK letters or eight other letters; scoring some words
and word pairs that are distinctive within sets of statistically close
languages such as {Malay, Indonesian} or {Spanish, Portuguese, Galician};
removing repetitive sequences/words that would otherwise skew the scoring,
such as “jpg” in “foo.jpg bar.jpg baz.jpg”; and removing web-specific words
that convey almost no language information, such as page, link, click, td,
tr, copyright, wikipedia, http.
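
The last two embellishments can be illustrated with a deliberately naive
filter. The real repetition removal in CLD2 is more elaborate; the function
name squeeze and the max_repeats parameter are assumptions, and only the word
list itself comes from the description above.

```python
from collections import Counter

# Web-specific words named in the description.
WEB_WORDS = {"page", "link", "click", "td", "tr",
             "copyright", "wikipedia", "http"}


def squeeze(words, max_repeats=1):
    """Drop web-specific words and any word seen more than max_repeats times.

    A simplification for illustration, not CLD2's actual algorithm.
    """
    seen = Counter()
    kept = []
    for word in words:
        if word in WEB_WORDS:
            continue                 # conveys almost no language information
        seen[word] += 1
        if seen[word] > max_repeats:
            continue                 # repetition would skew the scoring
        kept.append(word)
    return kept
```

Applied to the words of "foo.jpg bar.jpg baz.jpg", this keeps "jpg" once
instead of three times.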

Several hints can be supplied. Because these can be inaccurate on web pages,
they are just hints -- they add a bias but do not force a specific language
to be the detection result. The hints include expected language, original
document encoding, document URL top-level domain name, and embedded
<…lang=xx …> language tags.
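
The difference between a bias and a forced result can be shown with a small
sketch. The score values, the boost size, and the function names are all
hypothetical; how CLD2 actually weights its hints is not stated here.

```python
def apply_hints(scores, hinted_languages, boost=2.0):
    """Add a fixed bias to hinted languages that already have a score.

    The boost value is an arbitrary assumption for illustration.
    """
    biased = dict(scores)
    for language in hinted_languages:
        if language in biased:
            biased[language] += boost
    return biased


def best_language(scores):
    """Pick the language with the highest score."""
    return max(scores, key=scores.get)


# A close call: the hint tips the result towards Portuguese.
close = apply_hints({"es": 10.0, "pt": 9.0}, ["pt"])
# A clear case: the same hint does not override strong evidence for Spanish.
clear = apply_hints({"es": 20.0, "pt": 9.0}, ["pt"])
```

In the close case best_language(close) is "pt", while in the clear case
best_language(clear) stays "es": the hint shifts the balance but cannot
decide the outcome on its own.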

The table-driven extraction of letter sequences and the table-driven scoring
are highly optimized for both space and speed, running about 10x faster than
other detectors and covering over 70 languages in 1.8MB of x86 code and
tables. The main quadgram lookup table consists of 256K four-byte entries,
covering about 50 languages. Detection over the average web page of 30KB
(half tags/digits/punctuation, half letters) takes roughly 1 ms on a current
x86 processor.
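
As a quick check of the figures above, the arithmetic can be written out as
follows. It assumes that "256K" means 256 * 1024 entries and that "30KB" is
a decimal kilobyte count; neither is stated explicitly in the description.

```python
def table_size_bytes(entries=256 * 1024, entry_bytes=4):
    """Size of the main quadgram lookup table, assuming 256K = 256 * 1024."""
    return entries * entry_bytes


def throughput_mb_per_s(page_kb=30, millis=1.0):
    """Kilobytes per millisecond is numerically megabytes per second
    (decimal units assumed)."""
    return page_kb / millis


main_table = table_size_bytes()      # 1048576 bytes, i.e. 1 MiB
rate = throughput_mb_per_s()         # 30.0 MB/s for a 30KB page in 1 ms
```

Under these assumptions the main quadgram table accounts for 1 MiB of the
quoted 1.8MB, and the quoted timing corresponds to roughly 30 MB of page
data per second.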

CLD2 is an update of the prior CLD, adding more languages, updating to
Unicode 6.2 characters, improving scoring, and adding the optional output
vector of labelled language spans.

These 83 languages are detected: Afrikaans Albanian Arabic Armenian
Azerbaijani Basque Belarusian Bengali Bihari Bulgarian Catalan Cebuano
Cherokee Croatian Czech Chinese Chinese_T Danish Dhivehi Dutch English
Estonian Finnish French Galician Ganda Georgian German Greek Gujarati
Haitian_Creole Hebrew Hindi Hmong Hungarian Icelandic Indonesian Inuktitut
Irish Italian Javanese Japanese Kannada Khmer Kinyarwanda Korean Laothian
Latvian Limbu Lithuanian Macedonian Malay Malayalam Maltese Marathi Nepali
Norwegian Oriya Persian Polish Portuguese Punjabi Romanian Russian
Scots_Gaelic Serbian Sinhalese Slovak Slovenian Spanish Swahili Swedish
Syriac Tagalog Tamil Telugu Thai Turkish Ukrainian Urdu Vietnamese Welsh
Yiddish.


Useful for the upcoming poedit 1.8 release.


