[
https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Karl Wettin updated LUCENE-826:
-------------------------------
Attachment: ld.tar.gz
Added support for all the large modern Germanic, Balto-Slavic, Latin and some other
languages. I'll add the complete Indo-Iranian tree soon.
The test case gathers and classifies random pages from Wikipedia in the
target language. False positives only occur on articles that are too small
(again, at least 160 characters, about one paragraph, is required) or on articles
with very mixed language content (for instance an article that is mostly the
discography of a non-native band).
Documents with mixed languages could probably be handled at paragraph level,
reporting back that the document is in language A but contains paragraphs
(quotes, etc.) in languages B and C.
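Just to illustrate the idea, a minimal sketch of what paragraph-level handling could look like. It reuses classify(String).getISO() as in the test code further down and the 160 character minimum from above; the class and method names here are made up:
{code}
import java.util.HashMap;
import java.util.Map;

public class ParagraphLevelDetection {

  /** Classify each paragraph on its own and count the languages found. */
  public static Map<String, Integer> classifyPerParagraph(
      LanguageClassifier classifier, String document) throws Exception {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String paragraph : document.split("\\n\\s*\\n")) {
      if (paragraph.trim().length() < 160) {
        continue; // too short for a reliable classification
      }
      String iso = classifier.classify(paragraph).getISO();
      Integer count = counts.get(iso);
      counts.put(iso, count == null ? 1 : count + 1);
    }
    // e.g. {swe=12, eng=2}: a Swedish document containing English quotes
    return counts;
  }
}
{code}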
Supported languages (35):
swedish
danish
norwegian
icelandic
faroese
dutch
afrikaans
frisian
low german
german
english
latvian
lithuanian
russian
ukrainian
belarusian
czech
slovak
polish
bosnian
croatian
macedonian
bulgarian
slovenian
serbian
italian
spanish
french
portuguese
armenian
greek
hungarian
finnish
estonian
modern persian (farsi)
There are some languages in the training set that, due to low representation in
Wikipedia, also have problems with false positive classifications:
Faroese, with its 80 paragraphs (the mean is 600), gets some 60% false positives.
Macedonian, with its 150 paragraphs, gets 45% false positives, most often classified as Serbian.
Croatian is often confused with Bosnian.
Also, some of these South Slavic languages can be written in either the Cyrillic or
the Latin alphabet, and this is something I should consider a bit.
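One way to take the alphabet into account could be to detect the dominant script of a paragraph first and only let the classifier choose among languages written in that script. A sketch, not part of the patch:
{code}
/**
 * Counts Cyrillic versus Latin letters and returns the dominant script.
 * Only a rough check; real text can of course mix scripts in one paragraph.
 */
public static String dominantScript(String text) {
  int cyrillic = 0;
  int latin = 0;
  for (int i = 0; i < text.length(); i++) {
    char c = text.charAt(i);
    if (!Character.isLetter(c)) {
      continue;
    }
    if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CYRILLIC) {
      cyrillic++;
    } else {
      latin++; // treat all other letters as Latin for this rough check
    }
  }
  return cyrillic > latin ? "cyrillic" : "latin";
}
{code}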
All other languages are detected without any problems.
One simple way to reduce these false positives is to manually check the
training data. There are some <!-- html comments --> here and there. Hopefully
they are washed away by the feature selection.
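Rather than hoping the feature selection drops them, the comments could be washed out of the harvested paragraphs before tokenization. Just a sketch; the cleanup could equally live in the NekoHTML parsing step:
{code}
/** Strips <!-- html comments --> and collapses whitespace. */
public static String stripHtmlComments(String paragraph) {
  return paragraph.replaceAll("(?s)<!--.*?-->", " ")
                  .replaceAll("\\s+", " ")
                  .trim();
}
{code}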
Preparing the training data (downloading from Wikipedia, parsing, tokenizing) for
all these languages takes just a few minutes on my dual core, but the token
feature selection (selecting the 7000 most prominent tokens out of 65000, in
20000 paragraphs of text) takes 90 minutes and consumes something like 700MB
of heap.
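For reference, this is roughly what that selection step looks like when done with Weka directly. I have not tied this to the patch: InfoGainAttributeEval is just one reasonable evaluator, and allTokens.arff is a made-up name for the unreduced data.
{code}
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TokenFeatureSelection {
  public static void main(String[] args) throws Exception {
    Instances data = new DataSource("allTokens.arff").getDataSet();
    data.setClassIndex(data.numAttributes() - 1); // assuming the language class is the last attribute

    Ranker ranker = new Ranker();
    ranker.setNumToSelect(7000); // keep the 7000 most prominent tokens

    AttributeSelection selection = new AttributeSelection();
    selection.setEvaluator(new InfoGainAttributeEval());
    selection.setSearch(ranker);
    selection.SelectAttributes(data);

    Instances reduced = selection.reduceDimensionality(data);
    System.out.println(data.numAttributes() + " -> " + reduced.numAttributes() + " attributes");
  }
}
{code}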
Once the arff-file is created, the classifier takes 10 minutes to compile (the
support vectors), and once done it consumes no more than a handful of MB. It
could probably be serialized and dumped to disk for faster loading at startup
time.
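Since Weka classifiers implement java.io.Serializable, plain object serialization should be enough for that. A sketch, assuming the compiled Weka classifier can be reached from (or wrapped by) LanguageClassifier:
{code}
import java.io.*;
import weka.classifiers.Classifier;

public class ClassifierStore {

  public static void save(Classifier classifier, File file) throws IOException {
    ObjectOutputStream out = new ObjectOutputStream(
        new BufferedOutputStream(new FileOutputStream(file)));
    try {
      out.writeObject(classifier);
    } finally {
      out.close();
    }
  }

  public static Classifier load(File file) throws IOException, ClassNotFoundException {
    ObjectInputStream in = new ObjectInputStream(
        new BufferedInputStream(new FileInputStream(file)));
    try {
      return (Classifier) in.readObject();
    } finally {
      in.close();
    }
  }
}
{code}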
The time it takes to classify a document will of course depend on its size.
Wikipedia articles average out at about 500 ms.
For really speedy classification of very large texts one could switch to
REPTree instead of the SVM. It does the job 95% as well (given a big enough text),
but in 1% of the time, about 2 ms per classification. I still focus on 160-character
paragraphs though.
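If anyone wants to reproduce that comparison with Weka directly, something like this should do (weka.classifiers.functions.SMO and weka.classifiers.trees.REPTree are the Weka class names; timing on the training instances themselves only compares speed, not accuracy):
{code}
import weka.classifiers.Classifier;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierTiming {
  public static void main(String[] args) throws Exception {
    Instances data = new DataSource("trainingData.arff").getDataSet();
    data.setClassIndex(data.numAttributes() - 1); // assuming the language class is the last attribute

    for (Classifier classifier : new Classifier[] { new SMO(), new REPTree() }) {
      long start = System.currentTimeMillis();
      classifier.buildClassifier(data);
      long buildTime = System.currentTimeMillis() - start;

      start = System.currentTimeMillis();
      for (int i = 0; i < data.numInstances(); i++) {
        classifier.classifyInstance(data.instance(i));
      }
      double perInstance =
          (System.currentTimeMillis() - start) / (double) data.numInstances();

      System.out.println(classifier.getClass().getSimpleName()
          + ": built in " + buildTime + " ms, " + perInstance + " ms per classification");
    }
  }
}
{code}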
Next step is optimization. The current training data for the 35 languages is
25000 instances and 7000 attributes. That is an insane amount of data. Way too
much.
I think the CPU and RAM requirements can be reduced quite a bit by simply
making the number of training instances (paragraphs) per language more even:
500 per language. It is quite Gaussian right now, and that is wrong. Also,
selecting 100 attributes (tokens) per language for use in the SVM rather than 200
as now does not hurt the classification quality much, but would cut the time for
creating the training data and building the classifier to roughly sqrt(what it is now).
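Capping the paragraph count per language is trivial to do before the arff-file is written. A sketch, assuming the harvested data is held as a map from language to paragraphs (that layout is an assumption, not how the patch stores it):
{code}
import java.util.*;

public class TrainingBalancer {

  /** Caps every language at the same number of paragraphs, e.g. 500. */
  public static Map<String, List<String>> cap(
      Map<String, List<String>> paragraphsPerLanguage, int max) {
    Map<String, List<String>> capped = new HashMap<String, List<String>>();
    for (Map.Entry<String, List<String>> entry : paragraphsPerLanguage.entrySet()) {
      List<String> paragraphs = new ArrayList<String>(entry.getValue());
      Collections.shuffle(paragraphs); // do not keep only the first articles
      if (paragraphs.size() > max) {
        paragraphs = new ArrayList<String>(paragraphs.subList(0, max));
      }
      capped.put(entry.getKey(), paragraphs);
    }
    return capped;
  }
}
{code}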
For now I run on my 6 languages. It takes just a minute to download data from
Wikipedia, tokenize and build the classifier. And classification time is about
100ms on average for a Wikipedia article.
> Language detector
> -----------------
>
> Key: LUCENE-826
> URL: https://issues.apache.org/jira/browse/LUCENE-826
> Project: Lucene - Java
> Issue Type: New Feature
> Reporter: Karl Wettin
> Assigned To: Karl Wettin
> Attachments: ld.tar.gz, ld.tar.gz
>
>
> A formula 1A token/ngram-based language detector. Requires a paragraph of
> text to avoid false positive classifications.
> Depends on contrib/analyzers/ngrams for tokenization, and on Weka for classification
> (logistic support vector models), feature selection and normalization of token
> frequencies. Optionally uses Wikipedia and NekoHTML for training data harvesting.
> Initialized like this:
> {code}
> LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
> root.addBranch("uralic");
> root.addBranch("fino-ugric", "uralic");
> root.addBranch("ugric", "uralic");
> root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
> root.addBranch("proto-indo european");
> root.addBranch("germanic", "proto-indo european");
> root.addBranch("northern germanic", "germanic");
> root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
> root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
> root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
> root.addBranch("west germanic", "germanic");
> root.addLanguage("west germanic", "eng", "english", "en", "UK");
> root.mkdirs();
> LanguageClassifier classifier = new LanguageClassifier(root);
> if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
>   classifier.compileTrainingData(); // from wikipedia
> }
> classifier.buildClassifier();
> {code}
> The training set built from Wikipedia consists of the pages describing the home country of
> each registered language, in the language to train. The above example passes this
> test:
> (testEquals is the same as assertEquals, just not required. Only one of them
> fails, see comment.)
> {code}
> assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
> testEquals("swe", classifier.classify(norway_in_swedish).getISO());
> testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
> testEquals("swe", classifier.classify(finland_in_swedish).getISO());
> testEquals("swe", classifier.classify(uk_in_swedish).getISO());
> testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
> assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
> testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
> testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
> testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
> testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
> testEquals("fin", classifier.classify(norway_in_finnish).getISO());
> testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
> assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
> testEquals("fin", classifier.classify(uk_in_finnish).getISO());
> testEquals("dan", classifier.classify(sweden_in_danish).getISO());
> // it is ok that this fails. dan and nor are very similar, and the
> // document about norway in danish is very small.
> testEquals("dan", classifier.classify(norway_in_danish).getISO());
> assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
> testEquals("dan", classifier.classify(finland_in_danish).getISO());
> testEquals("dan", classifier.classify(uk_in_danish).getISO());
> testEquals("eng", classifier.classify(sweden_in_english).getISO());
> testEquals("eng", classifier.classify(norway_in_english).getISO());
> testEquals("eng", classifier.classify(denmark_in_english).getISO());
> testEquals("eng", classifier.classify(finland_in_english).getISO());
> assertEquals("eng", classifier.classify(uk_in_english).getISO());
> {code}
> I don't know how well it works on lots of languages, but this fits my needs
> for now. I'll try to do more work on considering the language trees when
> classifying.
> It takes a bit of time and RAM to build the training data, so the patch
> contains a pre-compiled arff-file.