[ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478691 ]
Karl Wettin commented on LUCENE-826:
------------------------------------

Some performance numbers: using only paragraphs of 160+ characters as training data, I get these results from a 10-fold cross validation:

Time taken to build model: 2.12 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        1199               98.6831 %
Incorrectly Classified Instances        16                1.3169 %
Kappa statistic                          0.9814
Mean absolute error                      0.2408
Root mean squared error                  0.3173
Relative absolute error                 84.8251 %
Root relative squared error             84.235  %
Total Number of Instances             1215

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
  1         0.009     0.989      1          0.995      0.995     eng
  0.979     0.001     0.995      0.979      0.987      0.994     swe
  0.973     0.003     0.984      0.973      0.979      0.996     nor
  0.946     0.005     0.935      0.946      0.941      0.975     dan
  0.989     0         1          0.989      0.995      0.997     fin

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 562   0   0   0   0 |   a = eng
   3 183   0   1   0 |   b = swe
   1   0 183   4   0 |   c = nor
   1   1   3  87   0 |   d = dan
   1   0   0   1 184 |   e = fin

(A minimal sketch of how an evaluation like this can be run with Weka is appended after the quoted issue below.)

> Language detector
> -----------------
>
>                 Key: LUCENE-826
>                 URL: https://issues.apache.org/jira/browse/LUCENE-826
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Karl Wettin
>         Assigned To: Karl Wettin
>         Attachments: ld.tar.gz
>
>
> A formula 1A token/ngram-based language detector. Requires a paragraph of
> text to avoid false positive classifications.
> Depends on contrib/analyzers/ngrams for tokenization, and on Weka for
> classification (logistic support vector models), feature selection and
> normalization of token frequencies. Optionally uses Wikipedia and NekoHTML
> for training data harvesting.
> Initialized like this:
> {code}
> LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
> root.addBranch("uralic");
> root.addBranch("fino-ugric", "uralic");
> root.addBranch("ugric", "uralic");
> root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
> root.addBranch("proto-indo european");
> root.addBranch("germanic", "proto-indo european");
> root.addBranch("northern germanic", "germanic");
> root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
> root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
> root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
> root.addBranch("west germanic", "germanic");
> root.addLanguage("west germanic", "eng", "english", "en", "UK");
> root.mkdirs();
> LanguageClassifier classifier = new LanguageClassifier(root);
> if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
>   classifier.compileTrainingData(); // from wikipedia
> }
> classifier.buildClassifier();
> {code}
> The training set built from Wikipedia consists of the pages describing the home
> country of each registered language, in the language to train. The above example
> passes this test
> (testEquals is the same as assertEquals, just not required; only one of them
> fails, see comment):
> {code}
> assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
> testEquals("swe", classifier.classify(norway_in_swedish).getISO());
> testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
> testEquals("swe", classifier.classify(finland_in_swedish).getISO());
> testEquals("swe", classifier.classify(uk_in_swedish).getISO());
> testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
> assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
> testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
> testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
> testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
> testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
> testEquals("fin", classifier.classify(norway_in_finnish).getISO());
> testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
> assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
> testEquals("fin", classifier.classify(uk_in_finnish).getISO());
> testEquals("dan", classifier.classify(sweden_in_danish).getISO());
> // it is ok that this fails. dan and nor are very similar, and the
> // document about norway in danish is very small.
> testEquals("dan", classifier.classify(norway_in_danish).getISO());
> assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
> testEquals("dan", classifier.classify(finland_in_danish).getISO());
> testEquals("dan", classifier.classify(uk_in_danish).getISO());
> testEquals("eng", classifier.classify(sweden_in_english).getISO());
> testEquals("eng", classifier.classify(norway_in_english).getISO());
> testEquals("eng", classifier.classify(denmark_in_english).getISO());
> testEquals("eng", classifier.classify(finland_in_english).getISO());
> assertEquals("eng", classifier.classify(uk_in_english).getISO());
> {code}
> I don't know how well it works on lots of languages, but this fits my needs
> for now. I'll try to do more work on considering the language trees when
> classifying.
> It takes a bit of time and RAM to build the training data, so the patch
> contains a pre-compiled arff file.
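For reference, here is a minimal sketch of how a stratified 10-fold cross validation like the one above can be run with Weka's Evaluation API against the pre-compiled ARFF file. The classifier choice (SMO with logistic models fitted to its outputs), its options, the random seed and the ARFF path are assumptions for illustration only; they are not necessarily what produced the numbers quoted above:

{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;

public class LanguageCrossValidation {

  public static void main(String[] args) throws Exception {
    // Load the pre-compiled training data shipped with the patch;
    // the actual location under the language root may differ.
    Instances data = new Instances(
        new BufferedReader(new FileReader("trainingData.arff")));
    // Assume the class attribute (the language) is the last attribute.
    data.setClassIndex(data.numAttributes() - 1);

    // Support vector classifier with logistic models fitted to its outputs;
    // an assumed reading of "logistic support vector models" above.
    SMO classifier = new SMO();
    classifier.setBuildLogisticModels(true);

    // Stratified 10-fold cross validation with an arbitrary seed.
    Evaluation evaluation = new Evaluation(data);
    evaluation.crossValidateModel(classifier, data, 10, new Random(1));

    System.out.println(evaluation.toSummaryString());
    System.out.println(evaluation.toClassDetailsString());
    System.out.println(evaluation.toMatrixString());
  }
}
{code}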