[ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478712 ]
Karl Wettin commented on LUCENE-826:
------------------------------------
Ahhh, I could not let it go without some more tests. I added a bunch of languages,
and it seems to work quite splendidly. Again, 10-fold cross-validation output
on paragraphs of 160+ characters:
Time taken to build model: 45.51 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances     5566      98.8808 %
Incorrectly Classified Instances   63        1.1192 %
Kappa statistic                    0.9874
Mean absolute error                0.139
Root mean squared error            0.2555
Relative absolute error            93.6301 %
Root relative squared error        93.7791 %
Total Number of Instances          5629
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.996     0.003     0.988       0.996    0.992       0.997      eng
0.988     0         0.998       0.988    0.993       0.995      swe
0.984     0.002     0.982       0.984    0.983       0.996      spa
0.988     0         0.995       0.988    0.992       0.997      fre
0.979     0.001     0.982       0.979    0.981       0.992      nld
0.97      0.002     0.97        0.97     0.97        0.993      nor
1         0         1           1        1           1          afr
0.914     0.001     0.946       0.914    0.93        0.992      dan
0.986     0.001     0.981       0.986    0.984       0.999      pot
0.998     0.001     0.993       0.998    0.995       0.999      fin
0.99      0.001     0.993       0.99     0.992       0.999      ita
0.998     0         0.998       0.998    0.998       0.999      ger
=== Confusion Matrix ===
    a    b    c    d    e    f    g    h    i    j    k    l   <-- classified as
 1044    1    1    0    0    0    0    0    1    1    0    0 |   a = eng
    2  425    0    0    2    0    0    0    0    0    1    0 |   b = swe
    0    0  434    1    1    0    0    0    5    0    0    0 |   c = spa
    2    0    0  418    0    0    0    0    0    1    0    2 |   d = fre
    4    0    2    0  333    0    0    0    0    0    1    0 |   e = nld
    1    0    0    0    0  322    0    7    1    0    1    0 |   f = nor
    0    0    0    0    0    0  230    0    0    0    0    0 |   g = afr
    1    0    0    0    2   10    0  139    0    0    0    0 |   h = dan
    0    0    5    0    0    0    0    0  362    0    0    0 |   i = pot
    0    0    0    0    0    0    0    1    0  440    0    0 |   j = fin
    2    0    0    0    1    0    0    0    0    1  417    0 |   k = ita
    1    0    0    1    0    0    0    0    0    0    0 1002 |   l = ger
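
The per-class figures can be re-derived from the confusion matrix. The following is a minimal sketch in plain Java, not Weka code and not part of the attached patch; the class and method names are made up for illustration. It assumes Weka's layout, where rows are the actual class and columns the predicted class.

```java
/** Illustrative helper; not part of Weka or of ld.tar.gz. */
class ConfusionStats {

    // Rows: actual class. Columns: predicted class. Values copied from the output above.
    static final int[][] MATRIX = {
        {1044,   1,   1,   0,   0,   0,   0,   0,   1,   1,   0,    0}, // eng
        {   2, 425,   0,   0,   2,   0,   0,   0,   0,   0,   1,    0}, // swe
        {   0,   0, 434,   1,   1,   0,   0,   0,   5,   0,   0,    0}, // spa
        {   2,   0,   0, 418,   0,   0,   0,   0,   0,   1,   0,    2}, // fre
        {   4,   0,   2,   0, 333,   0,   0,   0,   0,   0,   1,    0}, // nld
        {   1,   0,   0,   0,   0, 322,   0,   7,   1,   0,   1,    0}, // nor
        {   0,   0,   0,   0,   0,   0, 230,   0,   0,   0,   0,    0}, // afr
        {   1,   0,   0,   0,   2,  10,   0, 139,   0,   0,   0,    0}, // dan
        {   0,   0,   5,   0,   0,   0,   0,   0, 362,   0,   0,    0}, // pot
        {   0,   0,   0,   0,   0,   0,   0,   1,   0, 440,   0,    0}, // fin
        {   2,   0,   0,   0,   1,   0,   0,   0,   0,   1, 417,    0}, // ita
        {   1,   0,   0,   1,   0,   0,   0,   0,   0,   0,   0, 1002}, // ger
    };

    /** Fraction of all instances that lie on the diagonal. */
    static double accuracy(int[][] m) {
        long correct = 0;
        long total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];
                }
            }
        }
        return (double) correct / total;
    }

    /** Recall (TP rate) of class c: diagonal cell divided by its row sum. */
    static double recall(int[][] m, int c) {
        long rowSum = 0;
        for (int v : m[c]) {
            rowSum += v;
        }
        return (double) m[c][c] / rowSum;
    }

    /** Precision of class c: diagonal cell divided by its column sum. */
    static double precision(int[][] m, int c) {
        long colSum = 0;
        for (int[] row : m) {
            colSum += row[c];
        }
        return (double) m[c][c] / colSum;
    }

    public static void main(String[] args) {
        int dan = 7; // row/column index of "h = dan"
        System.out.println("accuracy      = " + accuracy(MATRIX));
        System.out.println("dan recall    = " + recall(MATRIX, dan));
        System.out.println("dan precision = " + precision(MATRIX, dan));
    }
}
```

For Danish this gives a recall of 139/152 and a precision of 139/147, which round to the 0.914 and 0.946 reported above; the overall accuracy is 5566/5629.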
root.addBranch("uralic");
root.addBranch("uralic", "fino-ugric");
root.addBranch("uralic", "ugric");
//root.addLanguage("hungarian", "ugric");
root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
//root.addLanguage("sami", "fino-ugric");
//root.addLanguage("estonian", "fino-ugric");
//root.addLanguage("livonian", "fino-ugric");
root.addBranch("proto-indo european");
root.addBranch("proto-indo european", "italic");
root.addBranch("italic", "latino-faliscan");
root.addBranch("latino-faliscan", "latin");
root.addLanguage("latin", "ita", "italian", "it", "Italia");
root.addLanguage("latin", "fre", "french", "fr", "France");
root.addLanguage("latin", "pot", "portugese", "pt", "Portugal");
root.addLanguage("latin", "spa", "spanish", "es", "Espa%C3%B1a");
root.addBranch("proto-indo european", "germanic");
root.addBranch("germanic", "northern germanic");
root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
root.addBranch("germanic", "west germanic");
root.addLanguage("west germanic", "eng", "english", "en", "UK");
root.addLanguage("west germanic", "ger", "german", "de", "Deutschland");
root.addBranch("west germanic", "middle dutch");
root.addLanguage("middle dutch", "nld", "dutch", "nl", "Nederland");
root.addLanguage("middle dutch", "afr", "afrikaans", "af", "Nederland");
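
Most of the Danish errors in the confusion matrix go to Norwegian and vice versa (10 and 7 instances), and those two languages are siblings in this tree. As a sketch of how tree proximity could be measured, the class below is a hypothetical stand-in for the patch's LanguageRoot: it models only the tree shape, takes the parent as the first argument as in the listing above, and identifies a language by its three-letter code alone. The `distance` method is an assumption for illustration, not something the patch is known to provide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the patch's LanguageRoot; it models only the tree shape. */
class LanguageTree {
    private static final String ROOT = "<root>";

    // Maps every branch name and language code to the name of its parent node.
    private final Map<String, String> parentOf = new HashMap<>();

    /** Adds a top-level branch directly under the root. */
    void addBranch(String name) {
        parentOf.put(name, ROOT);
    }

    /** Adds a branch below an existing branch (parent first). */
    void addBranch(String parent, String name) {
        link(parent, name);
    }

    /** Adds a language, identified here only by its three-letter code. */
    void addLanguage(String parent, String iso) {
        link(parent, iso);
    }

    private void link(String parent, String child) {
        if (!parentOf.containsKey(parent)) {
            throw new IllegalArgumentException("unknown branch: " + parent);
        }
        parentOf.put(child, parent);
    }

    /** Returns the node itself followed by its ancestors, ending with the root. */
    private List<String> pathToRoot(String node) {
        if (!parentOf.containsKey(node)) {
            throw new IllegalArgumentException("unknown node: " + node);
        }
        List<String> path = new ArrayList<>();
        for (String n = node; n != null; n = parentOf.get(n)) {
            path.add(n);
        }
        return path;
    }

    /** Number of edges between two nodes via their lowest common ancestor. */
    int distance(String a, String b) {
        List<String> pathA = pathToRoot(a);
        List<String> pathB = pathToRoot(b);
        for (int i = 0; i < pathA.size(); i++) {
            int j = pathB.indexOf(pathA.get(i));
            if (j >= 0) {
                return i + j;
            }
        }
        throw new IllegalStateException("unreachable: both paths end at the root");
    }

    /** Builds a subset of the tree from the listing. */
    static LanguageTree sample() {
        LanguageTree t = new LanguageTree();
        t.addBranch("uralic");
        t.addBranch("uralic", "fino-ugric");
        t.addLanguage("fino-ugric", "fin");
        t.addBranch("proto-indo european");
        t.addBranch("proto-indo european", "germanic");
        t.addBranch("germanic", "northern germanic");
        t.addLanguage("northern germanic", "dan");
        t.addLanguage("northern germanic", "nor");
        t.addBranch("germanic", "west germanic");
        t.addLanguage("west germanic", "eng");
        return t;
    }

    public static void main(String[] args) {
        LanguageTree t = sample();
        System.out.println("dan-nor: " + t.distance("dan", "nor"));
        System.out.println("dan-eng: " + t.distance("dan", "eng"));
        System.out.println("dan-fin: " + t.distance("dan", "fin"));
    }
}
```

In this subset dan and nor are two edges apart, dan and eng four, and dan and fin seven, so a misclassification between close relatives could be weighted differently from one across families.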
> Language detector
> -----------------
>
> Key: LUCENE-826
> URL: https://issues.apache.org/jira/browse/LUCENE-826
> Project: Lucene - Java
> Issue Type: New Feature
> Reporter: Karl Wettin
> Assigned To: Karl Wettin
> Attachments: ld.tar.gz
>
>
> A formula 1A token/ngram-based language detector. Requires a paragraph of
> text to avoid false positive classifications.
> Depends on contrib/analyzers/ngrams for tokenization, and on Weka for classification
> (logistic support vector models), feature selection, and normalization of token
> frequencies. Optionally Wikipedia and NekoHTML for training data harvesting.
> Initialized like this:
> {code}
> LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
> root.addBranch("uralic");
> root.addBranch("fino-ugric", "uralic");
> root.addBranch("ugric", "uralic");
> root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
> root.addBranch("proto-indo european");
> root.addBranch("germanic", "proto-indo european");
> root.addBranch("northern germanic", "germanic");
> root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
> root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
> root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
> root.addBranch("west germanic", "germanic");
> root.addLanguage("west germanic", "eng", "english", "en", "UK");
> root.mkdirs();
> LanguageClassifier classifier = new LanguageClassifier(root);
> if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
>   classifier.compileTrainingData(); // from wikipedia
> }
> classifier.buildClassifier();
> {code}
> The training set is built from the Wikipedia pages that describe the home country
> of each registered language, written in the language being trained. The example
> above passes this test:
> (testEquals is the same as assertEquals, but is not required to pass. Only one of
> them fails; see the comment.)
> {code}
> assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
> testEquals("swe", classifier.classify(norway_in_swedish).getISO());
> testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
> testEquals("swe", classifier.classify(finland_in_swedish).getISO());
> testEquals("swe", classifier.classify(uk_in_swedish).getISO());
> testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
> assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
> testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
> testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
> testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
> testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
> testEquals("fin", classifier.classify(norway_in_finnish).getISO());
> testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
> assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
> testEquals("fin", classifier.classify(uk_in_finnish).getISO());
> testEquals("dan", classifier.classify(sweden_in_danish).getISO());
> // It is OK that this fails: dan and nor are very similar, and the
> // document about Norway in Danish is very small.
> testEquals("dan", classifier.classify(norway_in_danish).getISO());
> assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
> testEquals("dan", classifier.classify(finland_in_danish).getISO());
> testEquals("dan", classifier.classify(uk_in_danish).getISO());
> testEquals("eng", classifier.classify(sweden_in_english).getISO());
> testEquals("eng", classifier.classify(norway_in_english).getISO());
> testEquals("eng", classifier.classify(denmark_in_english).getISO());
> testEquals("eng", classifier.classify(finland_in_english).getISO());
> assertEquals("eng", classifier.classify(uk_in_english).getISO());
> {code}
> I don't know how well it works on lots of languages, but this fits my needs
> for now. I'll try to do more work on considering the language trees when
> classifying.
> It takes a bit of time and RAM to build the training data, so the patch
> contains a pre-compiled arff-file.
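
The quoted description says testEquals behaves like assertEquals but is not required to pass. The patch's own helper is not shown in this message, so the following is only a sketch of one way such an optional check could look; the class name SoftChecks and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical helper; the patch's own testEquals is not shown in this message. */
class SoftChecks {
    private final List<String> failures = new ArrayList<>();

    /** Required check: throws as soon as the values differ. */
    static void assertEquals(Object expected, Object actual) {
        if (!Objects.equals(expected, actual)) {
            throw new AssertionError("expected " + expected + " but was " + actual);
        }
    }

    /** Optional check: records a mismatch and lets the test continue. */
    boolean testEquals(Object expected, Object actual) {
        boolean same = Objects.equals(expected, actual);
        if (!same) {
            failures.add("expected " + expected + " but was " + actual);
        }
        return same;
    }

    /** Mismatches recorded so far by testEquals. */
    List<String> failures() {
        return failures;
    }

    public static void main(String[] args) {
        SoftChecks checks = new SoftChecks();
        assertEquals("swe", "swe");       // required, passes
        checks.testEquals("dan", "nor");  // optional, recorded instead of thrown
        System.out.println(checks.failures().size() + " optional check(s) failed");
    }
}
```

Collecting the optional failures in a list makes it possible to report them at the end of a run without failing the build.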