[ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478712 ]
Karl Wettin commented on LUCENE-826:
------------------------------------

Ahhh, I could not let it go without some more tests. Added a bunch of languages, and it seems to work quite splendidly. Again, 10-fold cross-validation output on paragraphs 160+ characters long:

Time taken to build model: 45.51 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        5566               98.8808 %
Incorrectly Classified Instances        63                1.1192 %
Kappa statistic                          0.9874
Mean absolute error                      0.139
Root mean squared error                  0.2555
Relative absolute error                 93.6301 %
Root relative squared error             93.7791 %
Total Number of Instances             5629

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
  0.996     0.003     0.988      0.996     0.992       0.997     eng
  0.988     0         0.998      0.988     0.993       0.995     swe
  0.984     0.002     0.982      0.984     0.983       0.996     spa
  0.988     0         0.995      0.988     0.992       0.997     fre
  0.979     0.001     0.982      0.979     0.981       0.992     nld
  0.97      0.002     0.97       0.97      0.97        0.993     nor
  1         0         1          1         1           1         afr
  0.914     0.001     0.946      0.914     0.93        0.992     dan
  0.986     0.001     0.981      0.986     0.984       0.999     pot
  0.998     0.001     0.993      0.998     0.995       0.999     fin
  0.99      0.001     0.993      0.99      0.992       0.999     ita
  0.998     0         0.998      0.998     0.998       0.999     ger

=== Confusion Matrix ===

    a    b    c    d    e    f    g    h    i    j    k    l   <-- classified as
 1044    1    1    0    0    0    0    0    1    1    0    0 |    a = eng
    2  425    0    0    2    0    0    0    0    0    1    0 |    b = swe
    0    0  434    1    1    0    0    0    5    0    0    0 |    c = spa
    2    0    0  418    0    0    0    0    0    1    0    2 |    d = fre
    4    0    2    0  333    0    0    0    0    0    1    0 |    e = nld
    1    0    0    0    0  322    0    7    1    0    1    0 |    f = nor
    0    0    0    0    0    0  230    0    0    0    0    0 |    g = afr
    1    0    0    0    2   10    0  139    0    0    0    0 |    h = dan
    0    0    5    0    0    0    0    0  362    0    0    0 |    i = pot
    0    0    0    0    0    0    0    1    0  440    0    0 |    j = fin
    2    0    0    0    1    0    0    0    0    1  417    0 |    k = ita
    1    0    0    1    0    0    0    0    0    0    0 1002 |    l = ger

root.addBranch("uralic");
root.addBranch("uralic", "fino-ugric");
root.addBranch("uralic", "ugric");
//root.addLanguage("hungarian", "ugric");
root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
//root.addLanguage("sami", "fino-ugric");
//root.addLanguage("estonian", "fino-ugric");
//root.addLanguage("livonian", "fino-ugric");

root.addBranch("proto-indo european");

root.addBranch("proto-indo european", "italic");
root.addBranch("italic", "latino-faliscan");
root.addBranch("latino-faliscan", "latin");
root.addLanguage("latin", "ita", "italian", "it", "Italia");
root.addLanguage("latin", "fre", "french", "fr", "France");
root.addLanguage("latin", "pot", "portugese", "pt", "Portugal");
root.addLanguage("latin", "spa", "spanish", "es", "Espa%C3%B1a");

root.addBranch("proto-indo european", "germanic");
root.addBranch("germanic", "northern germanic");
root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");

root.addBranch("germanic", "west germanic");
root.addLanguage("west germanic", "eng", "english", "en", "UK");
root.addLanguage("west germanic", "ger", "german", "de", "Deutschland");

root.addBranch("west germanic", "middle dutch");
root.addLanguage("middle dutch", "nld", "dutch", "nl", "Nederland");
root.addLanguage("middle dutch", "afr", "afrikaans", "af", "Nederland");


> Language detector
> -----------------
>
>                 Key: LUCENE-826
>                 URL: https://issues.apache.org/jira/browse/LUCENE-826
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Karl Wettin
>         Assigned To: Karl Wettin
>         Attachments: ld.tar.gz
>
>
> A formula 1A token/ngram-based language detector. Requires a paragraph of
> text to avoid false positive classifications.
> Depends on contrib/analyzers/ngrams for tokenization, Weka for classification
> (logistic support vector models), feature selection, and normalization of token
> frequencies. Optionally Wikipedia and NekoHTML for training data harvesting.
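
To make the tokenization and frequency normalization step above concrete, here is a minimal sketch in plain Java. It is only an illustration under assumptions: it does not use the contrib/analyzers/ngrams tokenizer or Weka, and the class name NGramProfile and its method are hypothetical, not part of the attached patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not from the patch: builds a character n-gram
// profile of a text with counts normalized to relative frequencies.
class NGramProfile {

    /** Returns each character n-gram of length n mapped to its share of all n-grams in the text. */
    public static Map<String, Double> frequencies(String text, int n) {
        Map<String, Double> freq = new LinkedHashMap<>();
        int total = text.length() - n + 1; // number of n-grams in the text
        if (n <= 0 || total <= 0) {
            return freq; // text too short to contain a single n-gram
        }
        for (int i = 0; i < total; i++) {
            // count one occurrence of the n-gram starting at position i
            freq.merge(text.substring(i, i + n), 1.0, Double::sum);
        }
        // normalize raw counts so that all values sum to 1
        freq.replaceAll((gram, count) -> count / total);
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(frequencies("banana", 2));
    }
}
```

A classifier would then treat each n-gram as a feature and its relative frequency as the feature value; how the patch actually maps these into Weka instances is not shown in this message.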
> Initialized like this:
> {code}
> LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
> root.addBranch("uralic");
> root.addBranch("fino-ugric", "uralic");
> root.addBranch("ugric", "uralic");
> root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
> root.addBranch("proto-indo european");
> root.addBranch("germanic", "proto-indo european");
> root.addBranch("northern germanic", "germanic");
> root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
> root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
> root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
> root.addBranch("west germanic", "germanic");
> root.addLanguage("west germanic", "eng", "english", "en", "UK");
> root.mkdirs();
> LanguageClassifier classifier = new LanguageClassifier(root);
> if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
>   classifier.compileTrainingData(); // from wikipedia
> }
> classifier.buildClassifier();
> {code}
> The training set built from Wikipedia consists of the pages describing the home
> country of each registered language, in the language to train. The above example
> passes this test:
> (testEquals is the same as assertEquals, just not required. Only one of them
> fails, see comment.)
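
The testEquals helper itself is not shown in this message. A "checked but not required" comparison could be sketched roughly as below; the class SoftAssert and its methods are hypothetical names chosen for illustration, not the implementation in the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of a non-fatal equality check: a mismatch is
// recorded and reported, but does not abort the test like assertEquals.
class SoftAssert {

    private final List<String> failures = new ArrayList<>();

    /** Compares the values, records a message on mismatch, and returns whether they matched. */
    public boolean testEquals(Object expected, Object actual) {
        boolean matches = Objects.equals(expected, actual);
        if (!matches) {
            failures.add("expected <" + expected + "> but was <" + actual + ">");
        }
        return matches;
    }

    /** Returns all mismatches recorded so far, in the order they occurred. */
    public List<String> failures() {
        return failures;
    }
}
```

With a helper like this, the hard assertEquals calls guard the cases that must hold, while the soft checks only collect mismatches for later inspection.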
> {code}
> assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
> testEquals("swe", classifier.classify(norway_in_swedish).getISO());
> testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
> testEquals("swe", classifier.classify(finland_in_swedish).getISO());
> testEquals("swe", classifier.classify(uk_in_swedish).getISO());
>
> testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
> assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
> testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
> testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
> testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
>
> testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
> testEquals("fin", classifier.classify(norway_in_finnish).getISO());
> testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
> assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
> testEquals("fin", classifier.classify(uk_in_finnish).getISO());
>
> testEquals("dan", classifier.classify(sweden_in_danish).getISO());
> // it is ok that this fails. dan and nor are very similar, and the
> // document about norway in danish is very small.
> testEquals("dan", classifier.classify(norway_in_danish).getISO());
> assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
> testEquals("dan", classifier.classify(finland_in_danish).getISO());
> testEquals("dan", classifier.classify(uk_in_danish).getISO());
>
> testEquals("eng", classifier.classify(sweden_in_english).getISO());
> testEquals("eng", classifier.classify(norway_in_english).getISO());
> testEquals("eng", classifier.classify(denmark_in_english).getISO());
> testEquals("eng", classifier.classify(finland_in_english).getISO());
> assertEquals("eng", classifier.classify(uk_in_english).getISO());
> {code}
> I don't know how well it works on lots of languages, but this fits my needs
> for now.
> I'll try to do more work on considering the language trees when
> classifying.
> It takes a bit of time and RAM to build the training data, so the patch
> contains a pre-compiled arff-file.