[ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478691 ]
Karl Wettin commented on LUCENE-826:
------------------------------------

Some performance numbers: using only paragraphs of 160+ characters as training data, I get these results from a 10-fold cross validation:

Time taken to build model: 2.12 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        1199               98.6831 %
Incorrectly Classified Instances        16                1.3169 %
Kappa statistic                          0.9814
Mean absolute error                      0.2408
Root mean squared error                  0.3173
Relative absolute error                 84.8251 %
Root relative squared error             84.235  %
Total Number of Instances             1215

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
  1         0.009     0.989      1          0.995      0.995     eng
  0.979     0.001     0.995      0.979      0.987      0.994     swe
  0.973     0.003     0.984      0.973      0.979      0.996     nor
  0.946     0.005     0.935      0.946      0.941      0.975     dan
  0.989     0         1          0.989      0.995      0.997     fin

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 562   0   0   0   0 |   a = eng
   3 183   0   1   0 |   b = swe
   1   0 183   4   0 |   c = nor
   1   1   3  87   0 |   d = dan
   1   0   0   1 184 |   e = fin

(A minimal sketch of how an evaluation like this can be run with Weka is appended after the quoted issue below.)

> Language detector
> -----------------
>
>                 Key: LUCENE-826
>                 URL: https://issues.apache.org/jira/browse/LUCENE-826
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Karl Wettin
>         Assigned To: Karl Wettin
>         Attachments: ld.tar.gz
>
>
> A formula 1A token/ngram-based language detector. Requires a paragraph of
> text to avoid false positive classifications.
> Depends on contrib/analyzers/ngrams for tokenization, and on Weka for
> classification (logistic support vector models), feature selection and
> normalization of token frequencies. Optionally uses Wikipedia and NekoHTML
> for training data harvesting.
> Initialized like this:
> {code}
> LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
> root.addBranch("uralic");
> root.addBranch("fino-ugric", "uralic");
> root.addBranch("ugric", "uralic");
> root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
> root.addBranch("proto-indo european");
> root.addBranch("germanic", "proto-indo european");
> root.addBranch("northern germanic", "germanic");
> root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
> root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
> root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
> root.addBranch("west germanic", "germanic");
> root.addLanguage("west germanic", "eng", "english", "en", "UK");
> root.mkdirs();
> LanguageClassifier classifier = new LanguageClassifier(root);
> if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
>   classifier.compileTrainingData(); // from wikipedia
> }
> classifier.buildClassifier();
> {code}
> The training set built from Wikipedia consists of the pages describing the home
> country of each registered language, in the language to train. The above example
> passes this test
> (testEquals is the same as assertEquals, just not required; only one of them
> fails, see comment):
> {code}
> assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
> testEquals("swe", classifier.classify(norway_in_swedish).getISO());
> testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
> testEquals("swe", classifier.classify(finland_in_swedish).getISO());
> testEquals("swe", classifier.classify(uk_in_swedish).getISO());
> testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
> assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
> testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
> testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
> testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
> testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
> testEquals("fin", classifier.classify(norway_in_finnish).getISO());
> testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
> assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
> testEquals("fin", classifier.classify(uk_in_finnish).getISO());
> testEquals("dan", classifier.classify(sweden_in_danish).getISO());
> // it is ok that this fails. dan and nor are very similar, and the
> // document about norway in danish is very small.
> testEquals("dan", classifier.classify(norway_in_danish).getISO());
> assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
> testEquals("dan", classifier.classify(finland_in_danish).getISO());
> testEquals("dan", classifier.classify(uk_in_danish).getISO());
> testEquals("eng", classifier.classify(sweden_in_english).getISO());
> testEquals("eng", classifier.classify(norway_in_english).getISO());
> testEquals("eng", classifier.classify(denmark_in_english).getISO());
> testEquals("eng", classifier.classify(finland_in_english).getISO());
> assertEquals("eng", classifier.classify(uk_in_english).getISO());
> {code}
> I don't know how well it works on lots of languages, but this fits my needs
> for now. I'll try to do more work on considering the language trees when
> classifying.
> It takes a bit of time and RAM to build the training data, so the patch
> contains a pre-compiled arff file.
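For reference, here is a minimal sketch of how a stratified 10-fold cross validation like the one above can be run with Weka's Evaluation API against the pre-compiled ARFF file. The classifier choice (SMO with logistic models fitted to its outputs), its options, the random seed and the ARFF path are assumptions for illustration only; they are not necessarily what produced the numbers quoted above:

{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;

public class LanguageCrossValidation {

  public static void main(String[] args) throws Exception {
    // Load the pre-compiled training data shipped with the patch;
    // the actual location under the language root may differ.
    Instances data = new Instances(
        new BufferedReader(new FileReader("trainingData.arff")));
    // Assume the class attribute (the language) is the last attribute.
    data.setClassIndex(data.numAttributes() - 1);

    // Support vector classifier with logistic models fitted to its outputs;
    // an assumed reading of "logistic support vector models" above.
    SMO classifier = new SMO();
    classifier.setBuildLogisticModels(true);

    // Stratified 10-fold cross validation with an arbitrary seed.
    Evaluation evaluation = new Evaluation(data);
    evaluation.crossValidateModel(classifier, data, 10, new Random(1));

    System.out.println(evaluation.toSummaryString());
    System.out.println(evaluation.toClassDetailsString());
    System.out.println(evaluation.toMatrixString());
  }
}
{code}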