Language detector
-----------------

Key: LUCENE-826
URL: https://issues.apache.org/jira/browse/LUCENE-826
Project: Lucene - Java
Issue Type: New Feature
Reporter: Karl Wettin
Assigned To: Karl Wettin

A formula 1A token/ngram-based language detector. Requires a paragraph of text to avoid false positive classifications.

Depends on contrib/analyzers/ngrams for tokenization, and on Weka for classification (logistic support vector models), feature selection, and normalization of token frequencies. Optionally uses Wikipedia and NekoHTML for training data harvesting.

Initialized like this:

{code}
LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));

root.addBranch("uralic");
root.addBranch("fino-ugric", "uralic");
root.addBranch("ugric", "uralic");
root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");

root.addBranch("proto-indo european");
root.addBranch("germanic", "proto-indo european");
root.addBranch("northern germanic", "germanic");
root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
root.addBranch("west germanic", "germanic");
root.addLanguage("west germanic", "eng", "english", "en", "UK");

root.mkdirs();

LanguageClassifier classifier = new LanguageClassifier(root);
if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
  classifier.compileTrainingData(); // from wikipedia
}
classifier.buildClassifier();
{code}

The training set built from Wikipedia consists of the pages describing the home country of each registered language, written in the language being trained. The example above passes this test (testEquals is the same as assertEquals, except that a failure is tolerated; only one of them fails, see the comment):

{code}
assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
testEquals("swe", classifier.classify(norway_in_swedish).getISO());
testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
testEquals("swe", classifier.classify(finland_in_swedish).getISO());
testEquals("swe", classifier.classify(uk_in_swedish).getISO());

testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
testEquals("nor", classifier.classify(uk_in_norwegian).getISO());

testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
testEquals("fin", classifier.classify(norway_in_finnish).getISO());
testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
testEquals("fin", classifier.classify(uk_in_finnish).getISO());

testEquals("dan", classifier.classify(sweden_in_danish).getISO());
// it is ok that this fails. dan and nor are very similar, and the document about norway in danish is very small.
testEquals("dan", classifier.classify(norway_in_danish).getISO());
assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
testEquals("dan", classifier.classify(finland_in_danish).getISO());
testEquals("dan", classifier.classify(uk_in_danish).getISO());

testEquals("eng", classifier.classify(sweden_in_english).getISO());
testEquals("eng", classifier.classify(norway_in_english).getISO());
testEquals("eng", classifier.classify(denmark_in_english).getISO());
testEquals("eng", classifier.classify(finland_in_english).getISO());
assertEquals("eng", classifier.classify(uk_in_english).getISO());
{code}

I don't know how well it works on lots of languages, but this fits my needs for now. I'll try to do more work on considering the language trees when classifying.

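For readers wondering what a "not required" assertion such as testEquals in the test above might look like: the sketch below is one possible shape, not the helper from the patch. The class name SoftAssert and the getFailures method are hypothetical; the only behavior taken from the description is that a mismatch is noted rather than thrown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical stand-in for testEquals: records mismatches instead of throwing. */
class SoftAssert {
    private final List<String> failures = new ArrayList<>();

    /** Returns true when the values are equal; otherwise records a message and returns false. */
    public boolean testEquals(Object expected, Object actual) {
        if (Objects.equals(expected, actual)) {
            return true;
        }
        failures.add("expected <" + expected + "> but was <" + actual + ">");
        return false;
    }

    /** Returns a copy of the mismatch messages recorded so far. */
    public List<String> getFailures() {
        return new ArrayList<>(failures);
    }

    public static void main(String[] args) {
        SoftAssert soft = new SoftAssert();
        soft.testEquals("swe", "swe");
        soft.testEquals("dan", "nor"); // recorded, not thrown
        System.out.println(soft.getFailures());
    }
}
```

Collecting the messages makes it possible to report all tolerated misclassifications at the end of a run while the assertEquals calls still fail fast.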
It takes a bit of time and RAM to build the training data, so the patch contains a pre-compiled arff-file.
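
The description mentions n-gram tokenization and normalized token frequencies as the classifier's features. As a rough illustration of that idea only, here is a plain-Java sketch that counts character n-grams and divides by the total. It does not use the contrib/analyzers tokenizer or Weka; NgramProfile and relativeFrequencies are hypothetical names, and the normalization in the patch may differ.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: relative frequencies of character n-grams in a text. */
class NgramProfile {

    /** Counts every character n-gram of length n, then divides each count by the total. */
    static Map<String, Double> relativeFrequencies(String text, int n) {
        Map<String, Double> freq = new HashMap<>();
        if (n <= 0) {
            return freq;
        }
        String s = text.toLowerCase(Locale.ROOT);
        int total = s.length() - n + 1; // number of n-grams in the text
        if (total <= 0) {
            return freq; // text shorter than n: no n-grams
        }
        for (int i = 0; i < total; i++) {
            freq.merge(s.substring(i, i + n), 1.0, Double::sum);
        }
        for (Map.Entry<String, Double> e : freq.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(relativeFrequencies("Sverige", 2));
    }
}
```

Dividing by the total keeps short and long documents comparable, which is one reason a very small document (such as the one about Norway in Danish) gives the classifier less to work with.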