[ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055374#comment-13055374 ]
Jan Høydahl commented on LUCENE-826:
------------------------------------

Reviving this issue. It would be interesting to arrive at a proposal on whether this code could replace Tika's existing languageIdentifier. We still need to solve the case of small texts. I'm thinking of a hybrid solution where we fall back to a dictionary-based detector for small texts, i.e. one based on Ooo dictionaries.

> Language detector
> -----------------
>
>                 Key: LUCENE-826
>                 URL: https://issues.apache.org/jira/browse/LUCENE-826
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>         Attachments: ld.tar.gz, ld.tar.gz
>
>
> A formula 1A token/ngram-based language detector. Requires a paragraph of
> text to avoid false positive classifications.
> Depends on contrib/analyzers/ngrams for tokenization, Weka for classification
> (logistic support vector models), feature selection and normalization of token
> frequencies. Optionally Wikipedia and NekoHTML for training data harvesting.
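As a rough illustration of what "normalization of token frequencies" can mean for an n-gram detector, here is a minimal sketch. It assumes character n-grams and plain relative frequencies; the attached patch relies on the contrib/analyzers ngram tokenizers and on Weka, neither of which is used here, and the names `NgramProfile` and `frequencies` are hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of the attached patch.
class NgramProfile {

    /**
     * Returns the relative frequency of every character n-gram in the text.
     * The values sum to 1.0, so texts of different length become comparable.
     */
    static Map<String, Double> frequencies(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        Map<String, Double> profile = new TreeMap<>();
        String normalized = text.toLowerCase(Locale.ROOT);
        int total = normalized.length() - n + 1; // number of n-grams in the text
        if (total <= 0) {
            return profile; // text is shorter than one n-gram
        }
        // Count raw occurrences of each n-gram.
        for (int i = 0; i < total; i++) {
            profile.merge(normalized.substring(i, i + n), 1.0, Double::sum);
        }
        // Normalize counts to relative frequencies.
        profile.replaceAll((gram, count) -> count / total);
        return profile;
    }

    public static void main(String[] args) {
        // Print the bigram profile of a short sample word.
        System.out.println(frequencies("Sverige", 2));
    }
}
```

A profile like this is the kind of feature vector a classifier could be trained on. It also shows why short inputs are hard: a handful of n-grams gives a very sparse profile.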
> Initialized like this:
> {code}
> LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
> root.addBranch("uralic");
> root.addBranch("fino-ugric", "uralic");
> root.addBranch("ugric", "uralic");
> root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
> root.addBranch("proto-indo european");
> root.addBranch("germanic", "proto-indo european");
> root.addBranch("northern germanic", "germanic");
> root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
> root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
> root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
> root.addBranch("west germanic", "germanic");
> root.addLanguage("west germanic", "eng", "english", "en", "UK");
> root.mkdirs();
> LanguageClassifier classifier = new LanguageClassifier(root);
> if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
>   classifier.compileTrainingData(); // from wikipedia
> }
> classifier.buildClassifier();
> {code}
> The training set built from Wikipedia consists of the pages describing the
> home country of each registered language, in the language to train. The above
> example passes this test:
> (testEquals is the same as assertEquals, just not required. Only one of them
> fails, see comment.)
> {code}
> assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
> testEquals("swe", classifier.classify(norway_in_swedish).getISO());
> testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
> testEquals("swe", classifier.classify(finland_in_swedish).getISO());
> testEquals("swe", classifier.classify(uk_in_swedish).getISO());
> testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
> assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
> testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
> testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
> testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
> testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
> testEquals("fin", classifier.classify(norway_in_finnish).getISO());
> testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
> assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
> testEquals("fin", classifier.classify(uk_in_finnish).getISO());
> testEquals("dan", classifier.classify(sweden_in_danish).getISO());
> // it is ok that this fails. dan and nor are very similar, and the
> // document about norway in danish is very small.
> testEquals("dan", classifier.classify(norway_in_danish).getISO());
> assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
> testEquals("dan", classifier.classify(finland_in_danish).getISO());
> testEquals("dan", classifier.classify(uk_in_danish).getISO());
> testEquals("eng", classifier.classify(sweden_in_english).getISO());
> testEquals("eng", classifier.classify(norway_in_english).getISO());
> testEquals("eng", classifier.classify(denmark_in_english).getISO());
> testEquals("eng", classifier.classify(finland_in_english).getISO());
> assertEquals("eng", classifier.classify(uk_in_english).getISO());
> {code}
> I don't know how well it works on lots of languages, but this fits my needs
> for now.
> I'll try to do more work on considering the language trees when
> classifying.
> It takes a bit of time and RAM to build the training data, so the patch
> contains a pre-compiled arff-file.
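The hybrid idea from the comment (fall back to a dictionary-based detector for small texts) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the class `HybridDetector`, the length threshold `minChars`, the word-set dictionaries, and the `Function` standing in for the n-gram classifier are all hypothetical, and no real OOo dictionary format is parsed.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a hybrid detector; not part of the attached patch.
class HybridDetector {
    private final Map<String, Set<String>> dictionaries; // ISO code -> known lowercase words
    private final Function<String, String> statistical;  // stand-in for the n-gram classifier
    private final int minChars;                          // assumed threshold for "small" texts

    HybridDetector(Map<String, Set<String>> dictionaries,
                   Function<String, String> statistical,
                   int minChars) {
        this.dictionaries = dictionaries;
        this.statistical = statistical;
        this.minChars = minChars;
    }

    /** Uses the statistical classifier for long texts and the dictionaries for short ones. */
    String detect(String text) {
        if (text.length() >= minChars) {
            return statistical.apply(text);
        }
        String byDictionary = dictionaryLookup(text);
        // If no dictionary word matched, the statistical guess is all we have.
        return byDictionary != null ? byDictionary : statistical.apply(text);
    }

    /** Returns the language whose dictionary contains the most words of the text, or null. */
    String dictionaryLookup(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = null;
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : dictionaries.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) {
                    hits++;
                }
            }
            // Ties keep the first language seen, so tie order depends on the map used.
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Closely related languages such as dan and nor share many words, so a real fallback would probably need weighting or a minimum margin between the top two candidates rather than a bare hit count.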