[ https://issues.apache.org/jira/browse/OPENNLP-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14860187#comment-14860187 ]
Joern Kottmann commented on OPENNLP-777:
----------------------------------------

Great. Here is the command I used to train it:

bin/opennlp DoccatTrainer.leipzig -sentencesDir /home/blue/Documents/langtrain/ -model langid-ngram.bin -lang mul -params lang/ml/NaiveBayesTrainerParams.txt

And here are the files I used:

afr_web_2013_100K-sentences.txt        lit_newscrawl_2011_100K-sentences.txt
ara_web_2011_100K-sentences.txt        mal_newscrawl_2011_100K-sentences.txt
bak_newscrawl_2011_100K-sentences.txt  mar_newscrawl_2011_100K-sentences.txt
bel_news_2011_100K-sentences.txt       mkd_newscrwal_2011_100K-sentences.txt
ben_newscrawl_2011_100K-sentences.txt  mlt_web_2012_100K-sentences.txt
bos_newscrawl_2011_100K-sentences.txt  mri_web_2011_100K-sentences.txt
bul_newscrawl_2011_100K-sentences.txt  msa_newscrwal_2011_100K-sentences.txt
cat_newscrawl_2011_100K-sentences.txt  nep_news_2010_100K-sentences.txt
ces_web_2012_100K-sentences.txt        nld_mixed_2012_100K-sentences.txt
cmn_wikipedia_2012_100K-sentences.txt  nob_news_2013_100K-sentences.txt
dan_mixed_2014_100K-sentences.txt      pol_newscrawl_2011_100K-sentences.txt
deu_news_2010_100K-sentences.txt       por_newscrawl_2011_100K-sentences.txt
ell_web_2011_100K-sentences.txt        pus_newscrawl_2011_100K-sentences.txt
eng_news_2010_100K-sentences.txt       ron_web_2011_100K-sentences.txt
epo_web_2012_100K-sentences.txt        rus_news_2010_100K-sentences.txt
est_newscrawl_2011_100K-sentences.txt  slk_newscrawl_2011_100K-sentences.txt
eus_newscrawl_2012_100K-sentences.txt  slv_newscrawl_2011_100K-sentences.txt
fao_web_2013_100K-sentences.txt        som_newscrawl_2011_100K-sentences.txt
fas_newscrawl_2011_100K-sentences.txt  spa_news_2011_100K-sentences.txt
fin_newscrawl_2011_100K-sentences.txt  srp_wikipedia_2010_100K-sentences.txt
fra_news_2010_100K-sentences.txt       swe_news_2007_100K-sentences.txt
glg_wikipedia_2012_100K-sentences.txt  tam_newscrawl_2011_100K-sentences.txt
hin_newscrawl_2012_100K-sentences.txt  tat_mixed_2015_100K-sentences.txt
hrv_newscrawl_2011_100K-sentences.txt  tel_newscrawl_2011_100K-sentences.txt
hun_mixed_2012_100K-sentences.txt      tgk_newscrawl_2011_100K-sentences.txt
hye_newscrawl_2011_100K-sentences.txt  tgl_newscrwal_2011_100K-sentences.txt
ind_web_2012_100K-sentences.txt        tha_newscrawl_2011_100K-sentences.txt
isl_newscrawl_2011_100K-sentences.txt  tur_newscrawl_2011_100K-sentences.txt
ita_web_2011_100K-sentences.txt        ukr_web_2012_100K-sentences.txt
jpn_news_2005-2008_100K-sentences.txt  urd_newscrwal_2011_100K-sentences.txt
kat_newscrawl_2011_100K-sentences.txt  uzb_newscrawl_2011_100K-sentences.txt
kaz_newscrawl_2011_100K-sentences.txt  vie_newscrwal_2011_100K-sentences.txt
kir_newscrawl_2011_100K-sentences.txt  vol_wikipedia_2011_100K-sentences.txt
kor_news_2007_100K-sentences.txt       zho_news_2007-2009_100K-sentences.txt
lav_newscrawl_2011_100K-sentences.txt  zul_mixed_2013_100K-sentences.txt

You can download them from here:
http://corpora2.informatik.uni-leipzig.de/download.html

The resulting language detection works rather well for texts that have at least a few words.

> Naive Bayesian Classifier
> -------------------------
>
> Key: OPENNLP-777
> URL: https://issues.apache.org/jira/browse/OPENNLP-777
> Project: OpenNLP
> Issue Type: New Feature
> Components: Machine Learning
> Environment: J2SE 1.5 and above
> Reporter: Cohan Sujay Carlos
> Assignee: Tommaso Teofili
> Priority: Minor
> Labels: NBClassifier, bayes, bayesian, classifier, multinomial,
> naive, patch
> Attachments: D1TopicClassifierTrainingDemoNB.java,
> D1TopicClassifierUsageDemoNB.java, NaiveBayesCorrectnessTest.java,
> naive-bayesian-classifier-for-opennlp-1.6.0-rc6-with-test-cases.patch,
> prep-attach-test-case-for-naive-bayesian-classifier-for-opennlp-1.6.0-rc6.patch,
> topics.train
>
> Original Estimate: 504h
> Remaining Estimate: 504h
>
> I thought it would be nice to have a Naive Bayesian classifier in OpenNLP (it
> lacks one at present).
> Implementation details: We have a production-hardened piece of Java code for
> a multinomial Naive Bayesian classifier (with default Laplace smoothing) that
> we'd like to contribute. The code is Java 1.5 compatible. I'd have to write
> an adapter to make the interface compatible with the ME classifier in
> OpenNLP. I expect the patch to be available 1 to 3 weeks from now.
> Below is the email trail of a discussion in the dev mailing list around this
> dated May 19th, 2015.
> <snip>
> Tommaso Teofili via opennlp.apache.org
> to dev
> Hi Cohan,
> I think that'd be a very valuable contribution, as NB is one of the
> foundation algorithms, often used as basis for comparisons.
> It would be good if you could create a Jira issue and provide more details
> about the implementation and, eventually, a patch.
> Thanks and regards,
> Tommaso
> </snip>
> 2015-05-19 9:57 GMT+02:00 Cohan Sujay Carlos
> > I have a question for the OpenNLP project team.
> >
> > I was wondering if there is a Naive Bayesian classifier implementation in
> > OpenNLP that I've not come across, or if there are plans to implement one.
> >
> > If it is the latter, I should love to contribute an implementation.
> >
> > There is an ME classifier already available in OpenNLP, of course, but I
> > felt that there was an unmet need for a Naive Bayesian (NB) classifier
> > implementation to be offered as well.
> >
> > An NB classifier could be bootstrapped up with partially labelled training
> > data as explained in the Nigam, McCallum, et al paper of 2000 "Text
> > Classification from Labeled and Unlabeled Documents using EM".
> >
> > So, if there isn't an NB code base out there already, I'd be happy to
> > contribute a very solid implementation that we've used in production for a
> > good 5 years.
> >
> > I'd have to adapt it to load the same training data format as the ME
> > classifier, but I guess that shouldn't be very difficult to do.
> >
> > I was wondering if there was some interest in adding an NB implementation
> > and I'd love to know who could I coordinate with if there is?
> >
> > Cohan Sujay Carlos
> > CEO, Aiaioo Labs, India

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
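
For readers unfamiliar with the technique named in the issue, here is a minimal sketch of a multinomial Naive Bayes classifier with Laplace (add-one) smoothing. It is not the code from the attached patch and it does not use OpenNLP's API. The class name TinyNaiveBayes and its methods are hypothetical, whitespace-split words stand in for real features, and the sketch assumes Java 8 or later rather than the Java 1.5 compatibility mentioned in the issue.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; not the classifier contributed in OPENNLP-777. */
final class TinyNaiveBayes {
    // Number of training documents seen per label (used for the prior).
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    // Per-label feature counts (used for the likelihood).
    private final Map<String, Map<String, Integer>> featureCounts = new HashMap<>();
    // Total number of feature occurrences per label.
    private final Map<String, Integer> totalFeaturesPerLabel = new HashMap<>();
    // All distinct features seen in training, across labels.
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Adds one labelled training document, given as its feature tokens. */
    void train(String label, String[] features) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
                featureCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String f : features) {
            counts.merge(f, 1, Integer::sum);
            totalFeaturesPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(f);
        }
    }

    /** Log of prior times smoothed likelihoods; the label must have been seen in training. */
    double logScore(String label, String[] features) {
        double score = Math.log(docsPerLabel.get(label).doubleValue() / totalDocs);
        Map<String, Integer> counts = featureCounts.get(label);
        int total = totalFeaturesPerLabel.getOrDefault(label, 0);
        for (String f : features) {
            int c = counts.getOrDefault(f, 0);
            // Laplace smoothing: add 1 to every count so unseen features
            // never produce a zero probability.
            score += Math.log((c + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }

    /** Returns the highest-scoring label, or null if nothing was trained. */
    String classify(String[] features) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            double s = logScore(label, features);
            if (s > bestScore) {
                bestScore = s;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("eng", "the cat sat on the mat".split(" "));
        nb.train("deu", "die katze sitzt auf der matte".split(" "));
        System.out.println(nb.classify("the mat".split(" ")));
    }
}
```

Scores are summed in log space so that long inputs do not underflow. A language detector like the one described in the comment would more plausibly use character n-grams as features, but that choice is an assumption here and is independent of the scoring shown.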