[ https://issues.apache.org/jira/browse/OPENNLP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeff Zemerick closed OPENNLP-1182.
----------------------------------

> Improve error handling in LanguageDetectorConverterTool
> -------------------------------------------------------
>
>                 Key: OPENNLP-1182
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1182
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Language Detector
>    Affects Versions: 1.8.4
>            Reporter: Steven Rowe
>            Assignee: Atita Arora
>            Priority: Minor
>             Fix For: 2.1.1
>
>
> Contrary to the docs (see below), LanguageDetectorConverterTool doesn't
> actually do anything at all; the class is empty.
> {quote}
> The following sequence of commands shows how to convert the Leipzig Corpora
> collection at folder leipzig-train/ to the default Language Detector format,
> by creating groups of 5 sentences as documents and limiting to 10000
> documents per language. Then, it shuffles the result and selects the first
> 100000 lines as train corpus and the last 20000 as evaluation corpus:
> {noformat}
> $ bin/opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ -sentencesPerSample 5 -samplesPerLanguage 10000 > leipzig.txt
> $ perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < leipzig.txt > leipzig_shuf.txt
> $ head -100000 < leipzig_shuf.txt > leipzig.train
> $ tail -20000 < leipzig_shuf.txt > leipzig.eval
> {noformat}
> {quote}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)