[ https://issues.apache.org/jira/browse/OPENNLP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654040#comment-17654040 ]
ASF GitHub Bot commented on OPENNLP-1182:
-----------------------------------------

jzonthemtn commented on PR #482:
URL: https://github.com/apache/opennlp/pull/482#issuecomment-1369850475

   I tested this with the 100k English Leipzig file, using the command `./opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ -sentencesPerSample 5 -samplesPerLanguage 10000 > leipzig.txt`. I also tested it with an empty directory to verify the exception message.


> LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise
> ---------------------------------------------------------------------------
>
>                 Key: OPENNLP-1182
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1182
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Language Detector
>    Affects Versions: 1.8.4
>            Reporter: Steven Rowe
>            Assignee: Atita Arora
>            Priority: Minor
>
> Contrary to the docs (see below), LanguageDetectorConverterTool doesn't
> actually do anything at all; the class is empty.
> {quote}
> The following sequence of commands shows how to convert the Leipzig Corpora
> collection at folder leipzig-train/ to the default Language Detector format,
> by creating groups of 5 sentences as documents and limiting to 10000
> documents per language. Then, it shuffles the result and selects the first
> 100000 lines as train corpus and the last 20000 as evaluation corpus:
> {noformat}
> $ bin/opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ -sentencesPerSample 5 -samplesPerLanguage 10000 > leipzig.txt
> $ perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < leipzig.txt > leipzig_shuf.txt
> $ head -100000 < leipzig_shuf.txt > leipzig.train
> $ tail -20000 < leipzig_shuf.txt > leipzig.eval
> {noformat}
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)