[jira] [Closed] (OPENNLP-1182) Improve error handling in LanguageDetectorConverterTool

Jeff Zemerick (Jira) Tue, 03 Jan 2023 06:57:04 -0800


     [ 
https://issues.apache.org/jira/browse/OPENNLP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jeff Zemerick closed OPENNLP-1182.
----------------------------------

> Improve error handling in LanguageDetectorConverterTool
> -------------------------------------------------------
>
>                 Key: OPENNLP-1182
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1182
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Language Detector
>    Affects Versions: 1.8.4
>            Reporter: Steven Rowe
>            Assignee: Atita Arora
>            Priority: Minor
>             Fix For: 2.1.1
>
>
> Contrary to the docs (see below), LanguageDetectorConverterTool doesn't 
> actually do anything at all; the class is empty.
> {quote}
> The following sequence of commands shows how to convert the Leipzig Corpora 
> collection in the folder leipzig-train/ to the default Language Detector 
> format, by creating groups of 5 sentences as documents and limiting to 10000 
> documents per language. Then, it shuffles the result and selects the first 
> 100000 lines as the training corpus and the last 20000 as the evaluation corpus:
> {noformat}
> $ bin/opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ -sentencesPerSample 5 -samplesPerLanguage 10000 > leipzig.txt
> $ perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < leipzig.txt > leipzig_shuf.txt
> $ head -100000 < leipzig_shuf.txt > leipzig.train
> $ tail -20000 < leipzig_shuf.txt > leipzig.eval
> {noformat}
> {quote}
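
For illustration only, the shuffle-and-split step in the quoted commands (the perl, head and tail lines) can be sketched in Python. This is a minimal sketch, not part of OpenNLP or of the issue: the function name shuffle_and_split and its parameters are hypothetical. Like head and tail applied to the same file, it takes the first n_train and the last n_eval lines of the shuffled data, so the two sets overlap whenever the input has fewer than n_train + n_eval lines.

```python
import random


def shuffle_and_split(lines, n_train, n_eval, seed=None):
    """Shuffle lines, then return (first n_train lines, last n_eval lines).

    Hypothetical helper mirroring `shuffle | head -n_train` and
    `shuffle | tail -n_eval` on the same shuffled file.
    """
    rng = random.Random(seed)      # a seed makes the shuffle reproducible
    shuffled = list(lines)         # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    # Guard n_eval == 0: shuffled[-0:] would return the whole list.
    evaluation = shuffled[-n_eval:] if n_eval > 0 else []
    return train, evaluation


if __name__ == "__main__":
    # Toy data standing in for the lines of leipzig.txt.
    samples = [f"lang{i % 3}\tsample text {i}" for i in range(12)]
    train, evaluation = shuffle_and_split(samples, n_train=8, n_eval=4, seed=42)
    print(len(train), len(evaluation))
```

With the sizes from the quoted documentation (100000 and 20000), the split is disjoint only if the shuffled file contains at least 120000 lines.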



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
