[jira] [Commented] (OPENNLP-1182) LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise

ASF GitHub Bot (Jira) Tue, 03 Jan 2023 06:47:06 -0800


    [ 
https://issues.apache.org/jira/browse/OPENNLP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654040#comment-17654040
 ]


ASF GitHub Bot commented on OPENNLP-1182:
-----------------------------------------

jzonthemtn commented on PR #482:
URL: https://github.com/apache/opennlp/pull/482#issuecomment-1369850475

   I tested this with the 100k English Leipzig file by running the command 
`./opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ 
-sentencesPerSample 5 -samplesPerLanguage 10000 > leipzig.txt`. I also tested 
it with an empty directory to verify the exception message.




> LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise
> ---------------------------------------------------------------------------
>
>                 Key: OPENNLP-1182
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1182
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Language Detector
>    Affects Versions: 1.8.4
>            Reporter: Steven Rowe
>            Assignee: Atita Arora
>            Priority: Minor
>
> Contrary to the docs (see below), LanguageDetectorConverterTool doesn't 
> actually do anything at all; the class is empty.
> {quote}
> The following sequence of commands shows how to convert the Leipzig Corpora 
> collection at folder leipzig-train/ to the default Language Detector format, 
> by creating groups of 5 sentences as documents and limiting to 10000 
> documents per language. Then, it shuffles the result and selects the first 
> 100000 lines as the training corpus and the last 20000 as the evaluation corpus:
> {noformat}                                    
> $ bin/opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ 
> -sentencesPerSample 5 -samplesPerLanguage 10000 > leipzig.txt
> $ perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < leipzig.txt > 
> leipzig_shuf.txt
> $ head -100000 < leipzig_shuf.txt > leipzig.train
> $ tail -20000 < leipzig_shuf.txt > leipzig.eval
> {noformat}
> {quote}
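
With a no-op converter, the first command in the quoted docs leaves an empty leipzig.txt, and the later shuffle and split steps then quietly produce empty train and eval files. The following is a minimal sketch of a guard around those steps, not anything shipped with OpenNLP: the function name `split_corpus` and its arguments are hypothetical, and it assumes `shuf` from GNU coreutils in place of the perl one-liner.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenNLP): shuffle a converted corpus and
# split it into train/eval files, refusing to continue on empty input.
# Usage: split_corpus INPUT TRAIN_LINES EVAL_LINES PREFIX
split_corpus() {
    input=$1
    train_n=$2
    eval_n=$3
    prefix=$4

    # A missing or zero-byte file is what a no-op converter leaves behind.
    if [ ! -s "$input" ]; then
        echo "error: $input is missing or empty; did the converter write any samples?" >&2
        return 1
    fi

    # Assumption: GNU coreutils 'shuf' is available (the docs use perl instead).
    shuf "$input" > "${prefix}_shuf.txt"

    # First TRAIN_LINES lines for training, last EVAL_LINES lines for evaluation.
    head -n "$train_n" < "${prefix}_shuf.txt" > "${prefix}.train"
    tail -n "$eval_n" < "${prefix}_shuf.txt" > "${prefix}.eval"
}
```

Called as `split_corpus leipzig.txt 100000 20000 leipzig`, this writes the same leipzig_shuf.txt, leipzig.train and leipzig.eval files as the documented commands. If the shuffled file has fewer lines than the two counts combined, the head and tail selections overlap.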



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
