Re: Is this a typical OpenNLP tokenization issue?

Gary Underwood Thu, 29 Jun 2017 17:46:57 -0700

The models are distributed separately from the library. They can be downloaded from
http://opennlp.sourceforge.net/models-1.5/
Gary Underwood
[email protected]
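
Since the Maven artifact contains only the library code, a file such as en-token.bin has to be on disk before a FileInputStream can open it. Below is a minimal sketch of a guard around that step, using only the JDK. The class name ModelFileCheck, the method openModel, and the default file location are illustrative assumptions and not part of OpenNLP; the URL is the one given above.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ModelFileCheck {

    // Download location for the 1.5-series models (URL taken from this thread).
    static final String MODELS_URL = "http://opennlp.sourceforge.net/models-1.5/";

    /**
     * Opens a model file that was downloaded by hand, or fails with a
     * message that says where to get it. Hypothetical helper, not an
     * OpenNLP API.
     */
    static InputStream openModel(Path modelFile) throws IOException {
        if (!Files.isRegularFile(modelFile)) {
            throw new FileNotFoundException(
                "Model not found at " + modelFile.toAbsolutePath()
                + "; download it from " + MODELS_URL);
        }
        return Files.newInputStream(modelFile);
    }

    public static void main(String[] args) throws IOException {
        // Assumed location: en-token.bin in the working directory,
        // unless a path is given as the first argument.
        Path modelFile = Paths.get(args.length > 0 ? args[0] : "en-token.bin");
        try (InputStream in = openModel(modelFile)) {
            // With opennlp-tools on the classpath, this stream would be
            // passed to new TokenizerModel(in), and that model to TokenizerME.
            System.out.println("Model file is readable: " + modelFile);
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In a Maven project the downloaded file could also live under src/main/resources and be read from the classpath; the sketch keeps the plain file path used in the question below.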




> On Jun 29, 2017, at 8:07 PM, Ling <[email protected]> wrote:
> 
> Hi, Jörn:
> 
> I want to use OpenNLP directly, instead of going through Deeplearning4j and
> UIMA. I added the OpenNLP 1.8 Maven dependency to my POM file. Do I still
> need to download the models separately? I can't find those model files. For
> example, to run a simple test of the tokenization model:
> 
> InputStream is = new FileInputStream("en-token.bin");
> 
> Do I have to download en-token.bin separately? I am working in a Maven
> project. Thank you.
> 
> Ling
> 
> 
> On Thu, Jun 29, 2017 at 10:42 AM, Joern Kottmann <[email protected]> wrote:
> 
>> That is a long chain, yes. In that case you are probably using the
>> SourceForge tokenization model, which was trained on some old news.
>> 
>> We usually don't consider mistakes the models make to be bugs, because
>> we can't do much about them other than suggest using models that fit
>> your data very well, and even then models can be wrong
>> sometimes.
>> 
>> If there is something we can do here to reduce the error rate, we
>> would be very happy to receive it as a contribution or just have it pointed out.
>> 
>> Jörn
>> 
>> On Thu, Jun 29, 2017 at 6:54 PM, Ling <[email protected]> wrote:
>>> Hi, Jörn:
>>> 
>>> I am using Deeplearning4j, which I think uses the org.apache.uima library,
>>> and UIMA in turn uses OpenNLP. That is probably what is happening.
>>> 
>>> So it isn't a problem in OpenNLP itself? Thank you.
>>> 
>>> Ling
>>> 
>>> On Thu, Jun 29, 2017 at 12:30 AM, Joern Kottmann <[email protected]> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> Which model are you using? Did you train it yourself?
>>>> 
>>>> Jörn
>>>> 
>>>> On Thu, Jun 29, 2017 at 4:04 AM, Ling <[email protected]> wrote:
>>>>> Hi, all:
>>>>> 
>>>>> I am testing OpenNLP and found a significant tokenization issue
>>>>> involving punctuation.
>>>>> 
>>>>> Thank you Costco!
>>>>> i love costco!
>>>>> I love Costco!!
>>>>> FUCK IKEA.
>>>>> 
>>>>> In all these cases, the final punctuation is not split off, so "Costco!" and
>>>>> "IKEA." are each treated as one token. This looks like a systematic problem.
>>>>> Before I file an issue with the OpenNLP project, I want to make sure this
>>>>> issue really comes from the library.
>>>>> 
>>>>> Has any of you encountered a similar problem? Thanks.
>>>> 
>>
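
For comparison with the samples quoted at the bottom of the thread, here is a small rule-based splitter that separates trailing punctuation. It is a plain-JDK regex sketch, not the OpenNLP model or any OpenNLP API (OpenNLP ships its own rule-based tokenizers, such as SimpleTokenizer). The class name PunctuationSplit and its pattern are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PunctuationSplit {

    // A token is either a run of letters/digits or one single
    // non-space symbol, so "Costco!" yields "Costco" and "!".
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}\\s]");

    // Hypothetical helper: returns the tokens in order of appearance.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Thank you Costco!",
            "i love costco!",
            "I love Costco!!",
            "IKEA."
        };
        for (String s : samples) {
            System.out.println(tokenize(s));
        }
    }
}
```

This rule is deliberately crude: it also splits contractions such as "don't" at the apostrophe, which is one reason a model trained on data close to your own is usually the better option.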
