Re: Is this a typical OpenNLP tokenization issue?

Suneel Marthi Thu, 29 Jun 2017 17:50:55 -0700

Well, you could wait until the next release for newer models.


> On Jun 29, 2017, at 8:47 PM, Ling <[email protected]> wrote:
> 
> These are my original concerns. Deeplearning4j, which uses OpenNLP 1.5,
> treats "Costco!" and "IKEA." and similar strings as single tokens. Jörn
> said it's due to the old models.
> 
> Thank you Costco!
> i love costco!
> I love Costco!!
> FUCK IKEA.
> 
> On Thu, Jun 29, 2017 at 5:39 PM, Suneel Marthi <[email protected]>
> wrote:
> 
>>> On Thu, Jun 29, 2017 at 8:36 PM, Ling <[email protected]> wrote:
>>> 
>>> Hi, Suneel, that's great. The reason was that I wanted to do something in
>>> Deeplearning4j and happened to find that OpenNLP was already integrated
>>> into it. So I just used their API to call OpenNLP.
>>> 
>>> Is there a set date for the next release? Also, are the 1.5 models the
>>> same as the models to be included in the 1.8.1 release?
>>> 
>> 
>> Should be some time next week.
>> 
>> If you are asking whether the models are used the same way, yes: nothing
>> changes in how you invoke the model from your code.
>> 
>>> 
>>> Thanks.
>>> Ling
>>> 
>>> On Thu, Jun 29, 2017 at 5:30 PM, Suneel Marthi <[email protected]>
>>> wrote:
>>> 
>>>>> On Thu, Jun 29, 2017 at 8:07 PM, Ling <[email protected]> wrote:
>>>>> 
>>>>> Hi, Jörn:
>>>>> 
>>>>> I want to use OpenNLP directly, instead of going through Deeplearning4j
>>>>> and UIMA. I added the 1.8 version as a Maven dependency in my POM file.
>>>>> Do I still need to download the models separately? I can't find those
>>>>> model files. For example, to do a simple test of the tokenization model,
>>>>> 
>>>> 
>>>> DL4J is for deep learning, OpenNLP is for text processing - not sure why
>>>> you would go to DL4J first and then fall back to OpenNLP if all you want
>>>> to do is basic text processing.
>>>> 
>>>> The model files (1.5 models) are presently at -
>>>> http://opennlp.sourceforge.net/models-1.5/
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> InputStream is = new FileInputStream("en-token.bin");
>>>>> 
>>>>> Do I have to download en-token.bin separately? I am working in a Maven
>>>>> project. Thank you.
>>>> 
>>>> 
>>>> Yes, the models need to be downloaded separately.
>>>> 
>>>> We finally got approval from the Apache Software Foundation to distribute
>>>> OpenNLP models through Apache. Following the upcoming 1.8.1 release, we
>>>> should be distributing updated 1.8.1 models too, once we hash out the
>>>> details for doing that.
>>>> 
>>>> 
>>>>> .
>>>>> 
>>>>> Ling
>>>>> 
>>>>> 
>>>>> On Thu, Jun 29, 2017 at 10:42 AM, Joern Kottmann <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Long chain, yes. In that case you are probably using the SourceForge
>>>>>> tokenization model, which was trained on some old news.
>>>>>> 
>>>>>> We usually don't consider mistakes the models make as bugs because we
>>>>>> can't do much about them other than suggesting to use models that fit
>>>>>> your data very well, and even in that case models can be wrong
>>>>>> sometimes.
>>>>>> 
>>>>>> If there is something we can do here to reduce the error rate, then we
>>>>>> are very happy to get that as a contribution or just pointed out.
>>>>>> 
>>>>>> Jörn
>>>>>> 
>>>>>>> On Thu, Jun 29, 2017 at 6:54 PM, Ling <[email protected]> wrote:
>>>>>>> Hi, Jörn:
>>>>>>> 
>>>>>>> I am using Deeplearning4j, which I think uses the org.apache.uima
>>>>>>> library, and UIMA in turn uses OpenNLP. That is probably what happens.
>>>>>>> 
>>>>>>> So the problem doesn't originate in OpenNLP itself? Thank you.
>>>>>>> 
>>>>>>> Ling
>>>>>>> 
>>>>>>> On Thu, Jun 29, 2017 at 12:30 AM, Joern Kottmann <[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hello,
>>>>>>>> 
>>>>>>>> Which model are you using? Did you train it yourself?
>>>>>>>> 
>>>>>>>> Jörn
>>>>>>>> 
>>>>>>>> On Thu, Jun 29, 2017 at 4:04 AM, Ling <[email protected]>
>>>>>>>> wrote:
>>>>>>>>> Hi, all:
>>>>>>>>> 
>>>>>>>>> I am testing OpenNLP and found a significant tokenization issue
>>>>>>>>> involving punctuation.
>>>>>>>>> 
>>>>>>>>> Thank you Costco!
>>>>>>>>> i love costco!
>>>>>>>>> I love Costco!!
>>>>>>>>> FUCK IKEA.
>>>>>>>>> 
>>>>>>>>> In all these cases, the final punctuation is not split off, so
>>>>>>>>> "Costco!" and "IKEA." are treated as one token. This looks like a
>>>>>>>>> systematic problem. Before I file an issue on the OpenNLP project, I
>>>>>>>>> want to make sure this issue really comes from the library.
>>>>>>>>> 
>>>>>>>>> Has any of you encountered a similar problem? Thanks.
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>
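
Until newer models are available, one possible stopgap for the behavior
reported above ("Costco!" and "IKEA." coming back as single tokens) is to
post-process the tokenizer output and split off trailing punctuation. The
sketch below is not part of OpenNLP and was not proposed in this thread. The
class name TrailingPunctuationSplitter, the method split, and the choice of
'.', '!' and '?' as the punctuation set are all assumptions made for
illustration. It works on a plain String[] of tokens, whichever tokenizer
produced them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an OpenNLP class): splits trailing '.', '!' and '?'
// off tokens that a tokenizer left attached to the preceding word.
class TrailingPunctuationSplitter {

    // Assumed punctuation set for this sketch; adjust for your data.
    private static final String TRAILING = ".!?";

    static List<String> split(String[] tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            int end = token.length();
            // Walk backwards over any trailing punctuation characters.
            while (end > 0 && TRAILING.indexOf(token.charAt(end - 1)) >= 0) {
                end--;
            }
            if (end == 0 || end == token.length()) {
                // Token is all punctuation, empty, or has no trailing
                // punctuation: keep it unchanged.
                result.add(token);
            } else {
                // Emit the word, then each punctuation mark as its own token.
                result.add(token.substring(0, end));
                for (int i = end; i < token.length(); i++) {
                    result.add(String.valueOf(token.charAt(i)));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Whitespace splitting stands in for the tokenizer output here.
        String[] tokens = "I love Costco!!".split(" ");
        System.out.println(split(tokens)); // prints [I, love, Costco, !, !]
    }
}
```

This is deliberately crude: it would also turn an abbreviation such as "U.S."
into "U.S" and ".", so a model trained on data that resembles your own text
remains the better fix.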
