Long chain, yes. In that case you are probably using the SourceForge
tokenization model, which was trained on some old news data.
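
To check whether the behavior really comes from the model, and not from
the Deeplearning4j/UIMA layers in between, you can run the tokenizer
directly against the OpenNLP API. A minimal sketch, assuming you have a
pre-trained tokenizer model on disk (here called en-token.bin, adjust
the path to whatever model you actually use):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TokenizeCheck {

    public static void main(String[] args) throws Exception {
        // Load the pre-trained tokenizer model from disk.
        try (InputStream modelIn = new FileInputStream("en-token.bin")) {
            TokenizerModel model = new TokenizerModel(modelIn);
            TokenizerME tokenizer = new TokenizerME(model);

            // One of the sentences from the report below.
            String[] tokens = tokenizer.tokenize("I love Costco!!");
            for (String token : tokens) {
                System.out.println(token);
            }
        }
    }
}

If "Costco!!" comes out as a single token here as well, the wrapping
libraries are not involved and it is purely a model issue.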

We usually don't consider the mistakes the models make to be bugs,
because there is not much we can do about them other than suggesting
models that fit your data well, and even then the models can sometimes
be wrong.
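
If you want to try training a tokenizer model that fits your data
better, the training data is plain text with one sentence per line and
token boundaries marked with <SPLIT>. A rough sketch of the training
call; the file names tokenizer-train.txt and en-custom-token.bin are
just placeholders:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainTokenizer {

    public static void main(String[] args) throws Exception {
        // Training data, e.g. a line like:  I love Costco!<SPLIT>!
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("tokenizer-train.txt")),
                StandardCharsets.UTF_8);

        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        // Train a maxent tokenizer model for English.
        TokenizerModel model = TokenizerME.train(
                samples,
                new TokenizerFactory("en", null, false, null),
                TrainingParameters.defaultParams());

        samples.close();

        // Serialize the model so it can later be loaded with TokenizerModel.
        try (OutputStream modelOut = new FileOutputStream("en-custom-token.bin")) {
            model.serialize(modelOut);
        }
    }
}

The resulting model is then loaded exactly like the pre-trained one,
via new TokenizerModel(inputStream).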

If there is something we can do here to reduce the error rate, we are
very happy to get it as a contribution or to just have it pointed out.

Jörn

On Thu, Jun 29, 2017 at 6:54 PM, Ling <[email protected]> wrote:
> Hi, Jörn:
>
> I am using Deeplearning4j, which I think uses the org.apache.uima library.
> UIMA in turn uses OpenNLP, so that is probably what happens.
>
> So it isn't originally an OpenNLP problem? Thank you.
>
> Ling
>
> On Thu, Jun 29, 2017 at 12:30 AM, Joern Kottmann <[email protected]> wrote:
>
>> Hello,
>>
>> Which model are you using? Did you train it yourself?
>>
>> Jörn
>>
>> On Thu, Jun 29, 2017 at 4:04 AM, Ling <[email protected]> wrote:
>> > Hi, all:
>> >
>> > I am testing OpenNLP and found a significant tokenization issue
>> > involving punctuation.
>> >
>> > Thank you Costco!
>> > i love costco!
>> > I love Costco!!
>> > FUCK IKEA.
>> >
>> > In all these cases, the final punctuation mark is not split off, so "Costco!"
>> > and "IKEA." are treated as single tokens. This looks like a systematic problem.
>> > Before I file an issue on the OpenNLP project, I want to make sure the issue
>> > is really coming from the library.
>> >
>> > Have any of you encountered a similar problem? Thanks.
>>
