Hi, Jörn:
I want to use OpenNLP directly, instead of through Deeplearning4j and
UIMA. I included the 1.8 Maven dependency in my POM file; do I still need
to download the models separately? I can't find those model files. For
example, to run a simple test of the tokenization model:
InputStream is = new FileInputStream("en-token.bin");
Do I have to download en-token.bin separately? I am working in a Maven
project. Thank you.
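For reference, this is roughly what I have in my POM (I am assuming the exact version is 1.8.0, since I only noted it as "1.8"; the coordinates are the standard opennlp-tools artifact):

```xml
<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>1.8.0</version>
</dependency>
```

My understanding so far is that this artifact contains only the library code, not the pre-trained model files, which is why I am asking where en-token.bin comes from.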
Ling
On Thu, Jun 29, 2017 at 10:42 AM, Joern Kottmann <[email protected]> wrote:
> Long chain, yes; then you are probably using the SourceForge
> tokenization model, which was trained on some old news text.
>
> We usually don't treat mistakes the models make as bugs, because we
> can't do much about them other than suggesting you use models that fit
> your data well, and even in that case the models can sometimes be
> wrong.
>
> If there is something we can do here to reduce the error rate, we
> would be very happy to receive it as a contribution, or simply to have
> it pointed out.
>
> Jörn
>
> On Thu, Jun 29, 2017 at 6:54 PM, Ling <[email protected]> wrote:
> > Hi, Jörn:
> >
> > I am using Deeplearning4j, which I believe uses the org.apache.uima
> > library, and UIMA in turn uses OpenNLP. That's probably what is
> > happening.
> >
> > So the problem doesn't originate in OpenNLP itself? Thank you.
> >
> > Ling
> >
> > On Thu, Jun 29, 2017 at 12:30 AM, Joern Kottmann <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> which model are you using? Did you train it yourself?
> >>
> >> Jörn
> >>
> >> On Thu, Jun 29, 2017 at 4:04 AM, Ling <[email protected]> wrote:
> >> > Hi, all:
> >> >
> >> > I am testing OpenNLP and have found what looks like a significant
> >> > tokenization issue involving punctuation.
> >> >
> >> > Thank you Costco!
> >> > i love costco!
> >> > I love Costco!!
> >> > FUCK IKEA.
> >> >
> >> > In all these cases, the final punctuation mark is not split off, so
> >> > "Costco!" and "IKEA." are treated as single tokens. This looks like
> >> > a systematic problem. Before I file an issue on the OpenNLP project,
> >> > I want to make sure the issue really comes from the library.
> >> >
> >> > Have any of you encountered a similar problem? Thanks.
> >>
>