Re: Is this a typical OpenNLP tokenization issue?

Suneel Marthi Thu, 29 Jun 2017 17:31:08 -0700

On Thu, Jun 29, 2017 at 8:07 PM, Ling <[email protected]> wrote:

> Hi, Jörn:
>
> I want to use OpenNLP directly, instead of Deeplearning4j and UIMA. I
> added the OpenNLP 1.8 Maven dependency to my POM file. Do I still need to
> download the models separately? I can't find the model files. For
> example, to run a simple test of the tokenization model,
>


DL4J is for deep learning; OpenNLP is for text processing. I'm not sure why
you would go to DL4J first and then come back to OpenNLP if all you want to
do is basic text processing.

The model files (1.5 models) are currently available at:
http://opennlp.sourceforge.net/models-1.5/



>
> InputStream is = new FileInputStream("en-token.bin");
>
> Do I have to download en-token.bin separately? I am working in a Maven
> project. Thank you


Yes, the models need to be downloaded separately.
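
Since the model is not pulled in by the Maven dependency, here is a minimal
sketch of fetching it once from the SourceForge location mentioned above. It
uses only the JDK. The class name ModelFetcher and its methods are
hypothetical, and it assumes each model is served at the base URL plus its
file name (for example en-token.bin); redirects and retries are not handled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of OpenNLP.
class ModelFetcher {
    // Base URL quoted in this thread for the 1.5 models.
    static final String BASE_URL = "http://opennlp.sourceforge.net/models-1.5/";

    // Assumption: a model file lives at BASE_URL + fileName.
    static URI modelUri(String fileName) {
        return URI.create(BASE_URL + fileName);
    }

    // Returns the local path of the model, downloading it only if it is
    // not already present in dir.
    static Path fetchIfMissing(String fileName, Path dir) throws IOException {
        Path target = dir.resolve(fileName);
        if (Files.exists(target)) {
            return target;
        }
        Files.createDirectories(dir);
        // Download to a temporary file first so an interrupted transfer
        // does not leave a truncated model under the final name.
        Path tmp = Files.createTempFile(dir, fileName, ".part");
        try (InputStream in = modelUri(fileName).toURL().openStream()) {
            Files.copy(in, tmp, java.nio.file.StandardCopyOption.REPLACE_EXISTING);
        }
        Files.move(tmp, target);
        return target;
    }

    public static void main(String[] args) throws IOException {
        String fileName = "en-token.bin";
        if (args.length == 0) {
            // No target directory given: only show what would be fetched.
            System.out.println("Would download " + modelUri(fileName));
            return;
        }
        Path model = fetchIfMissing(fileName, Paths.get(args[0]));
        System.out.println("Model available at " + model.toAbsolutePath());
    }
}
```

The returned path can then be opened with new FileInputStream(...) as in the
snippet you quoted, and the stream handed to OpenNLP's TokenizerModel and
TokenizerME classes.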

We finally got approval from the Apache Software Foundation to distribute
OpenNLP models through Apache. After the upcoming 1.8.1 release, we should be
distributing updated 1.8.1 models too, once we work out the details.


> .
>
> Ling
>
>
> On Thu, Jun 29, 2017 at 10:42 AM, Joern Kottmann <[email protected]>
> wrote:
>
> > That is a long chain. Yes, then you are probably using the SourceForge
> > tokenization model, which was trained on some old news text.
> >
> > We usually don't consider mistakes the models make to be bugs, because
> > we can't do much about them other than suggest using models that fit
> > your data very well, and even then models can sometimes be wrong.
> >
> > If there is something we can do here to reduce the error rate, we would
> > be very happy to receive it as a contribution or just have it pointed out.
> >
> > Jörn
> >
> > On Thu, Jun 29, 2017 at 6:54 PM, Ling <[email protected]> wrote:
> > > Hi, Jörn:
> > >
> > > > I am using Deeplearning4j, which I think uses the org.apache.uima
> > > > library, and UIMA in turn uses OpenNLP. That is probably what happens.
> > >
> > > So it isn't openNLP's original problem? Thank you.
> > >
> > > Ling
> > >
> > > > On Thu, Jun 29, 2017 at 12:30 AM, Joern Kottmann <[email protected]> wrote:
> > >
> > >> Hello,
> > >>
> > > >> Which model are you using? Did you train it yourself?
> > >>
> > >> Jörn
> > >>
> > >> On Thu, Jun 29, 2017 at 4:04 AM, Ling <[email protected]> wrote:
> > >> > Hi, all:
> > >> >
> > > >> > I am testing OpenNLP and found a significant tokenization issue
> > > >> > involving punctuation.
> > >> >
> > >> > Thank you Costco!
> > >> > i love costco!
> > >> > I love Costco!!
> > >> > FUCK IKEA.
> > >> >
> > > >> > In all these cases, the final punctuation mark is not split off, so
> > > >> > "Costco!" and "IKEA." are each treated as one token. This looks like
> > > >> > a systematic problem. Before I file an issue with the OpenNLP
> > > >> > project, I want to make sure this issue really comes from the
> > > >> > library.
> > >> >
> > > >> > Has any of you encountered a similar problem? Thanks.
> > >>
> >
>
