Re: Is this a typical OpenNLP tokenization issue?

Ling Thu, 29 Jun 2017 17:36:26 -0700

Hi Suneel, that's great. The reason was that I wanted to do something in
Deeplearning4j and happened to find that OpenNLP was already integrated
into it, so I just used their API to call OpenNLP.


Is there a set date for the next release? Also, are the 1.5 models the same
as the models to be included in the 1.8.1 release?

Thanks.
Ling

On Thu, Jun 29, 2017 at 5:30 PM, Suneel Marthi <[email protected]> wrote:

> On Thu, Jun 29, 2017 at 8:07 PM, Ling <[email protected]> wrote:
>
> > Hi, Jörn:
> >
> > I want to use OpenNLP directly, instead of going through Deeplearning4j
> > and UIMA. I included the 1.8 version from Maven in my POM file. Do I
> > still need to download the models separately? I can't find those model
> > files. For example, to do a simple test of the tokenization model,
> >
>
> DL4J is for deep learning, OpenNLP is for text processing. I'm not sure why
> you would go to DL4J first and then fall back to OpenNLP if all you want to
> do is basic text processing.
>
> The model files (1.5 models) are presently at -
> http://opennlp.sourceforge.net/models-1.5/
>
>
>
> >
> > InputStream is = new FileInputStream("en-token.bin");
> >
> > Do I have to download the en-token.bin separately? I am working in a
> > Maven project. Thank you.
>
>
> Yes, the models need to be downloaded separately.
>
> We finally got approval from the Apache Foundation to distribute OpenNLP
> models through Apache. Following the upcoming 1.8.1 release, we should be
> distributing updated 1.8.1 models too, once we hash out the details for
> doing that.
>
>
> > .
> >
> > Ling
> >
> >
> > On Thu, Jun 29, 2017 at 10:42 AM, Joern Kottmann <[email protected]>
> > wrote:
> >
> > > Long chain, yes. Then you are probably using the SourceForge
> > > tokenization model, which was trained on some old news text.
> > >
> > > We usually don't consider mistakes the models make as bugs, because we
> > > can't do much about them other than suggesting to use models that fit
> > > your data very well, and even then models can be wrong sometimes.
> > >
> > > If there is something we can do here to reduce the error rate then we
> > > are very happy to get that as a contribution or just pointed out.
> > >
> > > Jörn
> > >
> > > On Thu, Jun 29, 2017 at 6:54 PM, Ling <[email protected]> wrote:
> > > > Hi, Jörn:
> > > >
> > > > I am using Deeplearning4j, which I think uses the org.apache.uima
> > > > library, and UIMA in turn uses OpenNLP. That is probably what happens.
> > > >
> > > > So the problem doesn't originate in OpenNLP itself? Thank you.
> > > >
> > > > Ling
> > > >
> > > > On Thu, Jun 29, 2017 at 12:30 AM, Joern Kottmann <[email protected]>
> > > > wrote:
> > > >
> > > >> Hello,
> > > >>
> > > >> which model are you using? Did you train it yourself?
> > > >>
> > > >> Jörn
> > > >>
> > > >> On Thu, Jun 29, 2017 at 4:04 AM, Ling <[email protected]> wrote:
> > > >> > Hi, all:
> > > >> >
> > > > >> > I am testing OpenNLP and found a significant tokenization issue
> > > > >> > involving punctuation.
> > > >> >
> > > >> > Thank you Costco!
> > > >> > i love costco!
> > > >> > I love Costco!!
> > > >> > FUCK IKEA.
> > > >> >
> > > > >> > In all these cases, the final punctuation is not split off, so
> > > > >> > "Costco!" and "IKEA." are each treated as one token. This looks
> > > > >> > like a systematic problem. Before I file an issue on the OpenNLP
> > > > >> > project, I want to make sure this issue really comes from the
> > > > >> > library.
> > > >> >
> > > > >> > Has any of you encountered a similar problem? Thanks.
> > > >>
> > >
> >
>
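
A note for readers who hit the same symptom: the snippet below is a minimal,
standard-library-only Java sketch of the split Ling expected in the examples
quoted above, where sentence-final punctuation becomes its own token. It is
not OpenNLP code and does not show how the OpenNLP tokenizer works; the class
name TrailingPunctuationSplitter and the choice of '.', '!' and '?' as the
characters to split off are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP or Deeplearning4j.
final class TrailingPunctuationSplitter {

    // Characters treated as sentence-final punctuation (an assumption).
    private static final String FINAL_PUNCTUATION = ".!?";

    // Splits text on whitespace, then emits each trailing '.', '!' or '?'
    // of a chunk as a separate one-character token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // only happens when the input is blank
            }
            // Walk back from the end of the chunk over final punctuation.
            int end = chunk.length();
            while (end > 0 && FINAL_PUNCTUATION.indexOf(chunk.charAt(end - 1)) >= 0) {
                end--;
            }
            if (end > 0) {
                tokens.add(chunk.substring(0, end)); // the word itself
            }
            for (int i = end; i < chunk.length(); i++) {
                tokens.add(String.valueOf(chunk.charAt(i))); // one token per mark
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] samples = {"Thank you Costco!", "i love costco!", "I love Costco!!"};
        for (String sample : samples) {
            System.out.println(tokenize(sample));
        }
    }
}
```

A sketch like this could serve as a post-processing step or as a baseline to
compare a model's output against. It also splits the final period off
abbreviations such as "etc.", so it is no substitute for a tokenizer model
trained on data that matches yours, which is what Jörn recommends above.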
