Re: Is this a typical OpenNLP tokenization issue?

Ling Thu, 29 Jun 2017 17:48:07 -0700

These are my original concerns. Deeplearning4j, which uses OpenNLP 1.5,
treats "Costco!", "IKEA." and similar strings as a single token. Jörn
said it's due to the old models.


Thank you Costco!
i love costco!
I love Costco!!
FUCK IKEA.
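
A possible stopgap while still on the old models, sketched in plain Java:
post-process the tokenizer output and split any run of trailing '.', '!' or
'?' into separate tokens. The class name TrailingPunctuationSplitter and its
split method are hypothetical, not part of OpenNLP or Deeplearning4j, and the
punctuation set is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: splits trailing punctuation
// off tokens that a tokenizer left attached to the preceding word.
class TrailingPunctuationSplitter {

    // Characters treated as sentence-final punctuation (an assumption; extend as needed).
    private static final String TRAILING = ".!?";

    static List<String> split(String[] tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            // Walk backwards to find where the run of trailing punctuation starts.
            int end = token.length();
            while (end > 0 && TRAILING.indexOf(token.charAt(end - 1)) >= 0) {
                end--;
            }
            if (end > 0) {
                out.add(token.substring(0, end)); // the word itself
            }
            // Emit each trailing punctuation mark as its own token.
            for (int i = end; i < token.length(); i++) {
                out.add(String.valueOf(token.charAt(i)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tokens as reported in this thread: punctuation glued to the last word.
        String[] tokens = {"I", "love", "Costco!!"};
        System.out.println(split(tokens));
    }
}
```

This is deliberately naive: it would also split the final period off an
abbreviation such as "U.S.", so a model trained on data that resembles the
input remains the better fix.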

On Thu, Jun 29, 2017 at 5:39 PM, Suneel Marthi <[email protected]>
wrote:

> On Thu, Jun 29, 2017 at 8:36 PM, Ling <[email protected]> wrote:
>
> > Hi, Suneel, that's great. The reason was that I wanted to do something in
> > Deeplearning4j and happened to find that OpenNLP was integrated into it
> > already. So I just used their API to call OpenNLP.
> >
> > Is there a set date for the next release? Also, are the 1.5 models the same
> > as the models to be included in the 1.8.1 release?
> >
>
> Should be some time next week.
>
> If by 'models being the same' you are talking about usage, then yes, nothing
> changes in how you invoke the model from your code.
>
> >
> > Thanks.
> > Ling
> >
> > On Thu, Jun 29, 2017 at 5:30 PM, Suneel Marthi <[email protected]>
> > wrote:
> >
> > > On Thu, Jun 29, 2017 at 8:07 PM, Ling <[email protected]> wrote:
> > >
> > > > Hi, Jörn:
> > > >
> > > > I want to use OpenNLP directly, instead of going through Deeplearning4j
> > > > and UIMA. I included version 1.8 in my Maven POM file; do I still need
> > > > to download the models separately? I can't find those model files. For
> > > > example, to do a simple test of the tokenization model,
> > > >
> > >
> > > DL4J is for deep learning, OpenNLP is for text processing - not sure why
> > > you would go to DL4J first and revert back to OpenNLP if all you want to
> > > do is basic text processing.
> > >
> > > The model files (1.5 models) are presently at -
> > > http://opennlp.sourceforge.net/models-1.5/
> > >
> > >
> > >
> > > >
> > > > InputStream is = new FileInputStream("en-token.bin");
> > > >
> > > > Do I have to download the en-token.bin separately? I am working in a
> > > > Maven project. Thank you
> > >
> > >
> > > Yes, the models need to be downloaded separately.
> > >
> > > We finally got approval from the Apache Foundation to distribute OpenNLP
> > > models through Apache. Following the upcoming 1.8.1 release, we should be
> > > distributing updated 1.8.1 models too, once we hash out the details for
> > > doing that.
> > >
> > >
> > > > .
> > > >
> > > > Ling
> > > >
> > > >
> > > > On Thu, Jun 29, 2017 at 10:42 AM, Joern Kottmann <[email protected]>
> > > > wrote:
> > > >
> > > > > Long chain, yes, then you probably use the SourceForge tokenization
> > > > > model that was trained on some old news.
> > > > >
> > > > > We usually don't consider mistakes the models make as bugs, because we
> > > > > can't do much about them other than suggesting to use models that fit
> > > > > your data very well, and even in that case models can be wrong
> > > > > sometimes.
> > > > >
> > > > > If there is something we can do here to reduce the error rate, then
> > > > > we are very happy to get that as a contribution or just pointed out.
> > > > >
> > > > > Jörn
> > > > >
> > > > > On Thu, Jun 29, 2017 at 6:54 PM, Ling <[email protected]> wrote:
> > > > > > Hi, Jörn:
> > > > > >
> > > > > > I am using Deeplearning4j, which uses the org.apache.uima library, I
> > > > > > think. And then UIMA uses OpenNLP. Probably that's what happens.
> > > > > >
> > > > > > So it isn't openNLP's original problem? Thank you.
> > > > > >
> > > > > > Ling
> > > > > >
> > > > > > On Thu, Jun 29, 2017 at 12:30 AM, Joern Kottmann
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > >> Hello,
> > > > > >>
> > > > > >> which model are you using? Did you train it yourself?
> > > > > >>
> > > > > >> Jörn
> > > > > >>
> > > > > >> On Thu, Jun 29, 2017 at 4:04 AM, Ling <[email protected]> wrote:
> > > > > >> > Hi, all:
> > > > > >> >
> > > > > >> > I am testing OpenNLP and found a significant tokenization issue
> > > > > >> > involving punctuation.
> > > > > >> >
> > > > > >> > Thank you Costco!
> > > > > >> > i love costco!
> > > > > >> > I love Costco!!
> > > > > >> > FUCK IKEA.
> > > > > >> >
> > > > > >> > In all these cases, the final punctuation is not split off, so
> > > > > >> > "Costco!" and "IKEA." are each treated as one token. This looks like
> > > > > >> > a systematic problem. Before I file an issue on the OpenNLP project,
> > > > > >> > I want to make sure this issue truly comes from the library.
> > > > > >> >
> > > > > >> > Has any of you encountered a similar problem? Thanks.
> > > > > >>
> > > > >
> > > >
> > >
> >
>
