Well, you could wait until the next release for newer models.
> On Jun 29, 2017, at 8:47 PM, Ling <[email protected]> wrote:
>
> These are my original concerns. Deeplearning4j, which uses OpenNLP
> 1.5, treats "Costco!" and "IKEA." and similar things as one token. Jörn
> said it's due to old models.
>
> Thank you Costco!
> i love costco!
> I love Costco!!
> FUCK IKEA.
>
> On Thu, Jun 29, 2017 at 5:39 PM, Suneel Marthi <[email protected]>
> wrote:
>
>>> On Thu, Jun 29, 2017 at 8:36 PM, Ling <[email protected]> wrote:
>>>
>>> Hi, Suneel, that's great. The reason was that I wanted to do something in
>>> Deeplearning4j and happened to find that OpenNLP was integrated into it
>>> already. So I just used their API to call OpenNLP.
>>>
>>> Is there a set date for the next release? Also, are the 1.5 models the same
>>> as the models to be included in the 1.8.1 release?
>>>
>>
>> Should be some time next week.
>>
>> If you are talking about the usage by 'models being the same', yes, nothing
>> changes in how you invoke the model from your code.
>>
>>>
>>> Thanks.
>>> Ling
>>>
>>> On Thu, Jun 29, 2017 at 5:30 PM, Suneel Marthi <[email protected]> wrote:
>>>
>>>>> On Thu, Jun 29, 2017 at 8:07 PM, Ling <[email protected]> wrote:
>>>>>
>>>>> Hi, Jörn:
>>>>>
>>>>> I want to use OpenNLP directly, instead of going through Deeplearning4j
>>>>> and UIMA. I included the Maven 1.8 version in my POM file; do I still
>>>>> need to download the models separately? And I can't find those model
>>>>> files. For example, to do a simple test on the tokenization model,
>>>>>
>>>>
>>>> DL4J is for deep learning, OpenNLP is for text processing - not sure why
>>>> you would go to DL4J first and revert back to OpenNLP if all you want to
>>>> do is basic text processing.
>>>>
>>>> The model files (1.5 models) are presently at -
>>>> http://opennlp.sourceforge.net/models-1.5/
>>>>
>>>>>
>>>>> InputStream is = new FileInputStream("en-token.bin");
>>>>>
>>>>> Do I have to download the en-token.bin separately? I am working in a
>>>>> Maven project. Thank you
>>>>
>>>> Yes, the models need to be downloaded separately.
>>>>
>>>> We finally got approval from the Apache Foundation to distribute OpenNLP
>>>> models through Apache. Following the upcoming 1.8.1 release we should be
>>>> distributing updated 1.8.1 models too, once we hash out the details for
>>>> doing that.
>>>>
>>>>>
>>>>> Ling
>>>>>
>>>>>
>>>>> On Thu, Jun 29, 2017 at 10:42 AM, Joern Kottmann <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Long chain, yes, then you probably use the SourceForge tokenization
>>>>>> model that was trained on some old news.
>>>>>>
>>>>>> We usually don't consider mistakes the models make as bugs because we
>>>>>> can't do much about them other than suggesting to use models that fit
>>>>>> your data very well, and even in that case models can be wrong
>>>>>> sometimes.
>>>>>>
>>>>>> If there is something we can do here to reduce the error rate then we
>>>>>> are very happy to get that as a contribution or just pointed out.
>>>>>>
>>>>>> Jörn
>>>>>>
>>>>>>> On Thu, Jun 29, 2017 at 6:54 PM, Ling <[email protected]> wrote:
>>>>>>> Hi, Jörn:
>>>>>>>
>>>>>>> I am using Deeplearning4j, which uses the org.apache.uima library, I
>>>>>>> think. And then UIMA uses OpenNLP. Probably that's what happens.
>>>>>>>
>>>>>>> So it isn't OpenNLP's original problem? Thank you.
>>>>>>>
>>>>>>> Ling
>>>>>>>
>>>>>>> On Thu, Jun 29, 2017 at 12:30 AM, Joern Kottmann <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> which model are you using? Did you train it yourself?
>>>>>>>>
>>>>>>>> Jörn
>>>>>>>>
>>>>>>>> On Thu, Jun 29, 2017 at 4:04 AM, Ling <[email protected]> wrote:
>>>>>>>>> Hi, all:
>>>>>>>>>
>>>>>>>>> I am testing OpenNLP and found a significant tokenization issue
>>>>>>>>> involving punctuation.
>>>>>>>>>
>>>>>>>>> Thank you Costco!
>>>>>>>>> i love costco!
>>>>>>>>> I love Costco!!
>>>>>>>>> FUCK IKEA.
>>>>>>>>>
>>>>>>>>> In all these cases, the last punctuation is not split, so "Costco!"
>>>>>>>>> and "IKEA." are treated as one token. This looks like a systematic
>>>>>>>>> problem. Before I file an issue on the OpenNLP project, I want to
>>>>>>>>> make sure this issue really comes from the library.
>>>>>>>>>
>>>>>>>>> Has any of you encountered a similar problem? Thanks.
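
For anyone who wants to check this behaviour outside of OpenNLP, here is a minimal sketch in plain Java of the split the thread expects: trailing "!", "?" and "." come off as their own tokens. This is not OpenNLP's tokenizer and uses none of its API. The class name TrailingPunctuationCheck and both helper methods are hypothetical, made up for illustration, and the rule is deliberately naive (it would also split the final period of an abbreviation such as "U.S.").

```java
import java.util.ArrayList;
import java.util.List;

public class TrailingPunctuationCheck {

    // Hypothetical helper (not part of OpenNLP): splits one whitespace-delimited
    // chunk into its word part followed by one token per trailing "." "!" or "?".
    static List<String> splitTrailingPunctuation(String chunk) {
        int end = chunk.length();
        // Walk backwards over trailing sentence punctuation.
        while (end > 0 && ".!?".indexOf(chunk.charAt(end - 1)) >= 0) {
            end--;
        }
        List<String> tokens = new ArrayList<>();
        if (end > 0) {
            tokens.add(chunk.substring(0, end));
        }
        // Emit each trailing punctuation mark as a separate token.
        for (int i = end; i < chunk.length(); i++) {
            tokens.add(String.valueOf(chunk.charAt(i)));
        }
        return tokens;
    }

    // Hypothetical helper: whitespace split, then trailing-punctuation split.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : sentence.trim().split("\\s+")) {
            if (!chunk.isEmpty()) {
                tokens.addAll(splitTrailingPunctuation(chunk));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample inputs taken from the original report.
        String[] samples = {"Thank you Costco!", "i love costco!", "I love Costco!!"};
        for (String sample : samples) {
            System.out.println(tokenize(sample));
        }
        System.out.println(splitTrailingPunctuation("IKEA."));
    }
}
```

Comparing this output with what a given tokenizer model returns for the same sentences is one quick way to tell whether the fused tokens come from the model or from the surrounding pipeline.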
