Re: Next release (was: Re: 1.6.0 maven repo)
+1 for the release of 1.6.0 RC

Vinh

On Thu, Nov 20, 2014 at 3:24 PM, Joern Kottmann wrote:

> Yes, all the important issues, except one (OPENNLP-730), are closed now.
> There are still a couple of issues open about name finder feature
> generators, but those could also be added to OpenNLP in a 1.6.1 release
> or during testing.
>
> +1 to make the first RC for 1.6.0 and start testing it
>
> Jörn
>
> On Thu, 2014-11-20 at 07:33 +, Rodrigo Agerri wrote:
> > +1 to start making a release. I would like to be involved too.
> >
> > R
> > On 19 Nov 2014 23:40, "Joern Kottmann" wrote:
> >
> > > Hello,
> > >
> > > yes, that should be the current state.
> > >
> > > Can you please elaborate on the issue you have.
> > > Do you get an old version?
> > >
> > > We should try to make a release of 1.6.0, I think most issues
> > > are already solved and remaining bugs we will uncover during the manual
> > > testing phase.
> > >
> > > Jörn
> > >
> > > On Wed, 2014-11-19 at 21:20 +0100, Rodrigo Agerri wrote:
> > > > Hi
> > > >
> > > > Any chance to release snapshot repos to maven central? Or to an apache
> > > > snapshots repo?
> > > >
> > > > It would make the use of current trunk via API much easier.
> > > >
> > > > Cheers
> > > >
> > > > Rodrigo

--
Vinh Khuc
Next release (was: Re: 1.6.0 maven repo)
Yes, all the important issues, except one (OPENNLP-730), are closed now.
There are still a couple of issues open about name finder feature
generators, but those could also be added to OpenNLP in a 1.6.1 release
or during testing.

+1 to make the first RC for 1.6.0 and start testing it

Jörn

On Thu, 2014-11-20 at 07:33 +, Rodrigo Agerri wrote:
> +1 to start making a release. I would like to be involved too.
>
> R
> On 19 Nov 2014 23:40, "Joern Kottmann" wrote:
>
> > Hello,
> >
> > yes, that should be the current state.
> >
> > Can you please elaborate on the issue you have.
> > Do you get an old version?
> >
> > We should try to make a release of 1.6.0, I think most issues
> > are already solved and remaining bugs we will uncover during the manual
> > testing phase.
> >
> > Jörn
> >
> > On Wed, 2014-11-19 at 21:20 +0100, Rodrigo Agerri wrote:
> > > Hi
> > >
> > > Any chance to release snapshot repos to maven central? Or to an apache
> > > snapshots repo?
> > >
> > > It would make the use of current trunk via API much easier.
> > >
> > > Cheers
> > >
> > > Rodrigo
Re: Build changed opennlp/pom.xml moved to root directory
IMHO it was about time, thanks Jörn :-)

Regards,
Tommaso

2014-11-20 21:11 GMT+01:00 Joern Kottmann :

> Hello everybody,
>
> we changed the structure of the project slightly. The main pom.xml used
> to be located in opennlp/pom.xml. This was done because an Eclipse
> workspace can't have files at the root level. The Maven convention is to
> have the file at the root level. I think it is time to move this file to
> the root directory so that it no longer confuses Maven users (and maybe
> some tools) which expect the file in the root directory.
>
> Please let me know if there are any objections to this.
>
> To build OpenNLP from now on just go to the trunk directory and type "mvn
> install".
>
> Jörn
Build changed opennlp/pom.xml moved to root directory
Hello everybody,

we changed the structure of the project slightly. The main pom.xml used
to be located in opennlp/pom.xml. This was done because an Eclipse
workspace can't have files at the root level. The Maven convention is to
have the file at the root level. I think it is time to move this file to
the root directory so that it no longer confuses Maven users (and maybe
some tools) which expect the file in the root directory.

Please let me know if there are any objections to this.

To build OpenNLP from now on just go to the trunk directory and type "mvn
install".

Jörn
Re: Need to speed up the model creation process of OpenNLP
Hi Nikhil,

#1: What I meant was: see if you can build a model on 1M records, check the confusion matrix and see the performance. Then create a model on 1.5M records, check the confusion matrix and compare. If the improvement is noticeable, then it makes sense to train on more data. If, on the other hand, the improvement is not noticeable, then the model has already reached a plateau in terms of learning. Please look up confusion matrix related information on the web.

#2: Here the approach is somewhat different. If you have specific classes of things that you need to identify, then start off with an even smaller data set containing training data related to one such class (say, just a 5K~10K set), then add training data incrementally from other classes (and train again - from scratch). Note that I do not think there is a way to 'warm start' the learning: I do not think you can take a model that has been trained on one class of data, and incrementally make it learn on another set/class of data. That would be a nice research problem. (BTW, if this is already possible, let me know).

Bottom line, if you have more data to train, it will take time. You can consider some trade-offs in terms of ML as mentioned above. You should definitely use the above along with parallelization, as mentioned by Rodrigo/Joern - it would be a sin not to use it if you are on a multi-core CPU. You might still need the 10gig java heap to process the data though, IMHO.

HTH.

Best,
-Samik

On 19/11/2014 12:09 PM, nikhil jain wrote:

Hi Samik,

Thank you so much for the quick feedback.

1. You can possibly have smaller training sets and see if the models deteriorate substantially:

Yes, I have 4 training sets, each containing 1 million records, but I don't understand how it would be useful. When I am creating one model out of these 4 training sets, I have to pass all the records at once to create the model, so it would take time, right?

2. Another strategy is to incrementally introduce training sets containing specific class of Token Names - that would provide a quicker turnaround:

Right, I am doing the same thing as you mentioned. I have 4 different classes and each class contains 1 million records. Initially I created a model on 1 million records, so it took less time and worked properly. Then I added another one, so the size of the corpus became 2 million, and I again created a model based on 2 million records, and so on. The problem is that when I add more records to the corpus, the model creation process takes more time.

Is it possible to reuse the model with a new training set? For example, I have a model based on 2 million records, and now I would like to reuse the old model but adjust it based on the new records. If this is possible then small training sets would be useful, right?

As I mentioned, I am new to OpenNLP and machine learning, so please explain with an example if I am missing something.

Thanks
Nikhil

From: Samik Raychaudhuri
To: dev@opennlp.apache.org
Sent: Wednesday, November 19, 2014 6:00 AM
Subject: Re: Need to speed up the model creation process of OpenNLP

Hi,

This is essentially a machine learning problem, nothing to do with OpenNLP. If you have such a large corpus, it would take a substantial amount of time to train models. You can possibly have smaller training sets and see if the models deteriorate substantially. Another strategy is to incrementally introduce training sets containing specific class of Token Names - that would provide a quicker turnaround.

Hope this helps.

Best,
-Samik

On 18/11/2014 8:46 AM, nikhil jain wrote:

Hi,

I asked the question below yesterday. Did anyone get a chance to look at it? I am new to OpenNLP and really need some help. Please provide some clue or link or example.

Thanks
Nikhil

From: nikhil jain
To: "us...@opennlp.apache.org" ; Dev at Opennlp Apache
Sent: Tuesday, November 18, 2014 12:02 AM
Subject: Need to speed up the model creation process of OpenNLP

Hi,

I am using the OpenNLP Token Name Finder for parsing unstructured data. I have created a corpus of about 4 million records. When I create a model out of the training set using the OpenNLP APIs in Eclipse with the default settings (cut-off 5 and iterations 100), the process takes a good amount of time, around 2-3 hours.

Can someone suggest how I can reduce the time? I want to experiment with different iterations, but as the model creation process takes so much time, I am not able to experiment with it. This is really a time consuming process.

Please provide some feedback.

Thanks in advance.
Nikhil Jain
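
[Editor's sketch] Samik's suggestion #1 in this thread (train on 1M records, then on 1.5M, and compare the confusion matrices) could look roughly like the following. This is a minimal, library-free illustration and not OpenNLP code: the function names, the macro-averaged F1 summary, the `min_gain` threshold and the toy labels are all assumptions made for the example. In practice `pred_small` and `pred_large` would be the labels that the two trained models assign to the same held-out test set.

```python
from collections import Counter


def confusion_matrix(gold, predicted):
    """Count how often each (gold label, predicted label) pair occurs."""
    return Counter(zip(gold, predicted))


def macro_f1(matrix):
    """Average the per-label F1 scores derived from a confusion matrix."""
    labels = {g for g, _ in matrix} | {p for _, p in matrix}
    scores = []
    for label in labels:
        tp = matrix[(label, label)]
        fp = sum(n for (g, p), n in matrix.items() if p == label and g != label)
        fn = sum(n for (g, p), n in matrix.items() if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


def worth_more_data(gold, pred_small, pred_large, min_gain=0.01):
    """True if the model trained on more data beats the smaller one by min_gain.

    min_gain is an arbitrary, hypothetical threshold for "noticeable".
    """
    gain = (macro_f1(confusion_matrix(gold, pred_large))
            - macro_f1(confusion_matrix(gold, pred_small)))
    return gain >= min_gain


# Toy held-out labels (made up for illustration only).
gold = ["person", "person", "location", "other", "other", "location"]
pred_small = ["person", "other", "location", "other", "other", "other"]
pred_large = ["person", "person", "location", "other", "other", "other"]

print(worth_more_data(gold, pred_small, pred_large))
```

If the function keeps returning False as the training set grows, the model has probably reached the plateau Samik describes, and the extra training time is not buying much.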
Re: 1.6.0 maven repo
We should probably explain it on this page:
http://opennlp.apache.org/maven-dependency.html

Jörn

On 11/20/2014 09:48 AM, Rodrigo Agerri wrote:
> Hi,
>
> On Thu, Nov 20, 2014 at 7:28 AM, Jörn Kottmann wrote:
> > You probably need to include the Apache snapshot repository in
> > your pom to make that work.
> >
> > https://repository.apache.org/content/repositories/snapshots/
>
> This works, thanks. +1 to include this in the documentation. I will
> open an issue.
>
> Cheers,
>
> R
Re: 1.6.0 maven repo
Hi,

On Thu, Nov 20, 2014 at 7:28 AM, Jörn Kottmann wrote:
> You probably need to include the Apache snapshot repository in
> your pom to make that work.
>
> https://repository.apache.org/content/repositories/snapshots/

This works, thanks. +1 to include this in the documentation. I will
open an issue.

Cheers,

R
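
[Editor's sketch] For readers who want to try this, a pom.xml fragment along the following lines should pull in the snapshot repository mentioned in this thread. The repository URL is the one Jörn gives; the repository `id` is an arbitrary label, and the dependency version `1.6.0-SNAPSHOT` is an assumption based on the release being discussed, so check what is actually published before relying on it.

```xml
<!-- Inside <project>: adds the Apache snapshot repository from this thread -->
<repositories>
  <repository>
    <!-- id is an arbitrary, hypothetical label -->
    <id>apache.snapshots</id>
    <url>https://repository.apache.org/content/repositories/snapshots/</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <!-- Assumed version; not stated in the thread -->
    <version>1.6.0-SNAPSHOT</version>
  </dependency>
</dependencies>
```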