Re: OpenNLP: Named Entity Recognition ( Token Name Finder )
Thanks for the feedback. I have evaluated F1 with Maxent and will try it with Perceptron as well.

Nikhil Jain
Sent from Yahoo Mail on Android

From: William Colen william.co...@gmail.com
Date: Thu, Jun 18, 2015 at 2:27 AM
Subject: Re: OpenNLP: Named Entity Recognition ( Token Name Finder )

I can't remember if the iterations parameter is used in PERCEPTRON. In my experience with other tools, you should use a cutoff of 0: Perceptron takes advantage of every feature you add. Did you try the evaluation tools to compute F1?

2015-06-17 13:25 GMT-03:00 nikhil jain nikhil_jain1...@yahoo.com.invalid:

Hello, did anyone get a chance to look at this? Please provide some feedback.

Thanks
Nikhil Jain

From: nikhil jain nikhil_jain1...@yahoo.com.INVALID
Date: Tue, Jun 16, 2015 at 4:36 PM
Subject: Re: OpenNLP: Named Entity Recognition ( Token Name Finder )

Hi William,

Thanks for the link. I have tried both models, Maxent and Perceptron, on my problem, and Perceptron is working much better than Maxent. I have one question: when I create a Perceptron model using cutoff 5 and iterations 100, training stops after the 5th iteration and does not go on to further iterations. Is this correct behaviour, or am I doing something wrong? Adding some code and logs for reference.
    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
    TokenNameFinderModel model = null;
    TrainingParameters tp = new TrainingParameters();
    //tp.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
    tp.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON");
    System.out.println("244:Hybrid parser:PERCEPTRON");
    tp.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(100));
    tp.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(5));
    tp.put("Threads", "3");
    opennlp.tools.util.featuregen.AdaptiveFeatureGenerator generator = null;
    try {
        Map<String, Object> resources = null;
        model = NameFinderME.train("en", "security", sampleStream, tp, generator, resources);
    } catch (IOException e) {
        // (handler body not included in the original message)
    }

Log output:

    Indexing events using cutoff of 5
    Computing event counts... done. 8209384 events
    Indexing... done.
    Collecting events... Done indexing.
    Incorporating indexed data for training... done.
    Number of Event Tokens: 8209384
    Number of Outcomes: 34
    Number of Predicates: 325780
    Computing model parameters...
    Performing 100 iterations.
    1: . (8209184/8209384) 0.75637636149
    2: . (8209291/8209384) 0.886715008093
    3: . (8209340/8209384) 0.946402799528
    4: . (8209356/8209384) 0.965892690609
    5: . (8209357/8209384) 0.967110808802
    Stopping: change in training set accuracy less than 1.0E-5
    Stats: (8104703/8209384) 0.9872486169486042
    ...done.
    Compressed 325780 parameters to 3957
    532 outcome patterns

Thanks
Nikhil

From: William Colen william.co...@gmail.com
Date: Fri, May 29, 2015 at 5:47 PM
Subject: Re: OpenNLP: Named Entity Recognition ( Token Name Finder )

The answer about the differences would be quite long. You can learn about the theory by researching online. Try some papers from here: https://cwiki.apache.org/confluence/display/OPENNLP/NLP+Papers

Which algorithm is better for you depends on your task and your data. You can start developing with the standard Maxent, and when your environment is ready you can try other ML implementations.
Regards,
William

2015-05-29 7:07 GMT-03:00 nikhil jain nikhil_jain1...@yahoo.com.invalid:

Hello, did anyone get a chance to look at the email? I know I am asking a very basic question, but being new to this subject, I find the terms very difficult to understand. I tried to understand them by reading wiki pages but could not fully do so, which is why I raised the question.

Thanks
Nikhil

From: nikhil jain nikhil_jain1...@yahoo.com
Date: Tue, May 19, 2015 at 11:51 PM
Subject: OpenNLP: Named Entity Recognition ( Token Name Finder )

Hello Everyone,

I was reading the OpenNLP documentation and found that OpenNLP supports Maxent, Perceptron and Perceptron sequence type models. Could someone please explain the difference between them? I am trying to understand which one would be good for tagging sequences of data. BTW, I am new to NLP and machine learning, so please help me understand this.

Thanks
Nikhil Jain
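Editor's note: the thread mentions computing F1 to compare the Maxent and Perceptron models. As a reminder of what that number means, here is a minimal, self-contained sketch of precision, recall and F1 computed from entity-level counts. It does not use OpenNLP's own evaluation tools; the class name `F1Sketch` and the counts in `main` are hypothetical and chosen only for illustration.

```java
/** Minimal sketch: entity-level precision, recall and F1 from raw counts. */
final class F1Sketch {

    /** Fraction of predicted entities that were correct. */
    static double precision(long truePositives, long falsePositives) {
        long predicted = truePositives + falsePositives;
        return predicted == 0 ? 0.0 : (double) truePositives / predicted;
    }

    /** Fraction of gold (reference) entities that were found. */
    static double recall(long truePositives, long falseNegatives) {
        long gold = truePositives + falseNegatives;
        return gold == 0 ? 0.0 : (double) truePositives / gold;
    }

    /** Harmonic mean of precision and recall; 0 when both are 0. */
    static double f1(long truePositives, long falsePositives, long falseNegatives) {
        double p = precision(truePositives, falsePositives);
        double r = recall(truePositives, falseNegatives);
        return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Hypothetical counts, not taken from the thread.
        long tp = 80, fp = 20, fn = 20;
        System.out.println("precision = " + precision(tp, fp));
        System.out.println("recall    = " + recall(tp, fn));
        System.out.println("F1        = " + f1(tp, fp, fn));
    }
}
```

Comparing two models this way only makes sense if both are scored on the same held-out data rather than on the training set.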
OpenNLP: Named Entity Recognition ( Token Name Finder )
Hello Everyone,

I was reading the OpenNLP documentation and found that OpenNLP supports Maxent, Perceptron and Perceptron sequence type models. Could someone please explain the difference between them? I am trying to understand which one would be good for tagging sequences of data. BTW, I am new to NLP and machine learning, so please help me understand this.

Thanks
Nikhil Jain
Re: Need to speed up the model creation process of OpenNLP
Thanks, Samik, for the suggestions.

#1: I think I should go with this one. I know about the confusion matrix (the matrix of true positives, false positives and so on), but does OpenNLP provide any CLI or API for creating this confusion matrix, or do you know of any other tool/library I can use for this?

#2: Every time I add records or a class to my corpus, I need to train from scratch. So I don't think there is a way to retrain an existing model.

Thanks
Nikhil

From: Samik Raychaudhuri sam...@gmail.com
To: dev@opennlp.apache.org
Sent: Thursday, November 20, 2014 11:46 PM
Subject: Re: Need to speed up the model creation process of OpenNLP

Hi Nikhil,

#1: What I meant was: see if you can build a model on 1M records, check the confusion matrix and see the performance. Then create a model on 1.5M records, check the confusion matrix and compare. If the improvement is noticeable, then it makes sense to train on more data; if the improvement is not noticeable, then the model has already reached a plateau in terms of learning. Please look up confusion matrix related information on the web.

#2: Here the approach is somewhat different. If you have specific classes of things that you need to identify, then start off with an even smaller data set containing training data for one such class (say, just 5K~10K records), then add training data incrementally from other classes (and train again, from scratch). Note that I do not think there is a way to 'warm start' the learning: I do not think you can take a model that has been trained on one class of data and incrementally make it learn on another set/class of data. That would be a nice research problem. (BTW, if this is already possible, let me know.)

Bottom line: if you have more data to train on, it will take time. You can consider some trade-offs in terms of ML as mentioned above.
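Editor's note: the comparison described in #1 (train on 1M records, then on 1.5M, and compare confusion matrices) can be sketched in plain Java as follows. This is an illustration under stated assumptions, not an OpenNLP API: the class `ConfusionMatrixSketch`, the method `hasPlateaued` and its `minGain` threshold are hypothetical names, and the gold/predicted labels are assumed to come from running each model on the same held-out test set.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of a confusion matrix over a fixed set of labels. */
final class ConfusionMatrixSketch {
    private final List<String> labels;
    private final long[][] counts; // counts[gold][predicted]

    ConfusionMatrixSketch(List<String> labels) {
        this.labels = new ArrayList<>(labels);
        this.counts = new long[labels.size()][labels.size()];
    }

    /** Records one test item: its gold label and the label the model predicted. */
    void add(String gold, String predicted) {
        counts[index(gold)][index(predicted)]++;
    }

    /** Number of items with the given gold label that received the given prediction. */
    long count(String gold, String predicted) {
        return counts[index(gold)][index(predicted)];
    }

    /** Share of items on the diagonal, i.e. predicted label equals gold label. */
    double accuracy() {
        long correct = 0;
        long total = 0;
        for (int g = 0; g < counts.length; g++) {
            for (int p = 0; p < counts.length; p++) {
                total += counts[g][p];
                if (g == p) {
                    correct += counts[g][p];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    /**
     * Hypothetical plateau check: true when the model trained on more data
     * gains less than minGain over the model trained on less data.
     */
    static boolean hasPlateaued(double smallerModelScore, double largerModelScore, double minGain) {
        return largerModelScore - smallerModelScore < minGain;
    }

    private int index(String label) {
        int i = labels.indexOf(label);
        if (i < 0) {
            throw new IllegalArgumentException("Unknown label: " + label);
        }
        return i;
    }
}
```

One matrix would be filled per model; the off-diagonal cells show which classes are being confused, and `hasPlateaued` turns "is the improvement noticeable?" into an explicit threshold that has to be chosen for the task.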
You should definitely use the above along with parallelization, as mentioned by Rodrigo/Joern - it would be a sin not to use it if you are on a multi-core CPU. You might still need the 10 GB Java heap to process the data though, IMHO.

HTH.
Best,
-Samik

On 19/11/2014 12:09 PM, nikhil jain wrote:

Hi Samik,

Thank you so much for the quick feedback.

1. "You can possibly have smaller training sets and see if the models deteriorate substantially": Yes, I have 4 training sets, each containing 1 million records, but I don't understand how that would be useful. When I create one model out of these 4 training sets, I have to pass all the records at once, so it would still take time, right?

2. "Another strategy is to incrementally introduce training sets containing specific class of Token Names - that would provide a quicker turnaround": Right, I am doing the same thing you mentioned. I have 4 different classes and each class contains 1 million records. Initially I created a model on 1 million records, which took less time and worked properly. Then I added another class, so the corpus grew to 2 million records, and I created a model on those 2 million records, and so on. The problem is that as I add more records to the corpus, model creation takes longer. Is it possible to reuse the model with a new training set? That is, if I have a model based on 2 million records, can I reuse the old model and just adjust it based on the new records? If this is possible then small training sets would be useful, right?

As I mentioned, I am new to OpenNLP and machine learning, so please explain with an example if I am missing something.

Thanks
Nikhil

From: Samik Raychaudhuri sam...@gmail.com
To: dev@opennlp.apache.org
Sent: Wednesday, November 19, 2014 6:00 AM
Subject: Re: Need to speed up the model creation process of OpenNLP

Hi,

This is essentially a machine learning problem, nothing to do with OpenNLP.
If you have such a large corpus, it will take a substantial amount of time to train models. You can possibly use smaller training sets and see if the models deteriorate substantially. Another strategy is to incrementally introduce training sets containing specific classes of Token Names - that would provide a quicker turnaround.

Hope this helps.
Best,
-Samik

On 18/11/2014 8:46 AM, nikhil jain wrote:

Hi,

I asked the question below yesterday; did anyone get a chance to look at it? I am new to OpenNLP and really need some help. Please provide some clue, link or example.

Thanks
Nikhil

From: nikhil jain nikhil_jain1...@yahoo.com.INVALID
To: us...@opennlp.apache.org us...@opennlp.apache.org; Dev at Opennlp Apache dev@opennlp.apache.org
Sent: Tuesday, November 18, 2014 12:02 AM
Subject: Need to speed up the model creation process of OpenNLP

Hi,

I am using OpenNLP Token Name Finder for parsing unstructured data. I have created a corpus of about 4 million records. When
Re: Need to speed up the model creation process of OpenNLP
Hi Rodrigo,

I tried to call the train method without resources, but I was getting errors. I did not find any train method without resources. I found these train methods in class NameFinderME:

1. train(String languageCode, String type, ObjectStream<NameSample> samples, TrainingParameters trainParams, byte[] featureGeneratorBytes, Map<String, Object> resources)
2. train(String languageCode, String type, ObjectStream<NameSample> samples, TrainingParameters trainParams, AdaptiveFeatureGenerator generator, Map<String, Object> resources)
3. train(String languageCode, String type, ObjectStream<NameSample> samples, Map<String, Object> resources)
4. train(String languageCode, String type, ObjectStream<NameSample> samples, AdaptiveFeatureGenerator generator, Map<String, Object> resources, int iterations, int cutoff)

Am I missing something? Could you please tell me how I can do so?

Thanks
Nikhil

From: Rodrigo Agerri rage...@apache.org
To: nikhil jain nikhil_jain1...@yahoo.com
Sent: Friday, November 21, 2014 12:12 AM
Subject: Re: Need to speed up the model creation process of OpenNLP

Hi Nikhil,

It looks good, but you do not seem to need the resources, so why don't you use the train method without the resources? Also, do you have 50 threads?

Rodrigo

On Thu, Nov 20, 2014 at 5:57 PM, nikhil jain nikhil_jain1...@yahoo.com wrote:

Thanks for the feedback, Rodrigo.

Yes, I am trying to create a model based on maximum entropy. As I am using the API for building the model, I tried adding the thread param to the TrainingParameters object, but I am not sure whether I am adding the param correctly. I haven't found any clue in the documentation either. Here is my code, developed with the help of the OpenNLP documentation. Is this the correct way of creating a Maxent model using multiple threads?
    TrainingParameters tp = new TrainingParameters();
    tp.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
    tp.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(100));
    tp.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(5));
    tp.put("Threads", "50");
    Map<String, Object> resources = new HashMap<String, Object>();
    model = NameFinderME.train("en", "sample", sampleStream, tp, generator, resources);

Thanks
Nikhil

From: Rodrigo Agerri rage...@apache.org
To: nikhil jain nikhil_jain1...@yahoo.com
Sent: Thursday, November 20, 2014 11:35 AM
Subject: Re: Need to speed up the model creation process of OpenNLP

Hi Nikhil,

The Maxent trainer already allows multi-threaded training. If you are using the CLI, specify the Threads in your train params file. Check the sample parameters file distributed with OpenNLP. If you are using it via the API, perhaps the easiest way is to create a TrainingParameters object with the threads param specified.

HTH
R

On 19 Nov 2014 21:19, nikhil jain nikhil_jain1...@yahoo.com wrote:

Hi Rodrigo,

No, I am not using multi-threading. It's a simple Java program, written with help from the OpenNLP documentation. It is worth mentioning that, as the corpus contains 4 million records, my Java program running in Eclipse was frequently hitting Java heap space (out of memory) errors. I investigated a bit and found that the process needed around 10 GB of memory to build the model, so I increased the heap to 10 GB using the -Xmx parameter. After that it worked properly but took 3 hours.

Thanks
-Nikhil

From: Rodrigo Agerri rage...@apache.org
To: dev@opennlp.apache.org dev@opennlp.apache.org; nikhil jain nikhil_jain1...@yahoo.com
Cc: us...@opennlp.apache.org us...@opennlp.apache.org
Sent: Wednesday, November 19, 2014 2:17 AM
Subject: Re: Need to speed up the model creation process of OpenNLP

Hi,

Are you using multithreading, lots of threads, RAM?
R

On Tue, Nov 18, 2014 at 5:46 PM, nikhil jain nikhil_jain1...@yahoo.com.invalid wrote:

Hi,

I asked the question below yesterday; did anyone get a chance to look at it? I am new to OpenNLP and really need some help. Please provide some clue, link or example.

Thanks
Nikhil

From: nikhil jain nikhil_jain1...@yahoo.com.INVALID
To: us...@opennlp.apache.org us...@opennlp.apache.org; Dev at Opennlp Apache dev@opennlp.apache.org
Sent: Tuesday, November 18, 2014 12:02 AM
Subject: Need to speed up the model creation process of OpenNLP

Hi,

I am using OpenNLP Token Name Finder for parsing unstructured data. I have created a corpus of about 4 million records. When I create a model from the training set using the OpenNLP API in Eclipse with the default settings (cut-off 5 and iterations 100), the process takes a good amount of time, around 2-3 hours.

Can someone suggest how I can reduce the time? I want to experiment with different iteration counts, but because model creation takes so long, I am not able to. This is really a time-consuming process.

Please provide some feedback.

Thanks in advance.
Nikhil Jain
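Editor's note: Rodrigo's question "do you have 50 threads?" and the heap-space issue discussed above can both be checked from inside the JVM before training starts. The sketch below uses only standard `java.lang.Runtime` calls. The class name `TrainerResourcesSketch` and the helper `effectiveThreads` are hypothetical, and capping the thread count at the number of available cores is an assumption made here as a rule of thumb, not something OpenNLP requires.

```java
/** Minimal sketch: inspect cores and heap before choosing training settings. */
final class TrainerResourcesSketch {

    /**
     * Caps the requested thread count at the number of available cores,
     * and never returns less than 1. (Assumed heuristic, not an OpenNLP rule.)
     */
    static int effectiveThreads(int requested, int availableCores) {
        if (requested < 1) {
            return 1;
        }
        return Math.min(requested, Math.max(1, availableCores));
    }

    /** Maximum heap the JVM may use, in megabytes (reflects the -Xmx setting). */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int threads = effectiveThreads(50, cores);
        System.out.println("Available cores: " + cores);
        System.out.println("Threads to request: " + threads);
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
    }
}
```

The resulting value could then be converted with `Integer.toString` and stored under the "Threads" key of the TrainingParameters object, in the same way as the iterations and cutoff values in the code quoted in the thread.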