Re: Next release

2014-11-24 Thread Jörn Kottmann

On 11/21/2014 01:26 PM, William Colen wrote:

+1 to start the release process

I nominate myself as release manager for 1.6.0.




+1 for William as RM

Jörn


Re: Next release

2014-11-24 Thread Rodrigo Agerri
+1 for William as RM

R

On Mon, Nov 24, 2014 at 3:18 PM, Jörn Kottmann kottm...@gmail.com wrote:
 On 11/21/2014 01:26 PM, William Colen wrote:

 +1 to start the release process

 I nominate myself as release manager for 1.6.0.



 +1 for William as RM

 Jörn


Re: Need to speed up the model creation process of OpenNLP

2014-11-24 Thread nikhil jain
Thanks Samik for the suggestions.
#1: I think I should go with this one. I know what a confusion matrix is (a
matrix of true positives, false positives and so on), but does OpenNLP provide
any CLI or API for creating one, or do you know of any other tool or library
that I can use for this?
#2: Every time I add records or a class to my corpus, I need to train from
scratch. So I don't think there is a way to retrain an existing model.
Thanks,
Nikhil
  From: Samik Raychaudhuri sam...@gmail.com
 To: dev@opennlp.apache.org 
 Sent: Thursday, November 20, 2014 11:46 PM
 Subject: Re: Need to speed up the model creation process of OpenNLP
   
Hi Nikhil,

#1: What I meant was: see if you can build a model on 1M records, check 
the confusion matrix and see the performance. Then create a model on 
1.5M records, check the confusion matrix and compare. If the improvement 
is noticeable, then it makes sense to train on more data. On the other 
hand, if the improvement is not noticeable, then the model has already 
reached a plateau in terms of learning. Please look up confusion matrix 
related information on the web.
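
A minimal sketch of the comparison described in #1, assuming you already have the gold labels and the model's predicted labels for a held-out set as two parallel arrays. It does not use any OpenNLP API; the class name ConfusionMatrixSketch, its methods, and the sample labels are hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch only: not part of OpenNLP.
final class ConfusionMatrixSketch {

    // Returns the sorted list of distinct labels seen in either array.
    static List<String> labels(String[] gold, String[] predicted) {
        TreeSet<String> set = new TreeSet<>();
        for (String g : gold) set.add(g);
        for (String p : predicted) set.add(p);
        return new ArrayList<>(set);
    }

    // matrix[i][j] counts items whose gold label is labels.get(i)
    // and whose predicted label is labels.get(j).
    static int[][] build(List<String> labels, String[] gold, String[] predicted) {
        if (gold.length != predicted.length) {
            throw new IllegalArgumentException("gold and predicted must have equal length");
        }
        int[][] matrix = new int[labels.size()][labels.size()];
        for (int k = 0; k < gold.length; k++) {
            matrix[labels.indexOf(gold[k])][labels.indexOf(predicted[k])]++;
        }
        return matrix;
    }

    // Precision for class c: correct predictions of c / all predictions of c.
    static double precision(int[][] m, int c) {
        int columnSum = 0;
        for (int[] row : m) columnSum += row[c];
        return columnSum == 0 ? 0.0 : (double) m[c][c] / columnSum;
    }

    // Recall for class c: correct predictions of c / all gold items of class c.
    static double recall(int[][] m, int c) {
        int rowSum = 0;
        for (int v : m[c]) rowSum += v;
        return rowSum == 0 ? 0.0 : (double) m[c][c] / rowSum;
    }

    public static void main(String[] args) {
        // Hypothetical held-out data; replace with your own labels.
        String[] gold = {"person", "person", "location", "other", "location"};
        String[] pred = {"person", "other", "location", "other", "person"};
        List<String> labels = labels(gold, pred);
        int[][] m = build(labels, gold, pred);
        for (int c = 0; c < labels.size(); c++) {
            System.out.printf("%s precision=%.2f recall=%.2f%n",
                    labels.get(c), precision(m, c), recall(m, c));
        }
    }
}
```

Build one matrix per model (for example, the 1M and the 1.5M model) on the same held-out data and compare the per-class numbers; if they barely move, more data is unlikely to help.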

#2: Here the approach is somewhat different. If you have specific 
classes of things that you need to identify, then start off with an even 
smaller data set containing training data related to one such class 
(say, just a 5K~10K set), then add training data incrementally from other 
classes (and train again, from scratch). Note that I do not think 
there is a way to 'warm start' the learning: I do not think you can take 
a model that has been trained on one class of data and incrementally 
make it learn on another set/class of data. That would be a nice 
research problem. (BTW, if this is already possible, let me know).

Bottom line, if you have more data to train, it will take time. You can 
consider some trade-offs in terms of ML as mentioned above. You should 
definitely use the above along with parallelization, as mentioned by 
Rodrigo/Joern - it would be a sin not to use it if you are on a 
multi-core CPU. You might still need the 10gig java heap to process the 
data though, IMHO.
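
Before choosing a thread count, it may help to check what the JVM actually has available. The following is a minimal sketch using only the standard java.lang.Runtime API; the class name TrainingResourcesSketch and the helper cappedThreads are hypothetical, and capping at the core count is an assumption about what is sensible, not an OpenNLP rule.

```java
// Illustrative sketch only: inspect the JVM before picking a thread count.
final class TrainingResourcesSketch {

    // Caps a requested thread count at the number of available cores (minimum 1).
    static int cappedThreads(int requested, int availableCores) {
        return Math.max(1, Math.min(requested, availableCores));
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        int cores = rt.availableProcessors();
        // maxMemory() reflects the -Xmx setting the JVM was started with.
        long maxHeapMb = rt.maxMemory() / (1024 * 1024);
        System.out.println("Available cores: " + cores);
        System.out.println("Max heap (MB):   " + maxHeapMb);
        System.out.println("Threads to use:  " + cappedThreads(50, cores));
    }
}
```

Asking for far more threads than cores is unlikely to give any further speed-up.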

HTH.
Best,
-Samik



On 19/11/2014 12:09 PM, nikhil jain wrote:
 Hi Samik,
 Thank you so much for the quick feedback.
 1. You can possibly have smaller training sets and see if the models 
 deteriorate substantially:
 Yes, I have 4 training sets, each containing 1 million records, but I don't 
 understand how that would be useful. When I create one model out of these 
 4 training sets, I have to pass all the records at once, so it would still 
 take time, right?
 2. Another strategy is to incrementally introduce training sets containing 
 specific class of Token Names - that would provide a quicker turnaround:
 Right, I am doing the same thing you mentioned. I have 4 different classes 
 and each class contains 1 million records. Initially I created a model on 
 1 million records; it took less time and worked properly. Then I added 
 another class, so the corpus grew to 2 million records, and I created a 
 model on those 2 million records, and so on. The problem is that as I add 
 more records to the corpus, model creation takes longer. Is it possible to 
 reuse the model with a new training set? For example, I have a model based 
 on 2 million records; can I reuse that old model and just adjust it based 
 on the new records? If this is possible, then small training sets would be 
 useful, right?
 As I mentioned, I am new to OpenNLP and machine learning, so please explain 
 with an example if I am missing something.

 Thanks Nikhil
        From: Samik Raychaudhuri sam...@gmail.com
  To: dev@opennlp.apache.org
  Sent: Wednesday, November 19, 2014 6:00 AM
  Subject: Re: Need to speed up the model creation process of OpenNLP
    
 Hi,
 This is essentially a machine learning problem, nothing to do with
 OpenNLP. If you have such a large corpus, it would take a substantial
 amount of time to train models. You can possibly have smaller training
 sets and see if the models deteriorate substantially. Another strategy
 is to incrementally introduce training sets containing specific class of
 Token Names - that would provide a quicker turnaround.
 Hope this helps.
 Best,
 -Samik




 On 18/11/2014 8:46 AM, nikhil jain wrote:
 Hi,
 I asked the question below yesterday. Did anyone get a chance to look at it?
 I am new to OpenNLP and really need some help. Please provide some clue, 
 link or example.
 Thanks
 Nikhil
          From: nikhil jain nikhil_jain1...@yahoo.com.INVALID
    To: us...@opennlp.apache.org us...@opennlp.apache.org; Dev at Opennlp 
Apache dev@opennlp.apache.org
    Sent: Tuesday, November 18, 2014 12:02 AM
    Subject: Need to speed up the model creation process of OpenNLP
      
 Hi,
 I am using OpenNLP Token Name Finder for parsing the unstructured data. I 
 have created a corpus of about 4 million records. When 

Re: Need to speed up the model creation process of OpenNLP

2014-11-24 Thread nikhil jain
Hi Rodrigo,
I tried to call a train method without resources, but I was getting some 
errors. I could not find any train method that does not take resources.
I found these train methods in class NameFinderME:
1. train(String languageCode, String type, ObjectStream<NameSample> samples,
   TrainingParameters trainParams, byte[] featureGeneratorBytes,
   Map<String, Object> resources)
2. train(String languageCode, String type, ObjectStream<NameSample> samples,
   TrainingParameters trainParams, AdaptiveFeatureGenerator generator,
   Map<String, Object> resources)
3. train(String languageCode, String type, ObjectStream<NameSample> samples,
   Map<String, Object> resources)
4. train(String languageCode, String type, ObjectStream<NameSample> samples,
   AdaptiveFeatureGenerator generator, Map<String, Object> resources,
   int iterations, int cutoff)
Am I missing something? Could you please tell me how I can do this?
Thanks,
Nikhil
  From: Rodrigo Agerri rage...@apache.org
 To: nikhil jain nikhil_jain1...@yahoo.com 
 Sent: Friday, November 21, 2014 12:12 AM
 Subject: Re: Need to speed up the model creation process of OpenNLP
   
Hi Nikhil,

It looks good, but you do not seem to need the resources. Why not use
the train method without the resources?

Also, do you have 50 threads?

Rodrigo



On Thu, Nov 20, 2014 at 5:57 PM, nikhil jain nikhil_jain1...@yahoo.com wrote:
 Thanks for the feedback Rodrigo.
 Yes, I am trying to create a model based on maximum entropy. As I am using
 the API to build the model, I tried adding the thread param to the
 TrainingParameters object, but I am not sure whether I am adding the param
 correctly. I haven't found any clue in the documentation either.

 Here is my code developed with the help of openNLP documentation. Is it the
 correct way of creating a maxent model using multi threads?

 TrainingParameters tp = new TrainingParameters();
 tp.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
 tp.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(100));
 tp.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(5));
 tp.put("Threads", "50");

 Map<String, Object> resources = new HashMap<String, Object>();
 model = NameFinderME.train("en", "sample", sampleStream, tp, generator,
 resources);
 Thanks
 Nikhil


 
 From: Rodrigo Agerri rage...@apache.org
 To: nikhil jain nikhil_jain1...@yahoo.com
 Sent: Thursday, November 20, 2014 11:35 AM

 Subject: Re: Need to speed up the model creation process of OpenNLP

 Hi Nikhil
 The maxent trainer already supports multi-threaded training. If you are using
 the CLI, specify Threads in your training parameters file. Check the sample
 parameters file distributed with OpenNLP.
 If you are using the API, perhaps the easiest way is to create a
 TrainingParameters object with the Threads param specified.
 HTH
 R
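
As a rough illustration of the parameters file mentioned above, the sketch below parses what such a file might contain using the standard java.util.Properties class. The key names (Algorithm, Iterations, Cutoff, Threads) are assumed to match the sample parameters file shipped with OpenNLP, so please check them against your distribution; the values and the class name TrainParamsSketch are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative sketch only: shows the key=value layout of a parameters file.
final class TrainParamsSketch {

    // Assumed contents of a training parameters file; values are examples only.
    static final String SAMPLE =
            "Algorithm=MAXENT\n"
          + "Iterations=100\n"
          + "Cutoff=5\n"
          + "Threads=4\n";

    // Parses key=value text the same way a .properties file would be read.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " = " + props.getProperty(key));
        }
    }
}
```

The same key and value strings are what one would put into a TrainingParameters object when going through the API instead of the CLI.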


 On 19 Nov 2014 21:19, nikhil jain nikhil_jain1...@yahoo.com wrote:

 Hi Rodrigo,

 No, I am not using multi-threading. It's a simple Java program, written with
 help from the OpenNLP documentation. It is worth mentioning that, because
 the corpus contains 4 million records, my Java program running in Eclipse
 frequently ran out of Java heap space (out of memory). I investigated a bit
 and found that the process needed around 10 GB of memory to build the model,
 so I increased the heap to 10 GB using the -Xmx parameter. It then worked
 properly but took 3 hours.

 Thanks
 -NIkhil

 
 From: Rodrigo Agerri rage...@apache.org
 To: dev@opennlp.apache.org dev@opennlp.apache.org; nikhil jain
 nikhil_jain1...@yahoo.com
 Cc: us...@opennlp.apache.org us...@opennlp.apache.org
 Sent: Wednesday, November 19, 2014 2:17 AM
 Subject: Re: Need to speed up the model creation process of OpenNLP

 Hi,

 Are you using multithreading? How many threads, and how much RAM?

 R




 On Tue, Nov 18, 2014 at 5:46 PM, nikhil jain
 nikhil_jain1...@yahoo.com.invalid wrote:
 Hi,
 I asked the question below yesterday. Did anyone get a chance to look at it?
 I am new to OpenNLP and really need some help. Please provide some clue,
 link or example.
 Thanks
 Nikhil
      From: nikhil jain nikhil_jain1...@yahoo.com.INVALID
  To: us...@opennlp.apache.org us...@opennlp.apache.org; Dev at Opennlp
 Apache dev@opennlp.apache.org
  Sent: Tuesday, November 18, 2014 12:02 AM
  Subject: Need to speed up the model creation process of OpenNLP

 Hi,
 I am using the OpenNLP Token Name Finder for parsing unstructured data. I
 have created a corpus of about 4 million records. When I create a model
 from the training set using the OpenNLP API in Eclipse with the default
 settings (cut-off 5 and iterations 100), the process takes a good amount of
 time, around 2-3 hours.
 Can someone suggest how I can reduce the time? I want to experiment
 with different iteration counts, but because model creation takes so
 long, I am not able to.
 Please provide some feedback.
 Thanks in advance.
 Nikhil Jain