[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

helenahm Tue, 01 Aug 2017 23:53:40 -0700

Github user helenahm commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/93
  
    It will include some work. 
    
    Let me explain.
    
    You were right when you said that the OpenNLP implementation is poor 
memory-wise. Indeed, it stores the data in [][] arrays, and does so a few 
times. Using their code directly causes Java heap space errors, GC errors, 
etc. (I tested that on my 97 million rows of data. The newer version of the 
code has the same problems.) And you were right about the wonderful CSRMatrix, 
and DoKMatrix too. They allow more data to be stored. Thus, more or less, I 
have changed all the [][] arrays related to input data to CSRMatrix, and the 
[][] arrays holding weights to DoKMatrix. 
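    
    To illustrate why a CSR layout needs less memory than a [][] array for 
sparse input, here is a minimal sketch. The class CsrSketch and its methods 
are hypothetical and are not Hivemall's CSRMatrix API; the sketch only shows 
the general idea of keeping the non-zero values plus their positions.

```java
/** Hypothetical minimal CSR layout; NOT Hivemall's CSRMatrix API. */
final class CsrSketch {
    final int[] rowPointers;   // row i occupies [rowPointers[i], rowPointers[i + 1])
    final int[] columnIndices; // column index of each stored value
    final double[] values;     // non-zero values, stored row by row

    CsrSketch(int[] rowPointers, int[] columnIndices, double[] values) {
        this.rowPointers = rowPointers;
        this.columnIndices = columnIndices;
        this.values = values;
    }

    /** Builds the three CSR arrays from a dense matrix, skipping zeros. */
    static CsrSketch fromDense(double[][] dense) {
        int nnz = 0;
        for (double[] row : dense) {
            for (double v : row) {
                if (v != 0.0) {
                    nnz++;
                }
            }
        }
        int[] ptr = new int[dense.length + 1];
        int[] cols = new int[nnz];
        double[] vals = new double[nnz];
        int k = 0;
        for (int i = 0; i < dense.length; i++) {
            ptr[i] = k; // where row i starts in cols/vals
            for (int j = 0; j < dense[i].length; j++) {
                if (dense[i][j] != 0.0) {
                    cols[k] = j;
                    vals[k] = dense[i][j];
                    k++;
                }
            }
        }
        ptr[dense.length] = k;
        return new CsrSketch(ptr, cols, vals);
    }

    /** Returns the value at (row, col), or 0.0 if nothing is stored there. */
    double get(int row, int col) {
        for (int k = rowPointers[row]; k < rowPointers[row + 1]; k++) {
            if (columnIndices[k] == col) {
                return values[k];
            }
        }
        return 0.0;
    }

    /** Number of stored (non-zero) values. */
    int nnz() {
        return values.length;
    }
}
```

    With this layout the memory used grows with the number of non-zero 
entries rather than with rows times columns.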
    
    
    To explain further, it is best to look at the source code of GISTrainer, 
in fact at all 3 versions: old maxent, new maxent, and Hivemall's 
BigGISTrainer. The links are below. 
    
    Newer GISTrainer:
    
https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/ml/maxent/GISTrainer.java
    
    Older (3.0.0) GISTrainer:
    https://sourceforge.net/projects/maxent/files/ - whole archive
    GISTrainer attached:
    
[GISTrainer.txt](https://github.com/apache/incubator-hivemall/files/1192806/GISTrainer.txt)
    
    Hivemall GISTrainer:
    
https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/BigGISTrainer.java
    
    Notice how trainModel of BigGISTrainer receives a MatrixForTraining 
(https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/MatrixForTraining.java),
 which holds references to a Matrix and to the outcomes. The Matrix is a 
CSRMatrix. 
    
    Row data is then collected from the CSRMatrix in MatrixForTraining instead 
of from a double[][]. 
    
    This happens in
    ComparableEvent ev = x.createComparableEvent(ti, di.getPredicateIndex(), 
di.getOMap());
    (OpenNLP uses this convenience Event object to work with a row of data. 
Instead of storing a List of Events in memory, the modified code builds an 
event only when it is needed.)
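    
    The build-an-event-when-needed idea can be sketched roughly as follows. 
This is not the code in BigGISTrainer or MatrixForTraining; LazyRowEvents, 
RowEvent and eventAt are hypothetical names, and the sketch assumes the input 
is already held as CSR-style arrays plus one outcome id per row.

```java
import java.util.Arrays;

/** Hypothetical sketch: rows are materialized one at a time from CSR-style arrays. */
final class LazyRowEvents {

    /** One training row: active predicate indices, their values, and the outcome id. */
    static final class RowEvent {
        final int[] predicateIndices;
        final double[] values;
        final int outcome;

        RowEvent(int[] predicateIndices, double[] values, int outcome) {
            this.predicateIndices = predicateIndices;
            this.values = values;
            this.outcome = outcome;
        }
    }

    private final int[] rowPointers;   // row i occupies [rowPointers[i], rowPointers[i + 1])
    private final int[] columnIndices; // predicate index of each stored value
    private final double[] values;     // stored values, row by row
    private final int[] outcomes;      // outcome id for each row

    LazyRowEvents(int[] rowPointers, int[] columnIndices, double[] values, int[] outcomes) {
        this.rowPointers = rowPointers;
        this.columnIndices = columnIndices;
        this.values = values;
        this.outcomes = outcomes;
    }

    int numRows() {
        return outcomes.length;
    }

    /** Builds the event for a single row on demand; no list of events is kept. */
    RowEvent eventAt(int row) {
        int start = rowPointers[row];
        int end = rowPointers[row + 1];
        return new RowEvent(
                Arrays.copyOfRange(columnIndices, start, end),
                Arrays.copyOfRange(values, start, end),
                outcomes[row]);
    }
}
```

    A training loop would call eventAt(i) for each row in turn, so only one 
short-lived event object exists at a time.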
    
    The results are stored in 
    Matrix predCount = new DoKMatrix(numPreds, numOutcomes); 
again instead of a [][] array.
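    
    For readers unfamiliar with the dictionary-of-keys layout, a minimal 
sketch of the idea follows. DokSketch is a hypothetical class, not Hivemall's 
DoKMatrix, and the key packing (row * numCols + col) is an assumption made 
for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical dictionary-of-keys matrix; NOT Hivemall's DoKMatrix API. */
final class DokSketch {
    private final int numRows;
    private final int numCols;
    // Only cells that were written are stored; everything else reads as 0.0.
    private final Map<Long, Double> cells = new HashMap<>();

    DokSketch(int numRows, int numCols) {
        this.numRows = numRows;
        this.numCols = numCols;
    }

    /** Packs (row, col) into a single long key after a bounds check. */
    private long key(int row, int col) {
        if (row < 0 || row >= numRows || col < 0 || col >= numCols) {
            throw new IndexOutOfBoundsException("(" + row + ", " + col + ")");
        }
        return (long) row * numCols + col;
    }

    double get(int row, int col) {
        return cells.getOrDefault(key(row, col), 0.0);
    }

    /** Adds delta to the cell, creating the entry on first write. */
    void add(int row, int col, double delta) {
        cells.merge(key(row, col), delta, Double::sum);
    }

    /** Number of cells actually held in memory. */
    int storedCells() {
        return cells.size();
    }
}
```

    A count table such as predCount, which is updated cell by cell, suits 
this layout because entries can be added or incremented in any order.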
    
    GISTrainer did not change very dramatically. If 3.0.0 training is reliable 
enough, I would, of course, consider the existing version as 1.0 and put in 
the effort to adapt GISTrainer later on. It makes sense to do that, I totally 
agree. And perhaps it makes sense to continue after that by understanding the 
training process in greater detail, and perhaps to write a new, comparable 
trainer that is independent of OpenNLP.



