Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93

It will include some work. Let me explain.

You were right when you said that the OpenNLP implementation is poor memory-wise. Indeed, they store the data in [][] arrays, and several times over. Using their code directly causes Java heap space errors, GC errors, etc. (I tested that on my 97 million rows of data. The newer version of the code has the same problems.)

And you were right about the wonderful CSRMatrix, and DoKMatrix too. They allow more data to be stored. Thus, more or less, I have changed all the [][] related to input data to CSRMatrix, and the [][] holding weights to DoKMatrix.

To explain that further, it is best to look at the source code for GISTrainer. In fact, all three of them: old maxent, new maxent, and Hivemall's BigGISTrainer. The links are below.

Newer GISTrainer: https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/ml/maxent/GISTrainer.java

Older (3.0.0) GISTrainer: https://sourceforge.net/projects/maxent/files/ (whole archive)
GISTrainer attached: [GISTrainer.txt](https://github.com/apache/incubator-hivemall/files/1192806/GISTrainer.txt)

Hivemall GISTrainer: https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/BigGISTrainer.java

Notice how trainModel of BigGISTrainer gets a MatrixForTraining (https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/MatrixForTraining.java), which contains references to a Matrix and to the outcomes. This Matrix is a CSRMatrix. Row data is collected from the CSRMatrix in MatrixForTraining instead of from the double[][]. This happens in the call

    ComparableEvent ev = x.createComparableEvent(ti, di.getPredicateIndex(), di.getOMap());

(They use this convenience Event object to work with a row of data. Instead of storing a List of Events in memory, the modified code builds an event only when it is needed.)

The results are stored in

    Matrix predCount = new DoKMatrix(numPreds, numOutcomes);

instead of a [][] again.

GISTrainer did not change very dramatically. If 3.0.0 training is reliable enough, I would, of course, treat the existing version as 1.0 and do the work to adapt GISTrainer later on. It makes sense to do that, I totally agree. After that, it perhaps makes sense to continue by understanding the training process in greater detail, and perhaps to write a newer comparable trainer that is independent of OpenNLP.
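
To make the storage change described above concrete, here is a minimal sketch of the idea: input rows kept in CSR form and walked one row at a time, with predicate/outcome counts accumulated into a dictionary-of-keys structure instead of a dense [][]. The classes CsrSketch, DokSketch and SparseTrainingSketch are hypothetical and written only for illustration. They are not Hivemall's CSRMatrix or DoKMatrix, and the counting loop is only in the spirit of what a GIS trainer does, not a copy of BigGISTrainer.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative CSR (compressed sparse row) storage; NOT Hivemall's CSRMatrix. */
final class CsrSketch {
    final int[] rowPointers;   // entries of row i live in [rowPointers[i], rowPointers[i + 1])
    final int[] columnIndices; // predicate (column) index of each stored entry
    final double[] values;     // value of each stored entry

    CsrSketch(int[] rowPointers, int[] columnIndices, double[] values) {
        this.rowPointers = rowPointers;
        this.columnIndices = columnIndices;
        this.values = values;
    }

    /** Builds CSR storage from a dense array, keeping only the non-zero cells. */
    static CsrSketch fromDense(double[][] dense) {
        int nnz = 0;
        for (double[] row : dense) {
            for (double v : row) {
                if (v != 0.0) nnz++;
            }
        }
        int[] ptr = new int[dense.length + 1];
        int[] cols = new int[nnz];
        double[] vals = new double[nnz];
        int k = 0;
        for (int i = 0; i < dense.length; i++) {
            ptr[i] = k; // row i starts at position k
            for (int j = 0; j < dense[i].length; j++) {
                if (dense[i][j] != 0.0) {
                    cols[k] = j;
                    vals[k] = dense[i][j];
                    k++;
                }
            }
        }
        ptr[dense.length] = k; // sentinel: end of the last row
        return new CsrSketch(ptr, cols, vals);
    }

    int numRows() {
        return rowPointers.length - 1;
    }
}

/** Illustrative DoK (dictionary of keys) storage; NOT Hivemall's DoKMatrix. */
final class DokSketch {
    private final int numColumns;
    private final Map<Long, Double> cells = new HashMap<>();

    DokSketch(int numColumns) {
        this.numColumns = numColumns;
    }

    private long key(int row, int col) {
        return (long) row * numColumns + col;
    }

    /** Cells that were never written read as 0.0 and take no memory. */
    double get(int row, int col) {
        return cells.getOrDefault(key(row, col), 0.0);
    }

    void add(int row, int col, double delta) {
        cells.merge(key(row, col), delta, Double::sum);
    }

    int storedCells() {
        return cells.size();
    }
}

final class SparseTrainingSketch {
    /** Accumulates predicate/outcome counts by walking CSR rows one at a time. */
    static DokSketch countPredicates(CsrSketch data, int[] outcomes, int numOutcomes) {
        DokSketch predCount = new DokSketch(numOutcomes);
        for (int row = 0; row < data.numRows(); row++) {
            // Only this row's non-zero entries are touched; no dense row is allocated.
            for (int k = data.rowPointers[row]; k < data.rowPointers[row + 1]; k++) {
                predCount.add(data.columnIndices[k], outcomes[row], data.values[k]);
            }
        }
        return predCount;
    }

    public static void main(String[] args) {
        double[][] dense = { {1, 0, 0, 2}, {0, 0, 3, 0}, {1, 0, 0, 0} };
        int[] outcomes = {0, 1, 0};
        DokSketch predCount = countPredicates(CsrSketch.fromDense(dense), outcomes, 2);
        System.out.println(predCount.get(0, 0));
        System.out.println(predCount.storedCells());
    }
}
```

In this toy input, predicate 0 occurs with outcome 0 in two rows, so the first value printed is 2.0, and only three of the eight possible predicate/outcome cells are actually stored. The saving comes from never allocating the cells that stay at zero, which is the property that matters at the scale of tens of millions of rows.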