[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-04 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
@myui I share the concern: the modification should be based on the latest 
release of Apache OpenNLP, v1.8.1, if there is no reason to use the pre-Apache 
release. If I had known about the newer version of maxent at the very beginning, 
I would have used it. 

I will examine the newer maxent code in the next few days. 

As you said, have a look at the PR when you have time, and then we can decide 
what to do.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-02 Thread kottmann
Github user kottmann commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
@myui that was done for the 1.6.0 release, and in maxent 3.0.3 it was 
modified to run in multiple threads. 

You probably need to take a similar approach as we took for multi-threaded 
training e.g. split the amount of work done per iteration and scale it out to 
multiple machines, merge the parameters, and repeat for the next iteration.
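
The loop described above (split the work, update locally, merge, repeat) can be sketched roughly as follows. This is only an illustration under assumptions: `IterativeMergeSketch`, `train`, and `localStep` are hypothetical names, merging by plain averaging is one possible choice, and none of this is taken from OpenNLP's or Hivemall's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

/** Hypothetical sketch of a "split, update locally, merge, repeat" loop. */
final class IterativeMergeSketch {

    /** Element-wise average of equally sized parameter vectors. */
    static double[] average(List<double[]> vectors) {
        double[] merged = new double[vectors.get(0).length];
        for (double[] v : vectors) {
            for (int i = 0; i < merged.length; i++) {
                merged[i] += v[i];
            }
        }
        for (int i = 0; i < merged.length; i++) {
            merged[i] /= vectors.size();
        }
        return merged;
    }

    /**
     * Runs the given number of iterations. In each one, every shard gets its own
     * copy of the current parameters, applies localStep (an assumed, caller-supplied
     * update), and the per-shard results are averaged into the next parameters.
     */
    static <S> double[] train(List<S> shards, double[] initial, int iterations,
            BiFunction<S, double[], double[]> localStep) {
        double[] params = initial.clone();
        for (int it = 0; it < iterations; it++) {
            List<double[]> updated = new ArrayList<>();
            for (S shard : shards) {
                // On a cluster, each shard would be handled by a different worker.
                updated.add(localStep.apply(shard, params.clone()));
            }
            params = average(updated);
        }
        return params;
    }
}
```

Here the shards are processed sequentially to keep the sketch short; the point is only that parameters are synchronized once per iteration rather than once at the end.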




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-02 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
@kottmann Do you know in which version the maxent classifier was moved to 
opennlp-tools?
The versioning schemes of the opennlp-maxent and opennlp-tools modules are very 
different.

https://mvnrepository.com/artifact/org.apache.opennlp/opennlp-maxent
https://mvnrepository.com/artifact/org.apache.opennlp/opennlp-tools




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-02 Thread kottmann
Github user kottmann commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
@myui the maxent 3.0.1 version went through Apache IP clearance when the 
code base was moved from SourceForge, and should be almost identical to 3.0.0.




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-02 Thread kottmann
Github user kottmann commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
Sure, there are ways to make this work across multiple machines, but then 
you can't use it like we ship it. Maybe the best solution for you would be to 
just take the code you need, strip it down and get rid of opennlp as a 
dependency?




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-02 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
@helenahm I agree with using Hivemall's Matrix to reduce memory consumption and 
creating a custom BigGISTrainer for Hivemall.

My concern is that the modification could be based on the latest release of 
Apache OpenNLP, `v1.8.1`, if there is no reason to use the pre-Apache release. 

Anyway, I will look into your PR, maybe next week, after merging 
https://github.com/apache/incubator-hivemall/pull/105 
I will fork your PR branch and apply some refactoring (such as removing 
debug prints and unused code). 

BTW, multi-threading should be avoided when running a task in a YARN 
container. It is better to let Hive handle the parallelism.




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-02 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
"Yeah, so if you have a lot of training data then running out of memory is 
one symptom you run into, but that is not the actual problem of this 
implementation." 
- It was a big problem for me when running on Hadoop, and that is why I had to 
alter the training code. 
- The newer version of the code is as bad as the old one from this point of 
view.

"The actual cause is that it won't scale beyond one machine." 
- Yes, that is why I really like what the Hivemall project is about, and that 
is why I needed MaxEnt for Hive. 

"In case you manage to make this run with much more data the time it will 
take to run will be uncomfortably high." 
- That is why I tested my new implementation on almost 100 million 
training samples and saw each of the 302 mappers finish its work in a very 
reasonable time. 




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-02 Thread kottmann
Github user kottmann commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
@helenahm as far as I know the training data is stored once in memory, and 
then for each thread a copy of the parameters is stored. 

Yeah, so if you have a lot of training data then running out of memory is 
one symptom you run into, but that is not the actual problem of this 
implementation. The actual cause is that it won't scale beyond one machine.

Bottom line: if you want to use GIS training with lots of data, don't use 
this implementation. The training requires a certain amount of CPU time, and it 
increases with the amount of training data. If you manage to make this run 
with much more data, the time it takes to run will be uncomfortably high.




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-01 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
It will involve some work. 

Let me explain.

You were right when you said that the OpenNLP implementation is poor 
memory-wise. Indeed, they store the data in [][] arrays, and several times over. 
Using their code directly causes Java heap space errors, GC errors, etc. (I 
tested that on my 97 million data rows. The newer version of the code has the 
same problems.) And you were right about the wonderful CSRMatrix, and DoKMatrix 
too: they make it possible to store more data. 
Thus, more or less, I have changed all the [][] arrays related to input data to 
CSRMatrix and the [][] arrays holding weights to DoKMatrix. 
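
To illustrate why a compressed sparse row (CSR) layout needs less memory than a dense [][] array, here is a minimal sketch. `CsrSketch` is a hypothetical class written only for this note; it does not mirror the API of Hivemall's CSRMatrix.

```java
import java.util.Arrays;

/** Hypothetical minimal CSR matrix; it does not mirror Hivemall's CSRMatrix API. */
final class CsrSketch {
    final int[] rowPointers;   // row r occupies [rowPointers[r], rowPointers[r + 1])
    final int[] columnIndices; // column index of each stored value
    final double[] values;     // non-zero values, stored row by row

    private CsrSketch(int[] rowPointers, int[] columnIndices, double[] values) {
        this.rowPointers = rowPointers;
        this.columnIndices = columnIndices;
        this.values = values;
    }

    /** Builds the CSR form of a dense matrix, keeping only non-zero cells. */
    static CsrSketch fromDense(double[][] dense) {
        int nnz = 0;
        for (double[] row : dense) {
            for (double v : row) {
                if (v != 0.0) {
                    nnz++;
                }
            }
        }
        int[] rowPointers = new int[dense.length + 1];
        int[] columnIndices = new int[nnz];
        double[] values = new double[nnz];
        int k = 0;
        for (int r = 0; r < dense.length; r++) {
            for (int c = 0; c < dense[r].length; c++) {
                if (dense[r][c] != 0.0) {
                    columnIndices[k] = c;
                    values[k] = dense[r][c];
                    k++;
                }
            }
            rowPointers[r + 1] = k;
        }
        return new CsrSketch(rowPointers, columnIndices, values);
    }

    /** Returns the value at (row, col), or 0.0 if the cell is not stored. */
    double get(int row, int col) {
        // Column indices are sorted within a row, so binary search is valid.
        int pos = Arrays.binarySearch(columnIndices, rowPointers[row], rowPointers[row + 1], col);
        return pos >= 0 ? values[pos] : 0.0;
    }
}
```

Memory use grows with the number of non-zero cells plus one pointer per row, not with rows times columns, which is the property that matters for sparse feature data.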


To explain that further, it is best to look at the source code of GISTrainer, 
in fact at all three of them: old maxent, new maxent, and Hivemall's 
BigGISTrainer. The links are below. 

Newer GISTrainer:

https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/ml/maxent/GISTrainer.java

Older (3.0.0) GISTrainer:
https://sourceforge.net/projects/maxent/files/ - whole archive
GISTrainer attached:

[GISTrainer.txt](https://github.com/apache/incubator-hivemall/files/1192806/GISTrainer.txt)

Hivemall GISTrainer:

https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/BigGISTrainer.java

Notice how trainModel of BigGISTrainer gets a MatrixForTraining 
(https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/MatrixForTraining.java),
 which contains references to the Matrix and the outcomes. This Matrix is a 
CSRMatrix. 

Row data is collected from the CSRMatrix in MatrixForTraining instead 
of from the double[][], as in

ComparableEvent ev = x.createComparableEvent(ti, di.getPredicateIndex(), 
di.getOMap());

(They use this convenience Event class to work with a row of data. Instead 
of storing a List of Events in memory, the modified code builds an event 
when needed.)

The results are stored in 
Matrix predCount = new DoKMatrix(numPreds, numOutcomes); instead of [][] 
again.
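
For readers unfamiliar with the dictionary-of-keys (DoK) layout used for the counts above, a minimal sketch follows. `DokCountsSketch` and its methods are hypothetical and are not Hivemall's DoKMatrix API; the sketch only shows that cells are allocated when first written.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical dictionary-of-keys count matrix; not Hivemall's DoKMatrix API. */
final class DokCountsSketch {
    private final Map<Long, Double> cells = new HashMap<>();
    private final int numCols;

    DokCountsSketch(int numRows, int numCols) {
        // numRows is accepted only to mirror a (rows, cols) constructor shape.
        this.numCols = numCols;
    }

    /** Packs (row, col) into a single long key. */
    private long key(int row, int col) {
        return (long) row * numCols + col;
    }

    /** Adds delta to the cell, creating it on first use. */
    void add(int row, int col, double delta) {
        cells.merge(key(row, col), delta, Double::sum);
    }

    /** Returns the stored value, or 0.0 for a cell that was never written. */
    double get(int row, int col) {
        return cells.getOrDefault(key(row, col), 0.0);
    }

    /** Number of cells that actually occupy memory. */
    int storedCells() {
        return cells.size();
    }
}
```

A predicate/outcome count table is typically sparse, so storing only the touched cells avoids allocating numPreds times numOutcomes doubles up front.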

GISTrainer did not change very dramatically. If 3.0.0 training is reliable 
enough, I would, of course, consider the existing version as 1.0 and make 
the effort to adapt to the newer GISTrainer later on. It makes sense to do that, 
I totally agree. And perhaps it makes sense to continue after that by 
understanding the training process in greater detail and perhaps writing a new 
comparable trainer that is independent of OpenNLP. 




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-01 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
@helenahm I think it's better to use [Apache 
opennlp-tools](https://mvnrepository.com/artifact/org.apache.opennlp/opennlp-tools)
 `v1.8.1` as @kottmann 
 mentioned.

`v3.0.0` is no longer supported and may have bugs that show up when training on 
other datasets.
Could you explain the difficulties in applying `v1.8.1` a bit more? Has the API 
changed significantly between `v3.0.0` and `v1.8.1`?

[Notice 
file](https://github.com/apache/incubator-hivemall/blob/master/NOTICE) should 
be updated too if ported sources (not jar) are included.




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-01 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
I think the code is ready to be checked and pulled now.




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-07-24 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
Not sure. Please don't worry about it for Coveralls. 

It seems we need to fix Travis config due to this issue.
https://github.com/travis-ci/travis-ci/issues/7964

Verbose outputs from `MaxEntUDTFTest` and `MaxEntPredictUDFTest` should be 
avoided.
https://travis-ci.org/apache/incubator-hivemall/jobs/256887101

Also, it seems CI was killed for some reason, although the build passes for the 
latest commit on master.

https://docs.travis-ci.com/user/common-build-problems/#My-build-script-is-killed-without-any-error




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-07-24 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
"Changes Unknown when pulling b4383db on helenahm:master into ** on 
apache:master**."

What does that mean?




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-07-24 Thread coveralls
Github user coveralls commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  

[![Coverage 
Status](https://coveralls.io/builds/12520483/badge)](https://coveralls.io/builds/12520483)

Changes Unknown when pulling **b4383dbe3880494bd24fc54c1a22edbc9c51aa7e on 
helenahm:master** into ** on apache:master**.





[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-07-24 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
I have also looked at the resulting models last week. Formally, I added 
only one extra test to tests: hivemall.opennlp.tools.MaxEntPredictUDFTest.java. 

/**
 * Compare MaxEntropy in HiveMall with that of OpenNLP 3.0.0
 */
@Test
public void testResemblenceToOpenNLP() throws Exception {
}

The code inside the test gets access to internal model representation and 
compares feature weights. 

Using similar code, I have looked at the feature weights in 3 models that 
were relevant for my dataset. All the models look reasonable, that is, key 
class features get high weights, and the models predict what they should. One 
of the models is an aggregated model (aggregated from the 302 models I got from 
the mappers). This one looks reasonable too.   
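
A comparison of feature weights like the one described above could look roughly like this. It is a sketch under assumptions: models are represented as plain feature-name-to-weight maps, a feature missing from one model is treated as weight 0.0, and `WeightComparisonSketch` is a hypothetical helper, not the code in `MaxEntPredictUDFTest`.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for comparing two models by their feature weights. */
final class WeightComparisonSketch {

    /**
     * Returns the largest absolute weight difference over the union of features.
     * A feature absent from one model is treated as having weight 0.0 (assumption).
     */
    static double maxAbsDifference(Map<String, Double> a, Map<String, Double> b) {
        Set<String> features = new HashSet<>(a.keySet());
        features.addAll(b.keySet());
        double max = 0.0;
        for (String feature : features) {
            double diff = Math.abs(a.getOrDefault(feature, 0.0) - b.getOrDefault(feature, 0.0));
            max = Math.max(max, diff);
        }
        return max;
    }
}
```

A test could then assert that the result stays below a chosen tolerance; what tolerance is appropriate depends on the trainers being compared.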




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-07-18 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
Yes. I will do all that in the next few days.




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-07-17 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
- [ ] CI is failing due to a missing license header. You can use 
`bin/format_header.sh` to update headers.
- [ ] Could you apply the source code formatter, i.e., `mvn formatter:format`, 
to your commit?
- [ ] The package should be `hivemall.opennlp`, not `hivemall.smile`.
- [ ] This function should be in the `hivemall-nlp` module (refer to KuromojiUDF).
- [ ] DDLs should be described in `resources/ddl/define-additional.hive` 
(grep `tokenize_ja` for a reference).
- [ ] Documentation is required in `docs/gitbook`. Run `npm install 
gitbook-cli; gitbook install; gitbook serve` in `docs/gitbook`.




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-07-10 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
I'll keep an eye on this issue. Let me know when you reach a conclusion.




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-07-10 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
Sure. The OpenNLP-based one will be scalable too. The code is still not perfect, 
but (!) it ran on

97304256 rows of data
that required 302 mappers on my machine, with 
set hive.execution.engine=mr;
set mapreduce.task.timeout=180;

There were no memory issues; I got 302 models back and aggregated them into one 
afterwards.

I still have to work on the code and check whether the resulting model makes 
sense. 
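
The thread does not say how the 302 per-mapper models were aggregated. One simple possibility is to average the weights per feature, sketched below. `ModelAveragingSketch` is hypothetical, models are assumed to be feature-name-to-weight maps, and a feature missing from a model is assumed to contribute 0.0 to the average.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical one-shot aggregation of per-mapper models by weight averaging. */
final class ModelAveragingSketch {

    /**
     * Averages weights feature by feature over all models. The sum for each
     * feature is divided by the total number of models, so a model that lacks
     * a feature counts as weight 0.0 for it (assumption).
     */
    static Map<String, Double> average(List<Map<String, Double>> models) {
        Map<String, Double> merged = new HashMap<>();
        for (Map<String, Double> model : models) {
            for (Map.Entry<String, Double> e : model.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        final int n = models.size();
        merged.replaceAll((feature, sum) -> sum / n);
        return merged;
    }
}
```

Whether averaged weights give a good model depends on how similar the data seen by each mapper is, which is why checking the aggregated model afterwards matters.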




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-07-06 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
@helenahm The OpenNLP-based code might also be useful. Instead of rewriting, 
having both a Smile-based Multinomial Logistic Regression and an OpenNLP-based 
MaxEnt classifier is an option.




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-07-05 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
I think you are right about CSRMatrix (_x here). I should use it without 
converting it back to unscalable structures via DataIndexer,

which is literally what I did with:

    EventStream es = new MatrixEventStream(_x, _y, _attributes);
    AbstractModel model;
    try {
        model = GIS.trainModel(1000, new OnePassRealValueDataIndexer(es, 0),
            _USE_SMOOTHING);
    } catch (IOException e) {
        throw new HiveException(e.getMessage());
    }

I will rewrite the code to be more scalable by changing the OpenNLP code into 
code that uses Smile's and Hivemall's structures. I was mistaken about the way 
OpenNLP processes the data from the EventStream. 

