[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-04 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
@myui I share the concern: the modification can be based on the latest 
release of Apache OpenNLP, v1.8.1, if there is no reason to use the pre-Apache 
release. Had I known about the newer version of maxent at the very beginning, I 
would have used it. 

I will examine the newer maxent code in the next few days. 

As you said, have a look at the PR when you have time. A decision on what 
to do can then be made.




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-02 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
"Yeah, so if you have a lot of training data then running out of memory is 
one symptom you run into, but that is not the actual problem of this 
implementation." 
- it was the big problem for me to use on Hadoop and that is why i had to 
alter the training code 
- the newer version of the code is as bad as the old one from this point of 
view

"The actual cause is that it won't scale beyond one machine." 
- yes, that is why I really like what Hivemall project is about, and that 
is why i needed MaxEnt for Hive 

"In case you manage to make this run with much more data the time it will 
take to run will be uncomfortably high." 
-- that is why i have tested my new implementation on almost 100 mils of 
training samples and saw each of 302 mappers finish work in very reasonable 
time 




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-01 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
It will include some work. 

Let me explain.

You were right when you said that the OpenNLP implementation is poor 
memory-wise. Indeed, they store data in [][] arrays, and several times over. 
Using their code directly causes Java heap space errors, GC errors, etc. (I 
tested that on my 97 million data rows; the newer version of the code has the 
same problems.) And you were right about the wonderful CSRMatrix. And DoKMatrix 
too. They allow storing more data. Thus, more or less, I have changed all the 
[][] related to the input data to CSRMatrix, and the [][] holding the weights 
to DoKMatrix. 


To explain that further, it is best to look at the source code of the 
GISTrainer. In fact, all three of them: the old maxent, the new maxent, and 
Hivemall's BigGISTrainer. The links are below. 

Newer GISTrainer:

https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/ml/maxent/GISTrainer.java

Older (3.0.0) GISTrainer:
https://sourceforge.net/projects/maxent/files/ - whole archive
GISTrainer attached:

[GISTrainer.txt](https://github.com/apache/incubator-hivemall/files/1192806/GISTrainer.txt)

Hivemall GISTrainer:

https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/BigGISTrainer.java

Notice how trainModel of BigGISTrainer gets a MatrixForTraining 
(https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/MatrixForTraining.java),
 which contains references to the Matrix and the outcomes. This is a CSRMatrix. 

Row data is collected from the CSRMatrix in MatrixForTraining instead of 
from the double[][], as in

ComparableEvent ev = x.createComparableEvent(ti, di.getPredicateIndex(), 
di.getOMap());

(They use this convenience Event class to work with a row of data. Instead 
of storing a List of Events in memory, the modified code builds an event only 
when needed.)

And the results are stored in 

Matrix predCount = new DoKMatrix(numPreds, numOutcomes);

instead of a [][] again.
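
To make that concrete, here is a minimal sketch of the two replacements (the 
builder and matrix calls follow Hivemall's matrix API as used in the PR; treat 
the exact signatures, and the placeholder names like rows, numPreds, predIndex, 
as my assumptions rather than verified code):

    // Build the input as a sparse CSR matrix instead of a dense double[][].
    MatrixBuilder builder = new CSRMatrixBuilder(8192);
    for (double[] row : rows) {                // 'rows' stands in for the parsed input
        for (int j = 0; j < row.length; j++) {
            if (row[j] != 0.d) {
                builder.nextColumn(j, row[j]); // only non-zero cells are stored
            }
        }
        builder.nextRow();
    }
    Matrix x = builder.buildMatrix();          // memory grows with non-zeros, not rows * cols

    // Accumulate predicate/outcome statistics sparsely instead of in a [][].
    Matrix predCount = new DoKMatrix(numPreds, numOutcomes);
    predCount.set(predIndex, outcomeIndex,
            predCount.get(predIndex, outcomeIndex, 0.d) + 1.d);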

GISTrainer did not change very dramatically. If 3.0.0 training is reliable 
enough, I would, of course, consider the existing version as 1.0 and make the 
effort to adapt the newer GISTrainer later on. It makes sense to do that, I 
totally agree. And perhaps it makes sense to continue after that by studying 
the training process in greater detail and writing a new, comparable trainer 
that is independent of OpenNLP. 




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-08-01 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
I think the code is ready to be checked and pulled now.




[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...

2017-08-01 Thread helenahm
Github user helenahm commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/93#discussion_r130533347
  
--- Diff: core/pom.xml ---
@@ -103,6 +103,12 @@
            <version>${guava.version}</version>
            <scope>provided</scope>
        </dependency>
+       <dependency>
+           <groupId>opennlp</groupId>
+           <artifactId>maxent</artifactId>
+           <version>3.0.0</version>
--- End diff --

In general I totally agree. I think it would be good to perform the move to 
another version of maxent in a few steps.

1. The code I have re-used is that of GISTrainer. That is, more or less, 
updating the weights in a matrix, where the matrix is Hivemall's matrix. 
Everything else just follows your class structure. I have checked that the 
resulting models are the same, and I have also confirmed that the resulting 
model makes sense on my own data. So the resulting weights must be correct. Can 
we say that training is correct and accept the current version as the correct 
and functioning one?

2. After that there are a few options:
we could try to re-write the code in a way that will accept the newest 
version of opennlp maxent and all following versions. I guess that would 
require changes in opennlp maxent too, but perhaps it is better than manually 
altering GISTrainer every time you update something, and both projects would 
benefit from such collaboration.

if not, then perhaps for Hivemall as a project we may consider re-writing 
iterative scaling from scratch to make it Hivemall-efficient, perhaps using the 
tricks OpenNLP uses to make the code more efficient, and making sure that the 
resulting weights are comparable, but without aiming to be able to plug in a 
new OpenNLP jar each time a new version appears. 
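
For reference, the core of what such a rewrite would have to reproduce is the 
standard GIS update (Darroch-Ratcliff). With C the correction constant (the 
maximum number of active features in any event), each iteration moves every 
weight by the log-ratio of observed to expected feature counts:

    lambda_j  <-  lambda_j + (1/C) * log( observed[j] / expected[j] )

where observed[j] is the empirical count of feature j in the training data and 
expected[j] is its expectation under the current model. Everything else in 
GISTrainer (event streams, predCount tables, smoothing) is machinery for 
computing those two counts, and that is exactly the part that has to scale.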

What do you think?

Regards,
Elena.




[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...

2017-07-31 Thread helenahm
Github user helenahm commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/93#discussion_r130332904
  
--- Diff: core/pom.xml ---
@@ -103,6 +103,12 @@
            <version>${guava.version}</version>
            <scope>provided</scope>
        </dependency>
+       <dependency>
+           <groupId>opennlp</groupId>
+           <artifactId>maxent</artifactId>
+           <version>3.0.0</version>
--- End diff --

Thank you for the comment. Is there a reason for chasing the latest 
version? Algorithms do not age... Was an error discovered in the maxent 3.0.0 
training?  




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-07-24 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
"Changes Unknown when pulling b4383db on helenahm:master into ** on 
apache:master**."

What does that mean?




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-07-24 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
I have also looked at the resulting models last week. Formally, I added 
only one extra test: hivemall.opennlp.tools.MaxEntPredictUDFTest.java. 

/**
 * Compare MaxEntropy in HiveMall with that of OpenNLP 3.0.0
 */
@Test
public void testResemblenceToOpenNLP() throws Exception {
}

The code inside the test gets access to the internal model representation and 
compares feature weights. 

Using similar code, I have looked at the feature weights in 3 models that 
were relevant for my dataset. All the models look reasonable, that is, key 
class features get high weights, and the models predict what they should. One 
of the models is an aggregated model (aggregated from the 302 models I got 
from the mappers). This one looks reasonable too.   
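
(If it helps reviewing: the same resemblance can also be checked without 
touching internals, by scoring identical contexts with both models through the 
public MaxentModel.eval() API. A rough sketch, with hypothetical feature names:

    String[] context = new String[] {"w=the", "suf=ing"};  // hypothetical features
    double[] ours = hivemallModel.eval(context);           // outcome distribution
    double[] theirs = opennlpModel.eval(context);
    for (int i = 0; i < ours.length; i++) {
        Assert.assertEquals(theirs[i], ours[i], 1e-6);     // JUnit 4 tolerance assert
    }

The committed test compares the feature weights directly, which is stricter.)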




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-07-18 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
Yes. I will do all that in the next few days.




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-07-10 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
Sure. The OpenNLP one will be scalable too. The code is still not perfect now, 
but (!) 

97,304,256 rows of data
required 302 mappers on my machine, with 
set hive.execution.engine=mr;
set mapreduce.task.timeout=180;

no memory issues; I got 302 models back and aggregated them into one afterwards.

I still have to work on the code and check whether the resulting model makes 
sense. 




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...

2017-07-05 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
I think you are right about CSRMatrix (_x here). I should use it without 
converting it back to unscalable structures via the DataIndexer.

As I literally did by:

EventStream es = new MatrixEventStream(_x, _y, _attributes); 
AbstractModel model;
try {
    // OnePassRealValueDataIndexer re-materializes all events in memory,
    // which defeats the purpose of the sparse input matrix
    model = GIS.trainModel(1000, new OnePassRealValueDataIndexer(es, 0), 
        _USE_SMOOTHING);
} catch (IOException e) {
    throw new HiveException(e.getMessage());
}

I will re-write the code to be more scalable, by changing the OpenNLP code 
into code that uses Smile's and Hivemall's structures. I was mistaken about 
the way OpenNLP processes the data from the EventStream. 




[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model

2017-07-04 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
I will do more tests too, as I actually need the model for a project. So I 
plan to test it under "load" as well. I will write about the results.

It may have issues similar to those Random Forest has. You are right. In a 
nutshell, the implementation and the memory concerns are similar. 

The implementation is as scalable as the implementation of Random Forest: 
one or more models per mapper, and then a UDAF that combines all the learned 
models into one final model.
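
As a rough illustration of that combining step (this is not the PR's actual 
UDAF code; names and shapes are placeholders), the Mixture Weight Method with 
uniform weights reduces to a per-parameter average over the per-mapper models:

    // Average p weight tables of shape [numPreds][numOutcomes] into one model.
    double[][] combined = new double[numPreds][numOutcomes];
    for (double[][] w : perMapperWeights) {      // one table per mapper, p tables in total
        for (int i = 0; i < numPreds; i++) {
            for (int j = 0; j < numOutcomes; j++) {
                combined[i][j] += w[i][j] / perMapperWeights.size();
            }
        }
    }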

I still use Random Forest, even though on EMR r4 machines _numTrees greater 
than 1 does not work for my dataset. MaxEnt, though, will give me a better 
model, I think; I will not have to wonder whether there is overfitting because 
of the tree structure, etc.

Iterative Scaling can also be re-written from scratch, without using any 
third-party software. That is an option too.

I am sure that the NLP community will be more likely to accept the 
implementation and use it in exactly the way those guys have written it. We 
very much value Adwait Ratnaparkhi's work. Many published articles use exactly 
that MaxEnt implementation. That means people will be able to use HiveMall and 
compare their newer results with the results of their previous work.







[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...

2017-07-04 Thread helenahm
Github user helenahm commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/93#discussion_r125552049
  
--- Diff: core/src/main/java/hivemall/smile/classification/MaxEntUDTF.java ---
@@ -0,0 +1,440 @@
+package hivemall.smile.classification;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.BitSet;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.Callable;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import javax.annotation.concurrent.GuardedBy;
+
+import org.apache.commons.cli.CommandLine;
+import org.apache.commons.cli.Options;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hive.ql.exec.MapredContext;
+import org.apache.hadoop.hive.ql.exec.MapredContextAccessor;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.serde2.io.DoubleWritable;
+import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
+import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.hadoop.mapred.Counters.Counter;
+
+import hivemall.UDTFWithOptions;
+import hivemall.math.matrix.Matrix;
+import hivemall.math.matrix.MatrixUtils;
+import hivemall.math.matrix.builders.CSRMatrixBuilder;
+import hivemall.math.matrix.builders.MatrixBuilder;
+import hivemall.math.matrix.builders.RowMajorDenseMatrixBuilder;
+import hivemall.math.matrix.ints.ColumnMajorIntMatrix;
+import hivemall.math.matrix.ints.DoKIntMatrix;
+import hivemall.math.matrix.ints.IntMatrix;
+import hivemall.math.random.PRNG;
+import hivemall.math.random.RandomNumberGeneratorFactory;
+import hivemall.math.vector.Vector;
+import hivemall.math.vector.VectorProcedure;
+import hivemall.smile.classification.DecisionTree.SplitRule;
+import hivemall.smile.data.Attribute;
+import hivemall.smile.tools.MatrixEventStream;
+import hivemall.smile.tools.SepDelimitedTextGISModelWriter;
+import hivemall.smile.utils.SmileExtUtils;
+import hivemall.smile.utils.SmileTaskExecutor;
+import hivemall.utils.codec.Base91;
+import hivemall.utils.collections.lists.IntArrayList;
+import hivemall.utils.hadoop.HiveUtils;
+import hivemall.utils.hadoop.WritableUtils;
+import hivemall.utils.lang.Preconditions;
+import hivemall.utils.lang.Primitives;
+import hivemall.utils.lang.RandomUtils;
+
+import opennlp.maxent.GIS;
+import opennlp.maxent.io.GISModelWriter;
+import opennlp.model.AbstractModel;
+import opennlp.model.Event;
+import opennlp.model.EventStream;
+import opennlp.model.OnePassRealValueDataIndexer;
+
+@Description(
+name = "train_maxent_classifier",
+value = "_FUNC_(array features, int label [, const boolean 
classification])"
++ " - Returns a maximum entropy model per subset of data.")
+@UDFType(deterministic = true, stateful = false)
+public class MaxEntUDTF extends UDTFWithOptions{
+   private static final Log logger = LogFactory.getLog(MaxEntUDTF.class);
+   
+   private ListObjectInspector featureListOI;
+private PrimitiveObjectInspector featureElemOI;
+private PrimitiveObjectInspector labelOI;
+
+private MatrixBuilder matrixBuilder;
+private IntArrayList labels;
+
+   private boolean _real;
+   private Attribute[] _attributes;
+   private static boolean _USE_SMOOTHING;
+   private double _SMOOTHING_OBSERVATION;
+   
+   private int _numTrees = 1;
+
+@Nullable
+private Reporter _progressReporter;
+@Nullable
+private Counter _treeBuildTaskCounter;
+
+@Override
+protected Options getOptions() {
+Options opts = new Options();
   

[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...

2017-07-04 Thread helenahm
Github user helenahm commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/93#discussion_r125551962
  
--- Diff: core/src/main/java/hivemall/smile/classification/MaxEntUDTF.java ---

[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...

2017-07-04 Thread helenahm
Github user helenahm commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/93#discussion_r125551895
  
--- Diff: core/src/main/java/hivemall/smile/classification/MaxEntUDTF.java ---

[GitHub] incubator-hivemall issue #93: Maximum Entropy Model

2017-07-04 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
I have completed all the small code changes you have commented on.




[GitHub] incubator-hivemall issue #93: Maximum Entropy Model

2017-07-04 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
The official page for OpenNLP MaxEnt:
http://maxent.sourceforge.net/about.html

The choice of models is 100% NLP-influenced.

The opennlp.maxent package was originally built by Jason Baldridge, Tom 
Morton, and Gann Bierner.  We owe a big thanks to Adwait Ratnaparkhi for his 
work on maximum entropy models for natural language processing applications.  
His introduction to maxent for NLP and dissertation are what really made 
opennlp.maxent and our Grok maxent components (POS tagger, end of sentence 
detector, tokenizer, name finder) possible! 





[GitHub] incubator-hivemall issue #93: Maximum Entropy Model

2017-07-04 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
In Hivemall-126: 

Max Entropy Classifier (a.k.a. Multi-nominal/Multiclass Logistic 
Regression) [1,2] is useful for Text classification.

Max Entropy Classifier is more often used for Part-of-Speech Tagging and 
Named Entity Recognition, and some other tasks where context is used as 
features. Those are also fundamental tasks of NLP, even though Text 
Classification is a candidate too.

Mohri and his colleagues also put the POS task first. As Mohri writes in the 
article that is the basis for the implementation I have chosen:
Our first set of experiments were carried out with “medium” scale data sets 
containing 1M-300M instances. These included: English part-of-speech tagging, 
generated from the Penn Treebank [16] using the first character of each 
part-of-speech tag as output, sections 2-21 for training, section 23 for 
testing and a feature representation based on the identity, affixes, and 
orthography of the input word and the words in a window of size two; Sentiment 
analysis, generated from a set of online product, service, and merchant 
reviews with a three-label output (positive, negative, neutral), with a bag of 
words feature representation; RCV1-v2 as described by [14], where documents 
having multiple labels were included multiple times, once for each label; 
Acoustic Speech Data, a 39-dimensional input consisting of 13 PLP coefficients, 
plus their first and second derivatives, and 129 outputs (43 phones × 3 
acoustic states); and the Deja News Archive, a text topic classification 
problem generated from a collection of Usenet discussion forums from the years 
1995-2000. For all text experiments, we used random feature mixing [9, 20] to 
control the size of the feature space.




[GitHub] incubator-hivemall issue #93: Maximum Entropy Model

2017-07-04 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
I have just put myself as a watcher on HIVEMALL-126, and ran the mvn formatter. 
The formatter made changes in heaps of files. Shall I commit all the changes? 
Or would it be better to pinpoint my own files only? 




[GitHub] incubator-hivemall issue #93: Maximum Entropy Model

2017-07-03 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
• Could you create a JIRA issue for the MaxEntropy classifier and rename the 
PR title to [WIP][HIVEMALL-xxx]?

How do I do that?

• Could you apply Hivemall code formatting? You can use mvn formatter:format 
or this style file for Eclipse/IntelliJ.

Just mvn formatter:format? How does it work? Is there something in the pom 
that tells Maven how to re-format the files?





[GitHub] incubator-hivemall issue #93: Maximum Entropy Model

2017-07-03 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
Just added a MaxEntMixtureWeightUDAF to aggregate the weights of the models 
obtained on each part of the data:
create temporary function aggregate_classifiers as 
'hivemall.smile.tools.MaxEntMixtureWeightUDAF';


Tested it on EMR only:

add jar hivemall-core-0.4.2-rc.2-maxent-with-dependencies.jar;
add jar opennlp-maxent-3.0.0.jar;
source define-all.hive;
create temporary function train_maxent_classifier as 
'hivemall.smile.classification.MaxEntUDTF';
create temporary function predict_maxent_classifier as 
'hivemall.smile.tools.MaxEntPredictUDF';
create temporary function aggregate_classifiers as 
'hivemall.smile.tools.MaxEntMixtureWeightUDAF';
select aggregate_classifiers(model) from tmodel5;

Where tmodel5 contains 5 copies of the same model. 

I will do more testing.




[GitHub] incubator-hivemall issue #93: Maximum Entropy Model

2017-07-01 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93
  
And please, disregard the LDAUDTFTest.java. We have already discussed that.




[GitHub] incubator-hivemall pull request #93: Maximum Entropy Model

2017-07-01 Thread helenahm
GitHub user helenahm opened a pull request:

https://github.com/apache/incubator-hivemall/pull/93

Maximum Entropy Model

## What changes were proposed in this pull request?

A Distributed Max Entropy Model

## What type of PR is it?

Feature

## What is the Jira issue?

?

## How was this patch tested?

There are two tests at  the moment, 
hivemall.smile.classification.MaxEntUDTFTest.java
and hivemall.smile.tools.TreePredictUDFTest.java

plus I have tested the code on EMR:

add jar hivemall-core-0.4.2-rc.2-maxent-with-dependencies.jar;
add jar opennlp-maxent-3.0.0.jar;
source define-all.hive;
create temporary function train_maxent_classifier as 
'hivemall.smile.classification.MaxEntUDTF';
create temporary function predict_maxent_classifier as 
'hivemall.smile.tools.MaxEntPredictUDF';
drop table tmodel_maxent;
CREATE TABLE tmodel_maxent 
STORED AS SEQUENCEFILE 
AS
select
  train_maxent_classifier(features, klass, "-attrs Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q,Q,Q,Q,Q,Q,Q,Q")
from
  t_test_maxent;

create table tmodel_combined as
select model, attributes, features, klass from t_test_maxent join 
tmodel_maxent;

create table tmodel_predicted as
select
predict_maxent_classifier(model, attributes, features) result, klass from 
tmodel_combined;

Source table:
drop table t_test_maxent;
create table t_test_maxent as select
array( 
x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,
cast(tWord(x37) as double),
cast(tWord(x38) as double),
cast(tWord(x39) as double),
cast(tWord(x40) as double),
cast(tWord(x41) as double),
cast(tWord(x42) as double),
cast(tWord(x43) as double),
cast(tWord(x44) as double),
cast(contentWord(x45) as double),
cast(contentWord(x46) as double),
cast(contentWord(x47) as double),
cast(contentWord(x48) as double),
cast(contentWord(x49) as double),
cast(contentWord(x50) as double),
cast(contentWord(x51) as double),
cast(contentWord(x52) as double),
cast(contentWord(x53) as double),
cast(presentationWord(x54) as double),
cast(presentationWord(x55) as double),
cast(presentationWord(x56) as double),
cast(presentationWord(x57) as double),
cast(presentationWord(x58) as double),
cast(presentationWord(x59) as double),
cast(presentationWord(x60) as double),
cast(presentationWord(x61) as double),
cast(presentationWord(x62) as double),
x63,x64,x65,x66,x67,x68,x69,x70) features
, klass from pdfs_and_tiffs_instances_combined_instances where 
regexp_replace(tp, 'T', '') == '76_698_855_347';


## How to use this feature?

The Maximum Entropy Classifier is, from my point of view, the most useful 
classification technique for many NLP tasks and for many other tasks that are 
not related to NLP. It is used for part-of-speech tagging, NER, and some other 
tasks.

I have been searching for a distributed version of it and found only one 
article that talks about it: "Efficient Large Scale Distributed Training of 
Conditional Maximum Entropy Models" by Mehryar Mohri [quite well-known] and his 
colleagues at Google. (Please let me know how I can send you the article if 
you cannot get it by googling.) Thus, I think it is time to implement it. I 
plan to use the Mixture Weight Method they describe.
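
(As I read the article: with p mappers each producing a weight vector w_k, the 
Mixture Weight Method simply sets w_mix = (1/p) * (w_1 + ... + w_p), a uniform 
average of the per-mapper parameters. The point of the article is that this 
single round of communication comes with theoretical guarantees and accuracy 
close to fully distributed training.)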

By now, the final UDAF is still to be implemented (the one that collects all 
the models and averages the weights); I plan to commit it next week. 

See if you like the idea and will accept the code. It is based on Apache 
maxent, which is open source and written in a simple way.

Regards,
Elena.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/helenahm/incubator-hivemall master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/93.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #93


commit 45a656aa7278066ce3fc36fcd81fb1eca11f1079
Author: helenahm 
Date:   2017-06-02T05:10:13Z

Update LDAUDTFTest.java

commit fef9c1ce719d3924a28cc90d71d40728dc5c7563
Author: helenahm 
Date:   2017-06-02T05:22:54Z

Merge pull request #1 from helenahm/helenahm-patch-1

Update LDAUDTFTest.java

commit e92b13aa3cb4fc193ea3da3fadd8a8fe8a6a073b
Author: AKHMATOVA, Elena 
Date:   2017-07-02T03:41:14Z

maxent

commit d4031550f80007045353f1e24e58c99244ab3db3
Author: AK

[GitHub] incubator-hivemall issue #82: Encoding related bug in LDA.

2017-06-04 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/82
  
Yes, thank you very very much, the compiled code works without any errors! 
:-) 




[GitHub] incubator-hivemall issue #82: Encoding related bug in LDA.

2017-06-02 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/82
  
You are right though, my bad, the error comes from null rows only. Sorry.




[GitHub] incubator-hivemall issue #82: Encoding related bug in LDA.

2017-06-02 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/82
  
I am pretty sure that in my case the document is empty and not null, first 
of all because ln != "" saved that query, not ln is not null; also, it is a 
table obtained as an external table, so if anything, nulls would look like 
"null", which is a non-empty string.

It is up to you, but for a user of Hivemall, LDA sits at the end of a much 
longer pipeline, and nulls and empty strings will not be uncommon; your users 
may be upset to get such an error. 

Also, as a user, after building a model and running predict, I would feel 
good if nulls and empty strings got a topic: -1, or "none", or "NA". I do not 
know how it works now, but I think handling nulls and empty strings would make 
people feel better about the Hivemall library.





[GitHub] incubator-hivemall issue #82: Encoding related bug in LDA.

2017-06-02 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/82
  
Also, in my query (below), I had to add ln != null to avoid submitting empty 
text; otherwise the error below is thrown.

create temporary function train_lda as 'hivemall.topicmodel.LDAUDTF';
with word_counts as (
  select
cast(rownum as int) as docid,
feature(word, count(word)) as word_count
  from data_rownum t1 LATERAL VIEW explode(tokenize(ln, true)) t2 as word
  where
not is_stopword(word) and ln != ""  
group by
rownum, word
)
select
  label, word, avg(lambda) as lambda
from (
  select
train_lda(feature, "-topics 2 -iter 20") as (label, word, lambda)
  from (
select docid, collect_set(word_count) as feature
from word_counts
group by docid
  ) t1
) t2
group by label, word
order by lambda desc
;


Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: 
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
processing row {"rownum":"179","ln":null}
at 
org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:169)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
Error while processing row {"rownum":"179","ln":null}
at 
org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:499)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:160)
... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to 
execute method public java.util.List 
hivemall.tools.text.TokenizeUDF.evaluate(org.apache.hadoop.io.Text,boolean)  on 
object hivemall.tools.text.TokenizeUDF@6240651f of class 
hivemall.tools.text.TokenizeUDF with arguments {null, true:java.lang.Boolean} 
of size 2
at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:991)
at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:194)
at 
org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)
at 
org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
at 
org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
at 
org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
at 
org.apache.hadoop.hive.ql.exec.LateralViewForwardOperator.process(LateralViewForwardOperator.java:39)
at 
org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
at 
org.apache.hadoop.hive.ql.exec.FilterOperator.process(FilterOperator.java:126)
at 
org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
at 
org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
at 
org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:149)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:489)
... 9 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:967)
... 22 more
Caused by: java.lang.NullPointerException
at hivemall.tools.text.TokenizeUDF.evaluate(TokenizeUDF.java:42)
... 26 more





[GitHub] incubator-hivemall issue #82: Encoding related bug in LDA.

2017-06-02 Thread helenahm
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/82
  
Thank you for the library you are developing. Please let me know when the 
code is ready to be pulled, so I can build a new hivemall-core jar and use it 
for LDA.




[GitHub] incubator-hivemall pull request #82: Encoding related bug in LDA.

2017-06-01 Thread helenahm
GitHub user helenahm opened a pull request:

https://github.com/apache/incubator-hivemall/pull/82

Encoding related bug in LDA.

What changes were proposed in this pull request?

I have found a major bug; without fixing it, people will not be able to use 
LDA, and perhaps other algorithms too.

I spotted the bug and made a small fix; I would like you to fix the rest.

What type of PR is it?

[Bug Fix]

What is the Jira issue?

???

How was this patch tested?

I intended to use LDA for my data, on EMR as usual, and LDA failed to process 
my text. So I checked your test, and added code to feed my data into the test 
instead of your two lines. When it ran successfully, I realized that the test 
might be faulty, and indeed, I think close() is called in real life, but not 
in the test. Once it was called, the same errors as on EMR showed up.

I found the lines in the features that caused the errors:
 na‹ve:1
 xž:1

Why, I do not know, but that means I have to pre-process the data prior to 
testing LDA further, plus I am starting to doubt whether it will work for 
other languages.

The EMR error messages for different memory options and numbers of reducers 
are below. Same source, same reason.

Diagnostic Messages for this Task:
 Error: java.lang.RuntimeException: Hive Runtime Error while closing 
operators: Exception caused in the iterative training
 at 
org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:287)
 at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:454)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:393)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Exception 
caused in the iterative training
 at hivemall.topicmodel.LDAUDTF.runIterativeTraining(LDAUDTF.java:511)
 at hivemall.topicmodel.LDAUDTF.close(LDAUDTF.java:309)
 at 
org.apache.hadoop.hive.ql.exec.UDTFOperator.closeOp(UDTFOperator.java:152)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:683)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
 at 
org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:279)
 ... 7 more
 Caused by: java.lang.OutOfMemoryError: Java heap space
 at hivemall.topicmodel.LDAUDTF.runIterativeTraining(LDAUDTF.java:352)
 ... 13 more

Error: java.lang.RuntimeException: Hive Runtime Error while closing 
operators: Exception caused in the iterative training
 at 
org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:287)
 at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:454)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:393)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Exception 
caused in the iterative training
 at hivemall.topicmodel.LDAUDTF.runIterativeTraining(LDAUDTF.java:511)
 at hivemall.topicmodel.LDAUDTF.close(LDAUDTF.java:309)
 at 
org.apache.hadoop.hive.ql.exec.UDTFOperator.closeOp(UDTFOperator.java:152)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:683)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
 at 
org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:279)
 ... 7 more
 Caused by: java.nio.BufferUnderflowException
 at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:271)
 at java.nio.ByteBuffer.get(ByteBuffer.java:715)
 at hivemall.topicmodel.LDAUDTF.runIterativeTraining(LDAUDTF.java:356)
 ... 13 more

Error: java.lang.RuntimeException: Hive Runtime Error while closing 
operators: Exception caused in the iterative training
 at 
org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:287)
 at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:454)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:393)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
 at java.security.AccessController.doPrivileged(Native Method)
 at