[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 @myui I share the concern: the modification should be based on the latest release of Apache OpenNLP, v1.8.1, if there is no reason to use the pre-Apache release. Had I known about the newer version of maxent at the very beginning, I would have used it. I will examine the newer maxent code in the next few days. As you said, have a look at the PR when you have time, and then a decision on what to do can be made.
[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 "Yeah, so if you have a lot of training data then running out of memory is one symptom you run into, but that is not the actual problem of this implementation." - it was the big problem for me on Hadoop, and that is why I had to alter the training code; the newer version of the code is as bad as the old one in this respect. "The actual cause is that it won't scale beyond one machine." - yes, that is why I really like what the Hivemall project is about, and why I needed MaxEnt for Hive. "In case you manage to make this run with much more data the time it will take to run will be uncomfortably high." - that is why I tested my new implementation on almost 100 million training samples and saw each of the 302 mappers finish its work in very reasonable time.
[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 It will involve some work. Let me explain. You were right when you said that the OpenNLP implementation is poor memory-wise. Indeed, they store the data in [][] arrays, and several times over. Using their code directly causes Java heap space errors, GC errors, etc. (I tested that on my 97 million data rows; the newer version of the code has the same problems.) And you were right about the wonderful CSRMatrix, and DoKMatrix too: they allow storing much more data. Thus, more or less, I have changed all the [][] arrays related to input data to CSRMatrix, and the [][] holding the weights to DoKMatrix. To see this, it is best to look at the source code of the GISTrainer - in fact of all three of them: old maxent, new maxent, and Hivemall's BigGISTrainer. The links are below.

Newer GISTrainer: https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/ml/maxent/GISTrainer.java
Older (3.0.0) GISTrainer: https://sourceforge.net/projects/maxent/files/ - whole archive; GISTrainer attached: [GISTrainer.txt](https://github.com/apache/incubator-hivemall/files/1192806/GISTrainer.txt)
Hivemall GISTrainer: https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/BigGISTrainer.java

Notice how trainModel of BigGISTrainer gets a MatrixForTraining (https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/MatrixForTraining.java), which holds references to the Matrix and the outcomes. This is a CSRMatrix. Row data is collected from the CSRMatrix in MatrixForTraining instead of from a double[][], as in

    ComparableEvent ev = x.createComparableEvent(ti, di.getPredicateIndex(), di.getOMap());

(they use this convenience Event type to work with a row of data; instead of storing a List of Events in memory, the modified code builds an event only when needed), and the results are stored in

    Matrix predCount = new DoKMatrix(numPreds, numOutcomes);

instead of a [][] again. GISTrainer did not change very dramatically. If the 3.0.0 training is reliable enough, I would, of course, consider the existing version as 1.0 and put in the effort to adapt the newer GISTrainer later on. It makes sense to do that, I totally agree. And perhaps after that it makes sense to study the training process in greater detail and write a comparable trainer that is independent of OpenNLP.
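To make the row-at-a-time pattern concrete, here is a minimal sketch; the class and method names are simplified stand-ins for the real MatrixForTraining/ComparableEvent API (which also carries the predicate index and outcome map), and it assumes Hivemall's Matrix exposes numColumns() and get(row, col, defaultValue):

    import hivemall.math.matrix.Matrix;

    // Minimal sketch of building a training row on demand from a sparse matrix
    // instead of keeping a List<Event> (or a double[][]) for the whole data set.
    final class MatrixBackedEvents {
        private final Matrix x;        // CSRMatrix: one training sample per row
        private final int[] outcomes;  // outcome id per row

        MatrixBackedEvents(Matrix x, int[] outcomes) {
            this.x = x;
            this.outcomes = outcomes;
        }

        // Materialize a single row only when the trainer asks for it.
        double[] row(int i) {
            final double[] context = new double[x.numColumns()];
            for (int j = 0; j < context.length; j++) {
                context[j] = x.get(i, j, 0.d);  // sparse lookup; 0 for absent cells
            }
            return context;
        }

        int outcome(int i) {
            return outcomes[i];
        }
    }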
[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 I think the code is ready to be checked and pulled now.
[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...
Github user helenahm commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/93#discussion_r130533347 --- Diff: core/pom.xml ---

    @@ -103,6 +103,12 @@
                 <version>${guava.version}</version>
                 <scope>provided</scope>
    +
    +        <dependency>
    +            <groupId>opennlp</groupId>
    +            <artifactId>maxent</artifactId>
    +            <version>3.0.0</version>
    +        </dependency>

--- End diff -- In general I totally agree. I think it would be good to perform the move to another version of maxent in a few steps. 1. The code I have re-used is that of GISTrainer - that is, more or less, updating the weights in a matrix, where the matrix is Hivemall's Matrix. Everything else just follows your class structure. I have checked that the resulting models are the same, and I have also confirmed that the resulting model makes sense on my own data, so the resulting weights must be correct. Can we say that training is correct and accept the current version as the correct and functioning one? 2. After that, there are a few options. We could try to re-write the code in a way that will accept the newest version of opennlp maxent and all following versions. I guess that would require changes in opennlp maxent too, but perhaps it is better than manually altering GISTrainer every time you update something, and both projects would benefit from such collaboration. If not, then perhaps, for Hivemall as a project, we may consider re-writing iterative scaling from scratch to make it Hivemall-efficient, perhaps using the tricks OpenNLP uses to make the code more efficient, and making sure that the resulting weights are comparable, but without aiming to be able to plug in a new OpenNLP jar each time a new version appears. What do you think? Regards, Elena.
[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...
Github user helenahm commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/93#discussion_r130332904 --- Diff: core/pom.xml ---

    @@ -103,6 +103,12 @@
                 <version>${guava.version}</version>
                 <scope>provided</scope>
    +
    +        <dependency>
    +            <groupId>opennlp</groupId>
    +            <artifactId>maxent</artifactId>
    +            <version>3.0.0</version>
    +        </dependency>

--- End diff -- Thank you for the comment. Is there a reason for chasing the latest version? Algorithms do not age... Was an error discovered in the maxent 3.0.0 training?
[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 "Changes Unknown when pulling b4383db on helenahm:master into ** on apache:master**." What does that mean?
[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 I have also looked at the resulting models last week. Formally, I added only one extra test, in hivemall.opennlp.tools.MaxEntPredictUDFTest.java:

    /**
     * Compare MaxEntropy in HiveMall with that of OpenNLP 3.0.0
     */
    @Test
    public void testResemblenceToOpenNLP() throws Exception {
    }

The code inside the test gets access to the internal model representation and compares feature weights. Using similar code, I have looked at the feature weights in 3 models that were relevant for my dataset. All the models look reasonable; that is, key class features get high weights, and the models predict what they should. One of the models is an aggregated model (aggregated from the 302 models I got from the mappers). This one looks reasonable too.
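The comparison inside the test has roughly the following shape; how the two weight arrays are pulled out of the internal model representations is elided here, and the 1e-6 tolerance is an assumption:

    import static org.junit.Assert.assertEquals;

    // Rough shape of the weight comparison the test performs; extraction of
    // the weight arrays from the two models is elided.
    static void assertSameWeights(double[] opennlpWeights, double[] hivemallWeights) {
        assertEquals(opennlpWeights.length, hivemallWeights.length);
        for (int i = 0; i < opennlpWeights.length; i++) {
            // equal up to an assumed numerical tolerance
            assertEquals(opennlpWeights[i], hivemallWeights[i], 1e-6);
        }
    }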
[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 Yes. I will do all that in the next few days.
[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 Sure, the OpenNLP-based one will be scalable too. The code is still not perfect, but (!) on 97,304,256 rows of data, which required 302 mappers on my cluster, with

    set hive.execution.engine=mr;
    set mapreduce.task.timeout=180;

there were no memory issues; I got 302 models back and aggregated them into one afterwards. I still have to work on the code and check whether the resulting model makes sense.
[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 I think you are right about the CSRMatrix (_x here). I should use it without converting it back to unscalable structures via the DataIndexer, as I literally did with:

    EventStream es = new MatrixEventStream(_x, _y, _attributes);
    AbstractModel model;
    try {
        model = GIS.trainModel(1000, new OnePassRealValueDataIndexer(es, 0), _USE_SMOOTHING);
    } catch (IOException e) {
        throw new HiveException(e.getMessage());
    }

I will re-write the code to be more scalable, changing the OpenNLP code to code that uses Smile's and Hivemall's structures. I was mistaken about the way OpenNLP processes the data from the EventStream.
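The direction of the rewrite, sketched under the assumption that Hivemall's Matrix exposes a non-zero row iterator along the lines of eachNonZeroInRow(int, VectorProcedure); exact signatures may differ:

    // Sketch only: consume the CSRMatrix row by row instead of letting
    // OnePassRealValueDataIndexer expand the whole data set into dense arrays.
    import hivemall.math.matrix.Matrix;
    import hivemall.math.vector.VectorProcedure;

    final class SparseRowScan {
        // Accumulate per-feature sums for one training sample, touching only
        // the features actually present in that (sparse) row.
        static void accumulate(Matrix x, int row, final double[] featureSums) {
            x.eachNonZeroInRow(row, new VectorProcedure() {
                @Override
                public void apply(int col, double value) {
                    featureSums[col] += value;
                }
            });
        }
    }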
[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 I will do more tests too, as I actually need the model for a project, so I plan to test it under "load" as well. I will write about the results. It may have issues similar to those Random Forest has. You are right: in a nutshell, the implementation and the memory concerns are similar. The implementation is as scalable as the implementation of Random Forest: one or more models per mapper, and then a UDAF that combines all the learned models into one final model (a sketch of that averaging follows below). I still use the Random Forest even though, on EMR r4 machines, _numTrees greater than 1 does not work for my dataset. MaxEnt, though, will give me a better model, I think; I will not have to worry about overfitting due to the tree structure, etc. Iterative Scaling can also be re-written from scratch, without using any third-party software; that is an option too. But I am sure the NLP community will more readily accept the implementation and use it if it works exactly the way those guys wrote it. We very much value Adwait Ratnaparkhi's work. Many published articles use exactly that MaxEnt implementation, which means people will be able to use Hivemall and compare their newer results with the results of their previous work.
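The combining UDAF is, in essence, the Mixture Weight Method from the Mohri et al. paper cited in the PR description: a uniform average of the k per-mapper parameter vectors. A minimal sketch of just that arithmetic; the real MaxEntMixtureWeightUDAF operates on the serialized model representation, so this is illustrative only:

    // Uniform mixture weights: w_mix = (1/k) * sum of the k per-mapper models.
    static double[] mixtureWeights(double[][] perMapperWeights) {
        final int k = perMapperWeights.length;
        final double[] avg = new double[perMapperWeights[0].length];
        for (double[] w : perMapperWeights) {
            for (int j = 0; j < avg.length; j++) {
                avg[j] += w[j] / k;
            }
        }
        return avg;
    }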
[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...
Github user helenahm commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/93#discussion_r125552049 --- Diff: core/src/main/java/hivemall/smile/classification/MaxEntUDTF.java ---

    @@ -0,0 +1,440 @@
    +package hivemall.smile.classification;
    +
    +import java.io.FileNotFoundException;
    +import java.io.IOException;
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.BitSet;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.concurrent.Callable;
    +import java.util.concurrent.atomic.AtomicInteger;
    +
    +import javax.annotation.Nonnegative;
    +import javax.annotation.Nonnull;
    +import javax.annotation.Nullable;
    +import javax.annotation.concurrent.GuardedBy;
    +
    +import org.apache.commons.cli.CommandLine;
    +import org.apache.commons.cli.Options;
    +import org.apache.commons.logging.Log;
    +import org.apache.commons.logging.LogFactory;
    +import org.apache.hadoop.hive.ql.exec.MapredContext;
    +import org.apache.hadoop.hive.ql.exec.MapredContextAccessor;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    +import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
    +import org.apache.hadoop.hive.ql.metadata.HiveException;
    +import org.apache.hadoop.hive.serde2.io.DoubleWritable;
    +import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    +import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    +import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
    +import org.apache.hadoop.io.IntWritable;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapred.Reporter;
    +import org.apache.hadoop.mapred.Counters.Counter;
    +
    +import hivemall.UDTFWithOptions;
    +import hivemall.math.matrix.Matrix;
    +import hivemall.math.matrix.MatrixUtils;
    +import hivemall.math.matrix.builders.CSRMatrixBuilder;
    +import hivemall.math.matrix.builders.MatrixBuilder;
    +import hivemall.math.matrix.builders.RowMajorDenseMatrixBuilder;
    +import hivemall.math.matrix.ints.ColumnMajorIntMatrix;
    +import hivemall.math.matrix.ints.DoKIntMatrix;
    +import hivemall.math.matrix.ints.IntMatrix;
    +import hivemall.math.random.PRNG;
    +import hivemall.math.random.RandomNumberGeneratorFactory;
    +import hivemall.math.vector.Vector;
    +import hivemall.math.vector.VectorProcedure;
    +import hivemall.smile.classification.DecisionTree.SplitRule;
    +import hivemall.smile.data.Attribute;
    +import hivemall.smile.tools.MatrixEventStream;
    +import hivemall.smile.tools.SepDelimitedTextGISModelWriter;
    +import hivemall.smile.utils.SmileExtUtils;
    +import hivemall.smile.utils.SmileTaskExecutor;
    +import hivemall.utils.codec.Base91;
    +import hivemall.utils.collections.lists.IntArrayList;
    +import hivemall.utils.hadoop.HiveUtils;
    +import hivemall.utils.hadoop.WritableUtils;
    +import hivemall.utils.lang.Preconditions;
    +import hivemall.utils.lang.Primitives;
    +import hivemall.utils.lang.RandomUtils;
    +
    +import opennlp.maxent.GIS;
    +import opennlp.maxent.io.GISModelWriter;
    +import opennlp.model.AbstractModel;
    +import opennlp.model.Event;
    +import opennlp.model.EventStream;
    +import opennlp.model.OnePassRealValueDataIndexer;
    +
    +@Description(name = "train_maxent_classifier",
    +        value = "_FUNC_(array features, int label [, const boolean classification])"
    +                + " - Returns a maximum entropy model per subset of data.")
    +@UDFType(deterministic = true, stateful = false)
    +public class MaxEntUDTF extends UDTFWithOptions {
    +    private static final Log logger = LogFactory.getLog(MaxEntUDTF.class);
    +
    +    private ListObjectInspector featureListOI;
    +    private PrimitiveObjectInspector featureElemOI;
    +    private PrimitiveObjectInspector labelOI;
    +
    +    private MatrixBuilder matrixBuilder;
    +    private IntArrayList labels;
    +
    +    private boolean _real;
    +    private Attribute[] _attributes;
    +    private static boolean _USE_SMOOTHING;
    +    private double _SMOOTHING_OBSERVATION;
    +
    +    private int _numTrees = 1;
    +
    +    @Nullable
    +    private Reporter _progressReporter;
    +    @Nullable
    +    private Counter _treeBuildTaskCounter;
    +
    +    @Override
    +    protected Options getOptions() {
    +        Options opts = new Options();
[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...
Github user helenahm commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/93#discussion_r125551962 --- Diff: core/src/main/java/hivemall/smile/classification/MaxEntUDTF.java --- (same diff hunk as above)
[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...
Github user helenahm commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/93#discussion_r125551895 --- Diff: core/src/main/java/hivemall/smile/classification/MaxEntUDTF.java --- (same diff hunk as above)
[GitHub] incubator-hivemall issue #93: Maximum Entropy Model
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 I have completed all the small code changes you have commented on.
[GitHub] incubator-hivemall issue #93: Maximum Entropy Model
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 The official page for OpenNLP MaxEnt: http://maxent.sourceforge.net/about.html The choice of models is 100% NLP-influenced. "The opennlp.maxent package was originally built by Jason Baldridge, Tom Morton, and Gann Bierner. We owe a big thanks to Adwait Ratnaparkhi for his work on maximum entropy models for natural language processing applications. His introduction to maxent for NLP and dissertation are what really made opennlp.maxent and our Grok maxent components (POS tagger, end of sentence detector, tokenizer, name finder) possible!"
[GitHub] incubator-hivemall issue #93: Maximum Entropy Model
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 In HIVEMALL-126: "Max Entropy Classifier (a.k.a. Multi-nominal/Multiclass Logistic Regression) [1,2] is useful for Text classification." The Max Entropy Classifier is more often used for Part-of-Speech Tagging and Named Entity Recognition, and some other tasks where context is used as features. Those are also fundamental tasks of NLP, even though Text Classification is a candidate too. Mohri and his colleagues also put the POS task first. As Mohri writes in the article that is the basis for the implementation I have chosen: "Our first set of experiments were carried out with 'medium' scale data sets containing 1M-300M instances. These included: English part-of-speech tagging, generated from the Penn Treebank [16] using the first character of each part-of-speech tag as output, sections 2-21 for training, section 23 for testing and a feature representation based on the identity, affixes, and orthography of the input word and the words in a window of size two; Sentiment analysis, generated from a set of online product, service, and merchant reviews with a three-label output (positive, negative, neutral), with a bag of words feature representation; RCV1-v2 as described by [14], where documents having multiple labels were included multiple times, once for each label; Acoustic Speech Data, a 39-dimensional input consisting of 13 PLP coefficients, plus their first and second derivatives, and 129 outputs (43 phones × 3 acoustic states); and the Deja News Archive, a text topic classification problem generated from a collection of Usenet discussion forums from the years 1995-2000. For all text experiments, we used random feature mixing [9, 20] to control the size of the feature space."
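For reference, the model being discussed is the standard conditional maximum entropy (multiclass logistic regression) model. This is the textbook log-linear form, not a quote from the paper, with feature functions f_i and the weights lambda_i that GIS estimates iteratively:

    P(y \mid x) = \frac{\exp\left( \sum_i \lambda_i f_i(x, y) \right)}
                       {\sum_{y'} \exp\left( \sum_i \lambda_i f_i(x, y') \right)}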
[GitHub] incubator-hivemall issue #93: Maximum Entropy Model
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 I have just put myself as a watcher on HIVEMALL-126, and ran the mvn formatter. The formatter made changes in heaps of files. Shall I commit all the changes, or would it be better to pinpoint my own files only?
[GitHub] incubator-hivemall issue #93: Maximum Entropy Model
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 "• Could you create a JIRA issue for the MaxEntropy classifier and rename the PR title to [WIP][HIVEMALL-xxx]?" How do I do that? "• Could you apply Hivemall code formatting? You can use mvn formatter:format or this style file for Eclipse/IntelliJ." Just mvn formatter:format? How does it work? Is there something in the pom that tells Maven how to re-format the files?
[GitHub] incubator-hivemall issue #93: Maximum Entropy Model
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 Just added a MaxEntMixtureWeightUDAF to aggregate the weights of the models obtained on each part of the data:

    create temporary function aggregate_classifiers as 'hivemall.smile.tools.MaxEntMixtureWeightUDAF';

Tested it on EMR only:

    add jar hivemall-core-0.4.2-rc.2-maxent-with-dependencies.jar;
    add jar opennlp-maxent-3.0.0.jar;
    source define-all.hive;
    create temporary function train_maxent_classifier as 'hivemall.smile.classification.MaxEntUDTF';
    create temporary function predict_maxent_classifier as 'hivemall.smile.tools.MaxEntPredictUDF';
    create temporary function aggregate_classifiers as 'hivemall.smile.tools.MaxEntMixtureWeightUDAF';

    select aggregate_classifiers(model) from tmodel5;

where tmodel5 contains 5 rows of the same model. I will do more testing.
[GitHub] incubator-hivemall issue #93: Maximum Entropy Model
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/93 And please disregard the LDAUDTFTest.java. We have already discussed that.
[GitHub] incubator-hivemall pull request #93: Maximum Entropy Model
GitHub user helenahm opened a pull request: https://github.com/apache/incubator-hivemall/pull/93 Maximum Entropy Model

## What changes were proposed in this pull request?
A Distributed Max Entropy Model

## What type of PR is it?
Feature

## What is the Jira issue?
?

## How was this patch tested?
There are two tests at the moment, hivemall.smile.classification.MaxEntUDTFTest.java and hivemall.smile.tools.TreePredictUDFTest.java, plus I have tested the code on EMR:

    add jar hivemall-core-0.4.2-rc.2-maxent-with-dependencies.jar;
    add jar opennlp-maxent-3.0.0.jar;
    source define-all.hive;
    create temporary function train_maxent_classifier as 'hivemall.smile.classification.MaxEntUDTF';
    create temporary function predict_maxent_classifier as 'hivemall.smile.tools.MaxEntPredictUDF';

    drop table tmodel_maxent;
    CREATE TABLE tmodel_maxent STORED AS SEQUENCEFILE AS
    select train_maxent_classifier(features, klass, "-attrs Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q,Q,Q,Q,Q,Q,Q,Q")
    from t_test_maxent;

    create table tmodel_combined as
    select model, attributes, features, klass from t_test_maxent join tmodel_maxent;

    create table tmodel_predicted as
    select predict_maxent_classifier(model, attributes, features) result, klass from tmodel_combined;

Source table:

    drop table t_test_maxent;
    create table t_test_maxent as
    select array(
            x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,
            x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,
            cast(tWord(x37) as double), cast(tWord(x38) as double), cast(tWord(x39) as double), cast(tWord(x40) as double),
            cast(tWord(x41) as double), cast(tWord(x42) as double), cast(tWord(x43) as double), cast(tWord(x44) as double),
            cast(contentWord(x45) as double), cast(contentWord(x46) as double), cast(contentWord(x47) as double),
            cast(contentWord(x48) as double), cast(contentWord(x49) as double), cast(contentWord(x50) as double),
            cast(contentWord(x51) as double), cast(contentWord(x52) as double), cast(contentWord(x53) as double),
            cast(presentationWord(x54) as double), cast(presentationWord(x55) as double), cast(presentationWord(x56) as double),
            cast(presentationWord(x57) as double), cast(presentationWord(x58) as double), cast(presentationWord(x59) as double),
            cast(presentationWord(x60) as double), cast(presentationWord(x61) as double), cast(presentationWord(x62) as double),
            x63,x64,x65,x66,x67,x68,x69,x70) features,
           klass
    from pdfs_and_tiffs_instances_combined_instances
    where regexp_replace(tp, 'T', '') == '76_698_855_347';

## How to use this feature?
Maximum Entropy Classifier is, from my point of view, the most useful classification technique for many NLP tasks and for many other tasks that are not related to NLP. It is used for part-of-speech tagging, NER, and some other tasks. I have been searching for a distributed version of it and found only one article that talks about it: "Efficient Large Scale Distributed Training of Conditional Maximum Entropy Models" by Mehryar Mohri [quite well-known] and his colleagues at Google. (Please let me know how I can send you the article if you cannot find it by googling.) Thus, I think it is time to implement it. I plan to use the Mixture Weight Method they describe. A final UDAF (the one that collects all the models and averages the weights) is still to be implemented; I plan to commit it next week. See if you like the idea and will accept the code. It is based on Apache maxent, which is open source and written in a simple way.

Regards, Elena.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/helenahm/incubator-hivemall master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hivemall/pull/93.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #93

commit 45a656aa7278066ce3fc36fcd81fb1eca11f1079
Author: helenahm
Date: 2017-06-02T05:10:13Z
Update LDAUDTFTest.java

commit fef9c1ce719d3924a28cc90d71d40728dc5c7563
Author: helenahm
Date: 2017-06-02T05:22:54Z
Merge pull request #1 from helenahm/helenahm-patch-1 Update LDAUDTFTest.java

commit e92b13aa3cb4fc193ea3da3fadd8a8fe8a6a073b
Author: AKHMATOVA, Elena
Date: 2017-07-02T03:41:14Z
maxent

commit d4031550f80007045353f1e24e58c99244ab3db3
Author: AK
[GitHub] incubator-hivemall issue #82: Encoding related bug in LDA.
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/82 Yes, thank you very very much, the compiled code works without any errors! :-)
[GitHub] incubator-hivemall issue #82: Encoding related bug in LDA.
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/82 You are right though, my bad, the error comes from null rows only. Sorry.
[GitHub] incubator-hivemall issue #82: Encoding related bug in LDA.
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/82 I am pretty sure that in my case the document is empty, not null: first of all because ln != "" saved that query, not "ln is not null"; also, the table was obtained as an external table, so if anything, nulls would look like "null", which is a non-empty string. It is up to you, but for a user of Hivemall, LDA sits at the end of a much longer pipeline, and nulls and empty strings will not be uncommon; your users may be upset to hit such an error. Also, as a user, after building a model and running predict, I would feel better if nulls and empty strings got a topic: -1, or "none", or "Na". I do not know how it works now, but I think handling nulls and empty strings would make people feel better about the Hivemall library.
[GitHub] incubator-hivemall issue #82: Encoding related bug in LDA.
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/82 Also, in my query (below) I had to add ln != null to avoid submitting empty text; otherwise the error further below is thrown.

    create temporary function train_lda as 'hivemall.topicmodel.LDAUDTF';

    with word_counts as (
      select cast(rownum as int) as docid, feature(word, count(word)) as word_count
      from data_rownum t1
      LATERAL VIEW explode(tokenize(ln, true)) t2 as word
      where not is_stopword(word) and ln != ""
      group by rownum, word
    )
    select label, word, avg(lambda) as lambda
    from (
      select train_lda(feature, "-topics 2 -iter 20") as (label, word, lambda)
      from (
        select docid, collect_set(word_count) as feature
        from word_counts
        group by docid
      ) t1
    ) t2
    group by label, word
    order by lambda desc;

Diagnostic Messages for this Task:

    Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"rownum":"179","ln":null}
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:169)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"rownum":"179","ln":null}
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:499)
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:160)
        ... 8 more
    Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.util.List hivemall.tools.text.TokenizeUDF.evaluate(org.apache.hadoop.io.Text,boolean) on object hivemall.tools.text.TokenizeUDF@6240651f of class hivemall.tools.text.TokenizeUDF with arguments {null, true:java.lang.Boolean} of size 2
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:991)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:194)
        at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)
        at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
        at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
        at org.apache.hadoop.hive.ql.exec.LateralViewForwardOperator.process(LateralViewForwardOperator.java:39)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
        at org.apache.hadoop.hive.ql.exec.FilterOperator.process(FilterOperator.java:126)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
        at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:149)
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:489)
        ... 9 more
    Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:967)
        ... 22 more
    Caused by: java.lang.NullPointerException
        at hivemall.tools.text.TokenizeUDF.evaluate(TokenizeUDF.java:42)
        ... 26 more
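The NullPointerException above is thrown when a null text reaches TokenizeUDF.evaluate. A guard of roughly the following shape would let null and empty documents yield no tokens instead of failing; this is an illustrative sketch only, not the actual Hivemall fix, and the whitespace tokenization stands in for the real tokenizer:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.Text;

    // Illustrative guard only: return an empty token list for null/empty input
    // instead of dereferencing it (the NPE above is at TokenizeUDF.evaluate:42).
    public List<Text> evaluate(Text input, boolean toLowerCase) {
        if (input == null || input.getLength() == 0) {
            return Collections.emptyList();  // null/empty document -> no tokens
        }
        final String s = toLowerCase ? input.toString().toLowerCase() : input.toString();
        final List<Text> tokens = new ArrayList<>();
        final StringTokenizer tok = new StringTokenizer(s);  // whitespace stand-in
        while (tok.hasMoreTokens()) {
            tokens.add(new Text(tok.nextToken()));
        }
        return tokens;
    }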
[GitHub] incubator-hivemall issue #82: Encoding related bug in LDA.
Github user helenahm commented on the issue: https://github.com/apache/incubator-hivemall/pull/82 Thank you for the library you are developing. Please let me know when the code is ready to be pulled, so that I can build a new hivemall-core jar and use it for LDA.
[GitHub] incubator-hivemall pull request #82: Encoding related bug in LDA.
GitHub user helenahm opened a pull request: https://github.com/apache/incubator-hivemall/pull/82 Encoding related bug in LDA.

What changes were proposed in this pull request? I have found a major bug; without fixing it, people will not be able to use LDA, and perhaps other algorithms too. I spotted the bug and made a small fix; I would like you to fix the rest.

What type of PR is it? [Bug Fix]

What is the Jira issue? ???

How was this patch tested? I intended to use LDA for my data, on EMR as usual, but LDA failed to process my text. So I checked your test and added code to feed my data into it instead of your two lines. When it ran successfully, I realized that the test might be faulty, and indeed: I think close() is called in real life, but not in the test. The same errors as on EMR showed up. I found the lines in the features that caused the errors:

    naâ¹ve:1
    xž:1

Why, I do not know, but it means that I have to pre-process the data prior to testing LDA further, plus I am starting to doubt whether it will work for other languages. The EMR error messages for different memory options and numbers of reducers are below. Same source, same reason.

Diagnostic Messages for this Task:

    Error: java.lang.RuntimeException: Hive Runtime Error while closing operators: Exception caused in the iterative training
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:287)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:454)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:393)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Exception caused in the iterative training
        at hivemall.topicmodel.LDAUDTF.runIterativeTraining(LDAUDTF.java:511)
        at hivemall.topicmodel.LDAUDTF.close(LDAUDTF.java:309)
        at org.apache.hadoop.hive.ql.exec.UDTFOperator.closeOp(UDTFOperator.java:152)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:683)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:279)
        ... 7 more
    Caused by: java.lang.OutOfMemoryError: Java heap space
        at hivemall.topicmodel.LDAUDTF.runIterativeTraining(LDAUDTF.java:352)
        ... 13 more

    Error: java.lang.RuntimeException: Hive Runtime Error while closing operators: Exception caused in the iterative training
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:287)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:454)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:393)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Exception caused in the iterative training
        at hivemall.topicmodel.LDAUDTF.runIterativeTraining(LDAUDTF.java:511)
        at hivemall.topicmodel.LDAUDTF.close(LDAUDTF.java:309)
        at org.apache.hadoop.hive.ql.exec.UDTFOperator.closeOp(UDTFOperator.java:152)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:683)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:279)
        ... 7 more
    Caused by: java.nio.BufferUnderflowException
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:271)
        at java.nio.ByteBuffer.get(ByteBuffer.java:715)
        at hivemall.topicmodel.LDAUDTF.runIterativeTraining(LDAUDTF.java:356)
        ... 13 more

    Error: java.lang.RuntimeException: Hive Runtime Error while closing operators: Exception caused in the iterative training
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:287)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:454)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:393)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at