[GitHub] incubator-hivemall issue #105: [WIP][HIVEMALL-24] Scalable field-aware facto...
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/105

Memory consumption of FFM is estimated as follows:

```
(  4  +  4*factors  +     8     +  4+8  ) * fields * features (bytes)
   ~~    ~~~~~~~~~     ~~~~~~~     ~~~~
   Wi      V[k]        adagrad     ftrl

ffm (4+4*4*8+4+8)*39*2^20 bytes = 5.88 GiB
```

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
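As a quick check of the arithmetic in the estimate above, the following Python sketch evaluates the posted numeric expression exactly as written. The function and variable names are illustrative assumptions, not Hivemall code. Evaluated this way, the total is 5,888,802,816 bytes, which is roughly 5.89 GB in decimal units or 5.48 GiB in binary units.

```python
# Illustrative helpers; the names are assumptions, not part of Hivemall.

def entry_bytes(factors: int) -> int:
    """Per-entry size from the general formula: 4 + 4*factors + 8 + 4 + 8."""
    return 4 + 4 * factors + 8 + 4 + 8


def model_bytes(per_entry: int, fields: int, features: int) -> int:
    """Total size: per-entry bytes * fields * features."""
    return per_entry * fields * features


# The numeric instance exactly as posted: (4+4*4*8+4+8)*39*2^20
posted_per_entry = 4 + 4 * 4 * 8 + 4 + 8      # 144 bytes
total = model_bytes(posted_per_entry, fields=39, features=2 ** 20)

print(total)                        # 5888802816
print(round(total / 10 ** 9, 2))    # 5.89 (GB, decimal)
print(round(total / 2 ** 30, 2))    # 5.48 (GiB, binary)
```

`entry_bytes` is included only so the general formula can be tried with other factor counts; the thread does not say which `factors` value the posted instance assumes.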
[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...
Github user kottmann commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93

@myui that was done for the 1.6.0 release, and in maxent 3.0.3 it was modified to run in multiple threads.

You probably need to take an approach similar to the one we took for multi-threaded training, e.g. split the work done per iteration, scale it out to multiple machines, merge the parameters, and repeat for the next iteration.
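The split, merge, and repeat loop described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: parameters are merged by element-wise averaging, the per-shard update is a toy stand-in rather than GIS, and the workers run sequentially in one process. None of the names come from OpenNLP or Hivemall.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def merge(parameter_sets: Sequence[Vector]) -> Vector:
    """Merge per-worker parameters by element-wise averaging (an assumption)."""
    count = len(parameter_sets)
    return [sum(column) / count for column in zip(*parameter_sets)]


def train(shards: Sequence[Sequence[float]],
          initial: Vector,
          local_update: Callable[[Vector, Sequence[float]], Vector],
          iterations: int) -> Vector:
    """Run `iterations` rounds of split, local update, and merge."""
    params = list(initial)
    for _ in range(iterations):
        # Each worker starts the round from the same merged parameters.
        # Workers are simulated sequentially here; a real job would run
        # them on separate machines.
        local = [local_update(list(params), shard) for shard in shards]
        params = merge(local)
    return params


def toward_shard_mean(params: Vector, shard: Sequence[float]) -> Vector:
    """Toy stand-in update: move the single parameter halfway to the shard mean."""
    target = sum(shard) / len(shard)
    return [params[0] + 0.5 * (target - params[0])]


shards = [[1.0, 3.0], [5.0, 7.0]]   # two workers, shard means 2 and 6
result = train(shards, [0.0], toward_shard_mean, iterations=20)
print(result)   # close to [4.0], the mean over all the data
```

The point of the sketch is the control flow: every round redistributes the merged parameters, so each worker only ever needs its own shard in memory.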
[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93

@kottmann Do you know in which version the maxent classifier was moved to opennlp-tools? The versioning schemes of the opennlp-maxent and opennlp-tools modules are very different.

https://mvnrepository.com/artifact/org.apache.opennlp/opennlp-maxent
https://mvnrepository.com/artifact/org.apache.opennlp/opennlp-tools
[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93

@helenahm I agree with using Hivemall's Matrix to reduce memory consumption and creating a custom BigGISTrainer for Hivemall.

My concern is whether the modification can be based on the latest release of Apache OpenNLP, `v1.8.1`, if there is no reason to use the pre-Apache release.

Anyway, I will look into your PR after merging https://github.com/apache/incubator-hivemall/pull/105, maybe next week. Some refactoring (such as removing debug prints and unused code) will be applied on a fork of your PR branch.

BTW, multi-threading should be avoided when running a task in a YARN container. It is better to let Hive handle the parallelism.
[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...
Github user kottmann commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93

@helenahm as far as I know, the training data is stored once in memory, and a copy of the parameters is then stored for each thread.

Yeah, so if you have a lot of training data, running out of memory is one symptom you run into, but it is not the actual problem with this implementation. The actual problem is that it won't scale beyond one machine.

Bottom line: if you want to use GIS training with lots of data, don't use this implementation. The training requires a certain amount of CPU time, and that time increases with the amount of training data. Even if you manage to make this run with much more data, the time it takes to run will be uncomfortably high.
[GitHub] incubator-hivemall issue #107: [HIVEMALL-132] Generalize f1score UDAF to sup...
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/107

Some other DDLs also need to be updated. Please grep for `tree_export` to find which DDLs to update.
[GitHub] incubator-hivemall issue #107: [HIVEMALL-132] Generalize f1score UDAF to sup...
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/107

@nzw Could you update the user guide to include `fmeasure` and `f1score` in `incubator-hivemall/docs/gitbook/eval/classification_measures.md`?

Run `npm install gitbook-cli; gitbook install; gitbook serve` in docs/gitbook.

Also, could you revise the current Evaluation section of https://treasure-data.gyazo.com/5ec4b737dcedd55353f8126040ea5366 to

```
• Binary Classification metrics
• Area Under the ROC Curve
• Regression metrics
• Ranking metrics
```

Refer to the examples in
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
https://turi.com/learn/userguide/evaluation/classification.html#f_scores
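For readers of the user guide request above, the relationship between an F-measure and the F1 score can be sketched in a few lines of Python. This uses the standard F-beta definition computed from binary confusion counts; the function name and arguments are illustrative assumptions and say nothing about the actual signature or options of the Hivemall UDAF, which this thread does not show.

```python
def fbeta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Standard F-beta from binary counts; 0.0 when every count is zero."""
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


# beta = 1 gives the F1 score; beta = 2 weights recall more heavily.
f1 = fbeta(tp=8, fp=2, fn=4)           # 16 / 22
f2 = fbeta(tp=8, fp=2, fn=4, beta=2)   # 40 / 58
print(round(f1, 4), round(f2, 4))
```

With these counts precision is 0.8 and recall is about 0.667, so the recall-weighted F2 comes out lower than F1.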
[GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...
Github user helenahm commented on the issue:

https://github.com/apache/incubator-hivemall/pull/93

It will involve some work. Let me explain.

You were right when you said that the OpenNLP implementation is poor memory-wise. Indeed, they store the data in `[][]` arrays, and several times over. Using their code directly causes Java heap space errors, GC errors, etc. (I tested that on my 97 million rows of data. The newer version of the code has the same problems.)

And you were right about the wonderful CSRMatrix, and DoKMatrix too. They make it possible to store more data. Thus, more or less, I have changed all the `[][]` arrays related to input data to CSRMatrix and the `[][]` arrays holding weights to DoKMatrix.

To explain that further, it is best to look at the source code for the GISTrainer. In fact, all three of them: old maxent, new maxent, and Hivemall's BigGISTrainer. The links are below.

Newer GISTrainer:
https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/ml/maxent/GISTrainer.java

Older (3.0.0) GISTrainer:
https://sourceforge.net/projects/maxent/files/ (whole archive)
GISTrainer attached: [GISTrainer.txt](https://github.com/apache/incubator-hivemall/files/1192806/GISTrainer.txt)

Hivemall GISTrainer:
https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/BigGISTrainer.java

Notice how `trainModel` of BigGISTrainer takes a MatrixForTraining (https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/MatrixForTraining.java), which holds references to the Matrix and the outcomes. The Matrix is a CSRMatrix. Row data is collected from the CSRMatrix in MatrixForTraining instead of from the `double[][]`, in

`ComparableEvent ev = x.createComparableEvent(ti, di.getPredicateIndex(), di.getOMap());`

(They use this convenience Event class to work with a row of data. Instead of storing a List of Events in memory, the modified code builds an event when needed.)

Results are stored in

`Matrix predCount = new DoKMatrix(numPreds, numOutcomes);`

again instead of a `[][]`.

GISTrainer did not change very dramatically.

If 3.0.0 training is reliable enough, I would, of course, prefer to treat the existing version as 1.0 and make the effort to adapt GISTrainer later on. It makes sense to do that, I totally agree. And perhaps it makes sense to continue after that by understanding the training process in greater detail, and perhaps to write a newer comparable trainer that is independent of OpenNLP.
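The storage change described above can be illustrated with a small Python sketch: input rows live in compressed sparse row (CSR) form and a row view is built only when requested, while counts accumulate in a dictionary-of-keys (DoK) structure. The class and method names below are assumptions made for illustration and do not reflect the API of Hivemall's CSRMatrix, DoKMatrix, or MatrixForTraining. The sketch also builds the CSR arrays from a dense list for brevity, which a memory-constrained job would avoid.

```python
from typing import Dict, List, Tuple


class CsrMatrix:
    """Compressed sparse row storage: only non-zero cells are kept."""

    def __init__(self, dense: List[List[float]]) -> None:
        self.values: List[float] = []
        self.col_indices: List[int] = []
        self.row_pointers: List[int] = [0]
        for row in dense:
            for col, value in enumerate(row):
                if value != 0.0:
                    self.values.append(value)
                    self.col_indices.append(col)
            # Entry i + 1 marks where row i ends in the flat arrays.
            self.row_pointers.append(len(self.values))

    def row(self, i: int) -> List[Tuple[int, float]]:
        """Build the (column, value) pairs of row i on demand."""
        start, end = self.row_pointers[i], self.row_pointers[i + 1]
        return list(zip(self.col_indices[start:end], self.values[start:end]))


class DokMatrix:
    """Dictionary-of-keys storage: a map from (row, col) to value."""

    def __init__(self) -> None:
        self.cells: Dict[Tuple[int, int], float] = {}

    def add(self, row: int, col: int, delta: float) -> None:
        self.cells[(row, col)] = self.cells.get((row, col), 0.0) + delta

    def get(self, row: int, col: int) -> float:
        return self.cells.get((row, col), 0.0)


data = CsrMatrix([[0.0, 2.0, 0.0], [1.0, 0.0, 3.0], [0.0, 0.0, 0.0]])
counts = DokMatrix()
for col, value in data.row(1):
    counts.add(col, 0, value)   # accumulate into one (predicate, outcome) cell

print(data.row(1))         # [(0, 1.0), (2, 3.0)]
print(counts.get(2, 0))    # 3.0
```

Both structures pay only for the cells that are actually non-zero, which is the property that matters when most predicates are absent from most rows.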
[GitHub] incubator-hivemall pull request #107: [HIVEMALL-132] Generalize f1score UDAF...
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/107#discussion_r130797590

--- Diff: resources/ddl/define-all.hive ---
@@ -543,8 +543,8 @@ create temporary function lr_datagen as 'hivemall.dataset.LogisticRegressionData
 -- Evaluating functions --
 --
-drop temporary function if exists f1score;
-create temporary function f1score as 'hivemall.evaluation.FMeasureUDAF';
+drop temporary function if exists fmeasure;
+create temporary function fmeasure as 'hivemall.evaluation.FMeasureUDAF';
--- End diff --

Could you keep an alias for `f1score` in the DDLs for backward compatibility?

```sql
-- alias for backward compatibility
drop temporary function if exists f1score;
create temporary function f1score as 'hivemall.evaluation.FMeasureUDAF';

drop temporary function if exists fmeasure;
...
```