[GitHub] incubator-hivemall issue #105: [WIP][HIVEMALL-24] Scalable field-aware facto...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/105

```sql
SET mapred.max.split.size=6400; -- use more mappers to avoid OOM in mappers

-- Hive on Tez
SET tez.task.resource.memory.mb=3072;
SET hive.tez.java.opts=-server -Xmx2560m -XX:+PrintGCDetails -XX:+UseNUMA -XX:+UseParallelGC -XX:+HeapDumpOnOutOfMemoryError;

-- Hive on MapReduce
SET mapreduce.map.memory.mb=3072;
SET mapreduce.map.java.opts=-server -Xmx2560m -XX:+PrintGCDetails -XX:+UseNUMA -XX:+UseParallelGC -XX:+HeapDumpOnOutOfMemoryError;

-- 46 mappers
INSERT OVERWRITE TABLE ffm_model
SELECT train_ffm(features, label, "-c -iters 10 -factors 4 -feature_hashing 20") -- 2^20 = 1,048,576 features
FROM train;
```

```
Caused by: java.lang.OutOfMemoryError: Java heap space
    at hivemall.utils.collections.maps.Int2LongOpenHashTable.rehash(Int2LongOpenHashTable.java:320)
    at hivemall.utils.collections.maps.Int2LongOpenHashTable.ensureCapacity(Int2LongOpenHashTable.java:310)
    at hivemall.utils.collections.maps.Int2LongOpenHashTable.preAddEntry(Int2LongOpenHashTable.java:202)
    at hivemall.utils.collections.maps.Int2LongOpenHashTable.put(Int2LongOpenHashTable.java:146)
    at hivemall.fm.FFMStringFeatureMapModel.getV(FFMStringFeatureMapModel.java:168)
    at hivemall.fm.FieldAwareFactorizationMachineModel.predict(FieldAwareFactorizationMachineModel.java:89)
    at hivemall.fm.FieldAwareFactorizationMachineUDTF.trainTheta(FieldAwareFactorizationMachineUDTF.java:194)
    at hivemall.fm.FieldAwareFactorizationMachineUDTF.train(FieldAwareFactorizationMachineUDTF.java:184)
    at hivemall.fm.FactorizationMachineUDTF.process(FactorizationMachineUDTF.java:285)
    at org.apache.hadoop.hive.ql.exec.UDTFOperator.process(UDTFOperator.java:116)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:878)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:878)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:149)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:489)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:86)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:70)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:360)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:172)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:160)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
```

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
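The `-feature_hashing 20` flag caps the hashed feature space at 2^20 entries, but an FFM model keeps one latent vector per (feature, field) pair, so the in-memory table can still outgrow a ~2.5 GB heap. A rough back-of-envelope estimate can be sketched as follows; the field count, bytes per weight, and hash-table overhead factor are illustrative assumptions, not values from the thread:

```python
# Back-of-envelope heap estimate for an FFM model under feature hashing.
# Assumptions (not from the source thread): 4-byte float weights, and an
# open-addressing hash table carrying roughly 2x overhead for keys and
# load factor. Factors and hashing bits follow the example query.

def ffm_heap_estimate_mb(feature_hashing_bits=20, num_fields=40,
                         factors=4, bytes_per_weight=4, table_overhead=2.0):
    """Rough worst-case model size in MB: one latent vector of length
    `factors` per (feature, field) pair actually observed."""
    num_features = 1 << feature_hashing_bits       # 2^20 = 1,048,576
    entries = num_features * num_fields            # worst case: every pair seen
    raw_bytes = entries * factors * bytes_per_weight
    return raw_bytes * table_overhead / (1024 * 1024)

# With -factors 4 -feature_hashing 20 and a hypothetical 40 fields,
# the worst-case table alone approaches 1.3 GB:
print(ffm_heap_estimate_mb())
```

This is why the thread's workaround of shrinking `mapred.max.split.size` (more mappers, less data and fewer distinct features per mapper) relieves the pressure: each mapper's hash table only grows with the (feature, field) pairs it actually sees.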
[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...
Github user kottmann commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/93#discussion_r130337244

--- Diff: core/pom.xml ---
```
@@ -103,6 +103,12 @@
       ${guava.version}
       provided
+
+      opennlp
+      maxent
+      3.0.0
```
--- End diff --

We fixed a couple of bugs over time, added new features, and did the usual maintenance (e.g. testing on recent Java versions), so yes, it probably makes sense to use a recent version when you build something new. It also now supports multi-threaded training. The version you are linking above is 7 years old.
[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...
Github user helenahm commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/93#discussion_r130332904

--- Diff: core/pom.xml ---
```
@@ -103,6 +103,12 @@
       ${guava.version}
       provided
+
+      opennlp
+      maxent
+      3.0.0
```
--- End diff --

Thank you for the comment. Is there a reason for chasing the latest version? Algorithms do not age... Was an error discovered in the maxent 3.0.0 training?
[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...
Github user kottmann commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/93#discussion_r130319727

--- Diff: core/pom.xml ---
```
@@ -103,6 +103,12 @@
       ${guava.version}
       provided
+
+      opennlp
+      maxent
+      3.0.0
```
--- End diff --

Consider using a recent version of OpenNLP instead. The maxent code was moved into opennlp-tools (the latest version is 1.8.1). The version used here is from a couple of years ago.
[jira] [Created] (HIVEMALL-137) Support SMOTE oversampling for unbalanced data
Makoto Yui created HIVEMALL-137:

Summary: Support SMOTE oversampling for unbalanced data
Key: HIVEMALL-137
URL: https://issues.apache.org/jira/browse/HIVEMALL-137
Project: Hivemall
Issue Type: Sub-task
Reporter: Makoto Yui

https://www.jair.org/media/953/live-953-2037-jair.pdf
http://contrib.scikit-learn.org/imbalanced-learn/generated/imblearn.over_sampling.SMOTE.html
http://qiita.com/shima_x/items/370587304ef17e7a61b8 (in Japanese)

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
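The core of SMOTE (described in the JAIR paper linked above) is simple: for each synthetic sample, take a minority-class point, pick one of its k nearest minority neighbors, and interpolate a new point at a random position on the line segment between them. A minimal, stdlib-only sketch of that idea, purely for illustration and not the proposed Hivemall implementation:

```python
# Minimal SMOTE sketch: synthesize minority-class points by interpolating
# between a sample and one of its k nearest minority neighbors.
# All names here are illustrative, not Hivemall APIs.
import math
import random

def smote(minority, n_synthetic, k=5, seed=42):
    """minority: list of feature vectors (lists of floats).
    Returns n_synthetic interpolated vectors."""
    rng = random.Random(seed)

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen base point
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: dist(base, p))[:k]
        nn = rng.choice(neighbors)
        gap = rng.random()  # random position along the segment [base, nn]
        synthetic.append([b + gap * (n - b) for b, n in zip(base, nn)])
    return synthetic

minority_points = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
print(len(smote(minority_points, n_synthetic=3, k=2)))  # 3
```

Because each synthetic point is a convex combination of two existing minority points, the new samples stay inside the minority class's local neighborhood rather than being naive duplicates, which is what distinguishes SMOTE from plain random oversampling.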