[GitHub] incubator-hivemall issue #105: [WIP][HIVEMALL-24] Scalable field-aware facto...

2017-07-31 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/105
  
```sql
SET mapred.max.split.size=6400; -- use more mappers to avoid OOM in mappers
-- Hive on Tez
SET tez.task.resource.memory.mb=3072;
SET hive.tez.java.opts=-server -Xmx2560m -XX:+PrintGCDetails -XX:+UseNUMA -XX:+UseParallelGC -XX:+HeapDumpOnOutOfMemoryError;
-- Hive on MapReduce
SET mapreduce.map.memory.mb=3072;
SET mapreduce.map.java.opts=-server -Xmx2560m -XX:+PrintGCDetails -XX:+UseNUMA -XX:+UseParallelGC -XX:+HeapDumpOnOutOfMemoryError;

-- 46 mappers
INSERT OVERWRITE TABLE ffm_model
select
  train_ffm(features, label, "-c -iters 10 -factors 4 -feature_hashing 20") -- 2^20 = 1,048,576 features
from
  train
;
```
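The arithmetic behind these settings can be checked with a small sketch. It assumes only what the query above states: `-feature_hashing 20` gives 2^20 hash buckets, and a 3072 MB container runs a 2560 MB heap. The class name, the method names, and the use of `String.hashCode` as a stand-in hash are hypothetical and are not Hivemall's implementation.

```java
/** Hypothetical helper for the numbers in the settings above; not part of Hivemall. */
final class FfmSettingsSketch {

    /** Number of hash buckets for a bit width (bits must be below 31), e.g. 20 -> 2^20. */
    static int buckets(int bits) {
        return 1 << bits;
    }

    /**
     * Maps a feature name to a bucket in [0, 2^bits).
     * String.hashCode is only a stand-in hash chosen for illustration.
     */
    static int bucketOf(String feature, int bits) {
        return Math.floorMod(feature.hashCode(), buckets(bits));
    }

    /** Container memory left outside the JVM heap, in MB. */
    static int headroomMb(int containerMb, int heapMb) {
        return containerMb - heapMb;
    }

    public static void main(String[] args) {
        System.out.println("buckets  = " + buckets(20));
        System.out.println("headroom = " + headroomMb(3072, 2560) + " MB");
        System.out.println("bucket   = " + bucketOf("user_id:42", 20));
    }
}
```

With these values the heap takes 2560 of the 3072 MB, leaving 512 MB for everything the JVM allocates outside the heap.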

```
Caused by: java.lang.OutOfMemoryError: Java heap space
at hivemall.utils.collections.maps.Int2LongOpenHashTable.rehash(Int2LongOpenHashTable.java:320)
at hivemall.utils.collections.maps.Int2LongOpenHashTable.ensureCapacity(Int2LongOpenHashTable.java:310)
at hivemall.utils.collections.maps.Int2LongOpenHashTable.preAddEntry(Int2LongOpenHashTable.java:202)
at hivemall.utils.collections.maps.Int2LongOpenHashTable.put(Int2LongOpenHashTable.java:146)
at hivemall.fm.FFMStringFeatureMapModel.getV(FFMStringFeatureMapModel.java:168)
at hivemall.fm.FieldAwareFactorizationMachineModel.predict(FieldAwareFactorizationMachineModel.java:89)
at hivemall.fm.FieldAwareFactorizationMachineUDTF.trainTheta(FieldAwareFactorizationMachineUDTF.java:194)
at hivemall.fm.FieldAwareFactorizationMachineUDTF.train(FieldAwareFactorizationMachineUDTF.java:184)
at hivemall.fm.FactorizationMachineUDTF.process(FactorizationMachineUDTF.java:285)
at org.apache.hadoop.hive.ql.exec.UDTFOperator.process(UDTFOperator.java:116)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:878)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:878)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:149)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:489)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:86)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:70)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:360)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:172)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:160)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
```
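The trace above fails inside `rehash`, reached from `put` via `ensureCapacity`. One plausible reading, stated here as an assumption rather than a diagnosis, is that growing an open-addressing table allocates the larger arrays while the old ones are still referenced, so peak usage during a doubling is roughly three times the old table size. The sketch below illustrates that pattern with a hypothetical `IntLongTableSketch` class; it is not Hivemall's `Int2LongOpenHashTable`, and its sentinel, load factor, and hash mix are arbitrary choices.

```java
import java.util.Arrays;

/** Illustrative open-addressing int->long map; NOT Hivemall's Int2LongOpenHashTable. */
final class IntLongTableSketch {
    // Sentinel marking an empty slot (hypothetical choice); this key cannot be stored.
    private static final int FREE = Integer.MIN_VALUE;

    private int[] keys;
    private long[] values;
    private int used;

    /** @param capacity initial slot count, must be a power of two */
    IntLongTableSketch(int capacity) {
        keys = new int[capacity];
        values = new long[capacity];
        Arrays.fill(keys, FREE);
    }

    void put(int key, long value) {
        // Grow before the load factor would exceed 0.75.
        if ((used + 1) * 4 > keys.length * 3) {
            rehash(keys.length * 2);
        }
        int i = slot(key, keys);
        if (keys[i] == FREE) {
            keys[i] = key;
            used++;
        }
        values[i] = value;
    }

    long get(int key, long defaultValue) {
        int i = slot(key, keys);
        return keys[i] == key ? values[i] : defaultValue;
    }

    int size() { return used; }

    int capacity() { return keys.length; }

    /** Linear probing: returns the slot holding key, or the first free slot. */
    private static int slot(int key, int[] table) {
        int mask = table.length - 1;
        int i = mix(key) & mask;
        while (table[i] != FREE && table[i] != key) {
            i = (i + 1) & mask;
        }
        return i;
    }

    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        return h;
    }

    private void rehash(int newCapacity) {
        // The old arrays stay reachable until the copy finishes,
        // so old and new tables occupy the heap at the same time.
        int[] newKeys = new int[newCapacity];
        long[] newValues = new long[newCapacity];
        Arrays.fill(newKeys, FREE);
        for (int j = 0; j < keys.length; j++) {
            if (keys[j] != FREE) {
                int i = slot(keys[j], newKeys);
                newKeys[i] = keys[j];
                newValues[i] = values[j];
            }
        }
        keys = newKeys;
        values = newValues;
    }

    /** Bytes held by the backing arrays while doubling: old table plus new table. */
    static long peakBytesDuringDoubling(int oldCapacity) {
        long perSlot = Integer.BYTES + Long.BYTES; // 4 + 8 bytes per slot
        return perSlot * oldCapacity + perSlot * 2L * oldCapacity;
    }

    public static void main(String[] args) {
        IntLongTableSketch table = new IntLongTableSketch(4);
        for (int i = 0; i < 100; i++) {
            table.put(i, 10L * i);
        }
        System.out.println("size=" + table.size() + " capacity=" + table.capacity());
        System.out.println("peak bytes when doubling 2^20 slots: " + peakBytesDuringDoubling(1 << 20));
    }
}
```

Under this reading, a mapper that is already close to its heap limit can fail at the moment the table doubles, even though the table itself would fit once the old arrays are collected.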


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...

2017-07-31 Thread kottmann
Github user kottmann commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/93#discussion_r130337244
  
--- Diff: core/pom.xml ---
@@ -103,6 +103,12 @@
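            <version>${guava.version}</version>
            <scope>provided</scope>
        </dependency>
+       <dependency>
+           <groupId>opennlp</groupId>
+           <artifactId>maxent</artifactId>
+           <version>3.0.0</version>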
--- End diff --

Over time we fixed a couple of bugs, added new features, and did the usual 
maintenance (e.g. testing on recent Java versions), so yes, it probably makes 
sense to use a recent version when you build something new. It also now 
supports multi-threaded training.

The version you are linking above is 7 years old.




[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...

2017-07-31 Thread helenahm
Github user helenahm commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/93#discussion_r130332904
  
--- Diff: core/pom.xml ---
@@ -103,6 +103,12 @@
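            <version>${guava.version}</version>
            <scope>provided</scope>
        </dependency>
+       <dependency>
+           <groupId>opennlp</groupId>
+           <artifactId>maxent</artifactId>
+           <version>3.0.0</version>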
--- End diff --

Thank you for the comment. Is there a reason for chasing the latest 
version? Algorithms do not age... Was an error discovered in the maxent 3.0.0 
training?  




[GitHub] incubator-hivemall pull request #93: [WIP][HIVEMALL-126] Maximum Entropy Mod...

2017-07-31 Thread kottmann
Github user kottmann commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/93#discussion_r130319727
  
--- Diff: core/pom.xml ---
@@ -103,6 +103,12 @@
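            <version>${guava.version}</version>
            <scope>provided</scope>
        </dependency>
+       <dependency>
+           <groupId>opennlp</groupId>
+           <artifactId>maxent</artifactId>
+           <version>3.0.0</version>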
--- End diff --

Consider using a recent version of OpenNLP instead. The maxent code was 
moved into opennlp-tools (latest version is 1.8.1). The version you use here is 
from a couple of years ago.




[jira] [Created] (HIVEMALL-137) Support SMOTE oversampling for unbalanced data

2017-07-31 Thread Makoto Yui (JIRA)
Makoto Yui created HIVEMALL-137:
---

             Summary: Support SMOTE oversampling for unbalanced data
                 Key: HIVEMALL-137
                 URL: https://issues.apache.org/jira/browse/HIVEMALL-137
             Project: Hivemall
          Issue Type: Sub-task
            Reporter: Makoto Yui


https://www.jair.org/media/953/live-953-2037-jair.pdf
http://contrib.scikit-learn.org/imbalanced-learn/generated/imblearn.over_sampling.SMOTE.html
http://qiita.com/shima_x/items/370587304ef17e7a61b8 (in Japanese)
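For readers unfamiliar with SMOTE, the core idea can be sketched as follows: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. The code is a minimal illustration under that description only; the class name, method signature, and brute-force neighbour search are hypothetical and say nothing about how Hivemall would implement this issue.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

/** Minimal SMOTE illustration; hypothetical, not a Hivemall API. */
final class SmoteSketch {

    /**
     * Generates perSample synthetic rows for every minority row.
     *
     * @param minority  minority-class feature vectors (at least two rows)
     * @param k         number of nearest neighbours to choose from
     * @param perSample synthetic rows to create per original row
     * @param seed      random seed, for reproducibility
     */
    static double[][] oversample(double[][] minority, int k, int perSample, long seed) {
        int n = minority.length;
        if (n < 2 || k < 1) {
            throw new IllegalArgumentException("need at least two rows and k >= 1");
        }
        Random rnd = new Random(seed);
        int neighbours = Math.min(k, n - 1);
        double[][] out = new double[n * perSample][];
        int o = 0;
        for (int row = 0; row < n; row++) {
            final double[] x = minority[row];
            // Indices of all other minority rows, sorted by distance to x (brute force).
            Integer[] others = new Integer[n - 1];
            for (int c = 0, p = 0; c < n; c++) {
                if (c != row) {
                    others[p++] = c;
                }
            }
            Arrays.sort(others,
                    Comparator.comparingDouble((Integer c) -> squaredDistance(x, minority[c])));
            for (int s = 0; s < perSample; s++) {
                double[] neighbour = minority[others[rnd.nextInt(neighbours)]];
                double gap = rnd.nextDouble(); // position along the segment, in [0, 1)
                double[] synthetic = new double[x.length];
                for (int d = 0; d < x.length; d++) {
                    synthetic[d] = x[d] + gap * (neighbour[d] - x[d]);
                }
                out[o++] = synthetic;
            }
        }
        return out;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] minority = {{0, 0}, {1, 0}, {0, 1}};
        for (double[] p : oversample(minority, 2, 2, 42L)) {
            System.out.println(Arrays.toString(p));
        }
    }
}
```

Because every synthetic row is an interpolation between two existing minority rows, the generated points stay inside the region spanned by the minority class.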



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)