[GitHub] incubator-hivemall pull request #175: [WIP][HIVEMALL-230] Revise Optimizer I...

2018-12-12 Thread myui
GitHub user myui opened a pull request:

https://github.com/apache/incubator-hivemall/pull/175

[WIP][HIVEMALL-230] Revise Optimizer Implementation

## What changes were proposed in this pull request?

Revise Optimizer implementation. 

1. Revise the default hyperparameters of AdaDelta and Adam.
2. Support the AdamW, AdamHD, Eve, and YellowFin optimizers.

* Fixing Weight Decay Regularization in Adam
https://openreview.net/forum?id=rk6qdGgCZ
* On the Convergence of Adam and Beyond 
https://openreview.net/forum?id=ryQu7f-RZ
* AdamHD (Adam with Hypergradient descent)
https://arxiv.org/pdf/1703.04782.pdf
* Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates
https://arxiv.org/abs/1611.01505
* YellowFin and the Art of Momentum Tuning
https://arxiv.org/abs/1706.03471
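
For orientation, here is a minimal sketch of a single-weight Adam update with an optional AMSGrad variant and AdamW-style decoupled weight decay, following the general form described in the papers listed above. The class name `AdamSketch`, its fields, and the values used in `main` are hypothetical and are not taken from the Hivemall code in this PR. The sketch assumes that the running maximum is taken over the raw second moment before bias correction and that the decay term is applied directly to the weight; implementations differ on both points.

```java
/** Illustrative single-weight Adam step; not Hivemall's optimizer code. */
final class AdamSketch {
    private final double alpha;   // step size
    private final double beta1;   // decay rate of the first moment
    private final double beta2;   // decay rate of the second moment
    private final double eps;     // small constant to avoid division by zero
    private final double decay;   // decoupled weight decay (0 disables it)
    private final boolean amsgrad;

    private double m;     // first moment estimate
    private double v;     // second moment estimate
    private double vMax;  // running maximum of v (used only when amsgrad is on)
    private long t;       // number of updates so far

    AdamSketch(double alpha, double beta1, double beta2, double eps, double decay,
            boolean amsgrad) {
        this.alpha = alpha;
        this.beta1 = beta1;
        this.beta2 = beta2;
        this.eps = eps;
        this.decay = decay;
        this.amsgrad = amsgrad;
    }

    /** Applies one update for gradient g and returns the new weight. */
    double update(double w, double g) {
        t++;
        m = beta1 * m + (1.0 - beta1) * g;
        v = beta2 * v + (1.0 - beta2) * g * g;
        double second = v;
        if (amsgrad) {
            // AMSGrad: never let the second-moment estimate shrink
            vMax = Math.max(vMax, v);
            second = vMax;
        }
        // bias correction for the zero-initialized moments
        double mHat = m / (1.0 - Math.pow(beta1, t));
        double vHat = second / (1.0 - Math.pow(beta2, t));
        double adamStep = alpha * mHat / (Math.sqrt(vHat) + eps);
        // decoupled weight decay: shrink the weight itself instead of
        // adding an L2 term to the gradient
        return w - adamStep - decay * w;
    }

    public static void main(String[] args) {
        AdamSketch adam = new AdamSketch(0.001, 0.9, 0.999, 1e-8, 0.0, true);
        double w = 0.0;
        for (int i = 0; i < 3; i++) {
            w = adam.update(w, 1.0);
        }
        System.out.println(w);
    }
}
```

With a constant gradient, the bias-corrected moments stay close to the gradient and its square, so each step has a magnitude of roughly `alpha`.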

## What type of PR is it?

Improvement, Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-230

## How was this patch tested?

unit tests, emr (to appear)

## How to use this feature?

to appear

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/myui/incubator-hivemall adam_test

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/175.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #175


commit 5168cf06bf03c38f005d435a4415ce8cb8140891
Author: Makoto Yui 
Date:   2018-12-03T07:04:29Z

Added ongoing unit test files

commit ed1b6302183a687a3584fe62ce5fa92b26c828ad
Author: Makoto Yui 
Date:   2018-12-04T09:41:42Z

Fixed to show ETA in debug log

commit 5c9d63f9fc184f05eed28f03986c6269c4ea6e93
Author: Makoto Yui 
Date:   2018-12-04T09:42:02Z

Added unit tests

commit 243f4b40899b960f4942c75f89c0c4c94974b03b
Author: Makoto Yui 
Date:   2018-12-05T09:48:17Z

Added comments

commit ae29e9a669dcd311b154615e19900ec4b01fd4d8
Author: Makoto Yui 
Date:   2018-12-06T07:08:48Z

Refactored

commit c25ce02db537570c6ed75db74d9a3783b316c694
Author: Makoto Yui 
Date:   2018-12-06T07:10:05Z

Added square() method

commit 71671d10138aa54c0485809b6126753a54dbe3e8
Author: Makoto Yui 
Date:   2018-12-06T07:10:42Z

Added helper methods

commit 6f4edbbaaac37884533132dea00c81f36da45e50
Author: Makoto Yui 
Date:   2018-12-06T07:22:51Z

Refactored ADAM implementation

commit e61f22afaa46bdf705c2760cebaa601929a77608
Author: Makoto Yui 
Date:   2018-12-06T08:52:08Z

Added logging message

commit 22c3f7c132fc01528c93c6e15d40a2b70f1771c0
Author: Makoto Yui 
Date:   2018-12-06T08:53:01Z

Improved -eta option to take eta0 for Fixed ETA estimator

commit e9b9b1420c3b573b5cbe15e4340d862251fac81d
Author: Makoto Yui 
Date:   2018-12-06T08:53:28Z

Added unit test

commit 7c6e4a1da5eaeb99c02a9a83f1519d5274131037
Author: Makoto Yui 
Date:   2018-12-06T09:06:16Z

Made eta default hyper-parameter flexible for each optimizer

commit a92293906d43c25ce47032644774723a0cf713d9
Author: Makoto Yui 
Date:   2018-12-06T09:36:26Z

Changed the default hyperparameter of AdaDelta

commit 1494ea298497a846650b2d9f6799add77105ae77
Author: Makoto Yui 
Date:   2018-12-07T05:03:21Z

Reduced the size of test data

commit 79197a84ca4d840ab3150730d5e6d4a5ad96e719
Author: Makoto Yui 
Date:   2018-12-07T05:39:13Z

Improved -help option handling

commit 4fdcf6c84ec81c174f5e107038660b1200b1a9a5
Author: Makoto Yui 
Date:   2018-12-07T05:48:07Z

Added assertions

commit e1c7a68df679a65f496268bd4acc286b19d0a964
Author: Makoto Yui 
Date:   2018-12-07T07:39:58Z

Fixed AdaDelta eta to 1.0

commit b8e5698ecd7e7d2758ef85a338c053f5bbcc663d
Author: Makoto Yui 
Date:   2018-12-07T09:13:48Z

Supported -amsgrad in Adam

commit aa512c3b71039f97c2ac08b598fcb11f1cfc4d80
Author: Makoto Yui 
Date:   2018-12-07T09:59:59Z

Supported -decay option in ADAM optimizer

commit 19bd276ff9867ba93f42c241feb9aa5aafd0836c
Author: Makoto Yui 
Date:   2018-12-07T10:15:24Z

Revise the default eta0/alpha value

commit 19fa61145e8be18c3f86988905b35f171e1ee50e
Author: Makoto Yui 
Date:   2018-12-10T08:37:05Z

Revised ADAM hyperparameter treatment




---


[GitHub] incubator-hivemall pull request #173: [HIVEMALL-227][DOC] Removed md5 and re...

2018-11-15 Thread myui
GitHub user myui opened a pull request:

https://github.com/apache/incubator-hivemall/pull/173

[HIVEMALL-227][DOC] Removed md5 and replace sha1 with sha512 following new 
ASF policy

## What changes were proposed in this pull request?

Removed md5 and replaced sha1 with sha512, following the new ASF policy.
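
As a rough illustration of what a SHA-512 checksum is, the sketch below hashes a byte array with the JDK's `MessageDigest` and hex-encodes the result. Release artifacts are normally checksummed with command-line tools rather than with code like this, and the class name `Sha512Sketch` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative only: computes a hex-encoded SHA-512 digest. */
final class Sha512Sketch {

    static String sha512Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-512");
            byte[] digest = md.digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // two lowercase hex characters per byte
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-512 is a standard algorithm on Java platforms
            throw new IllegalStateException("SHA-512 is not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] input = "hivemall".getBytes(StandardCharsets.UTF_8);
        System.out.println(sha512Hex(input));
    }
}
```

The digest is 64 bytes long, so the hex string always has 128 characters.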

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-227


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/myui/incubator-hivemall HIVEMALL-227

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/173.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #173


commit 583eb9991cf8db730d46b431b1cb80ebaeb293a8
Author: Makoto Yui 
Date:   2018-11-15T09:18:39Z

Removed md5 and replace sha1 with sha512 following new ASF policy




---


[GitHub] incubator-hivemall issue #171: [SPARK][HOTFIX] Fix the existing test failure...

2018-11-14 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/171
  
Merged. Thanks!


---


[GitHub] incubator-hivemall issue #172: Fix typo

2018-11-13 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/172
  
Merged, thanks! 


---


[GitHub] incubator-hivemall pull request #171: [SPARK][HOTFIX][WIP] Fix existing test...

2018-11-13 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/171#discussion_r233324312
  
--- Diff: spark/spark-2.3/src/test/scala/org/apache/spark/sql/hive/XGBoostSuite.scala ---
@@ -77,6 +77,7 @@ final class XGBoostSuite extends VectorQueryTest {
 val model = hiveContext.sparkSession.read.format("libxgboost").load(tempDir)
 val predict = model.join(mllibTestDf)
   .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
--- End diff --

BTW, could you paste the stack trace of the exception?


---


[GitHub] incubator-hivemall pull request #171: [SPARK][HOTFIX][WIP] Fix existing test...

2018-11-13 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/171#discussion_r233288186
  
--- Diff: spark/pom.xml ---
@@ -52,6 +52,12 @@
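     <artifactId>hivemall-core</artifactId>
     <version>${project.version}</version>
     <scope>compile</scope>
+    <exclusions>
+      <exclusion>
+        <groupId>io.netty</groupId>
+        <artifactId>netty-all</artifactId>
+      </exclusion>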
--- End diff --

ah... I see.


---


[GitHub] incubator-hivemall pull request #171: [SPARK][HOTFIX][WIP] Fix existing test...

2018-11-13 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/171#discussion_r233287740
  
--- Diff: spark/spark-2.3/src/main/scala/org/apache/spark/sql/hive/HivemallOps.scala ---
@@ -1935,18 +1935,6 @@ object HivemallOps {
     )
   }
 
-  /**
-   * @see [[hivemall.tools.array.SubarrayUDF]]
-   * @group tools.array
-   */
-  def subarray(original: Column, fromIndex: Column, toIndex: Column): Column = withExpr {
-    planHiveUDF(
-      "hivemall.tools.array.SubarrayUDF",
-      "subarray",
-      original :: fromIndex :: toIndex :: Nil
-    )
-  }
--- End diff --

Is replacing `SubarrayUDF` with `ArraySliceUDF` not easy?

```
def subarray(original: Column, fromIndex: Column, length: Column): Column = withExpr {
  planHiveUDF(
    "hivemall.tools.array.ArraySliceUDF",
```


---


[GitHub] incubator-hivemall pull request #171: [SPARK][HOTFIX][WIP] Fix existing test...

2018-11-13 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/171#discussion_r233287092
  
--- Diff: spark/spark-2.3/src/test/scala/org/apache/spark/sql/hive/XGBoostSuite.scala ---
@@ -77,6 +77,7 @@ final class XGBoostSuite extends VectorQueryTest {
 val model = hiveContext.sparkSession.read.format("libxgboost").load(tempDir)
 val predict = model.join(mllibTestDf)
   .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
--- End diff --

Let's disable xgboost for spark-2.3.


---


[GitHub] incubator-hivemall pull request #170: [WIP][HIVEMALL-223] Add -kv_map and -v...

2018-11-11 Thread myui
GitHub user myui opened a pull request:

https://github.com/apache/incubator-hivemall/pull/170

[WIP][HIVEMALL-223] Add -kv_map and -vk_map option to to_ordered_list UDAF

## What changes were proposed in this pull request?

Add `-kv_map` and `-vk_map` option to `to_ordered_list` UDAF.

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-223

## How was this patch tested?

unit tests and manual tests on EMR

## How to use this feature?

Will be described in 
http://hivemall.incubator.apache.org/userguide/misc/generic_funcs.html#array

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/myui/incubator-hivemall HIVEMALL-223

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/170.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #170


commit 26f361ce7b355410772577f0754f4bb5537ababf
Author: Makoto Yui 
Date:   2018-11-12T04:19:37Z

Added -kv_map and -vk_map option

commit 39ee911cb12e63f924229e962bbb00247297f75d
Author: Makoto Yui 
Date:   2018-11-12T04:20:13Z

Added WIP unit tests for -kv_map/vk_map option of to_ordered_list UDAF




---


[GitHub] incubator-hivemall issue #163: [HIVEMALL-196] Support BM25 scoring

2018-11-02 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/163
  
@jaxony Merged with some modification. Thank you for your first 
contribution to Apache Hivemall!


---


[GitHub] incubator-hivemall issue #168: [HIVEMALL-221] Add cache to reduce Maven buil...

2018-10-28 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/168
  
We might need to configure an ASF mirror to avoid timeouts from the default 
ASF repository.

https://maven.apache.org/guides/mini/guide-mirror-settings.html
https://code.i-harness.com/ja/q/c326f0


---


[GitHub] incubator-hivemall issue #168: [HIVEMALL-221] Add cache to reduce Maven buil...

2018-10-24 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/168
  
```
[WARNING] Could not transfer metadata org.apache.hivemall:hivemall-spark2.1:0.5.1-incubating-SNAPSHOT/maven-metadata.xml from/to apache.snapshots (https://repository.apache.org/snapshots): Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] failed: Connection timed out (Connection timed out)
[WARNING] Failure to transfer org.apache.hivemall:hivemall-spark2.1:0.5.1-incubating-SNAPSHOT/maven-metadata.xml from https://repository.apache.org/snapshots/ was cached in the local repository, resolution will not be reattempted until the update interval of apache-snapshots has elapsed or updates are forced. Original error: Could not transfer metadata org.apache.hivemall:hivemall-spark2.1:0.5.1-incubating-SNAPSHOT/maven-metadata.xml from/to apache-snapshots (https://repository.apache.org/snapshots/): Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] failed: Connection timed out (Connection timed out)
[WARNING] Failure to transfer org.apache.hivemall:hivemall-spark2.1:0.5.1-incubating-SNAPSHOT/maven-metadata.xml from https://repository.apache.org/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of apache.snapshots has elapsed or updates are forced. Original error: Could not transfer metadata org.apache.hivemall:hivemall-spark2.1:0.5.1-incubating-SNAPSHOT/maven-metadata.xml from/to apache.snapshots (https://repository.apache.org/snapshots): Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] failed: Connection timed out (Connection timed out)
[INFO] Downloading from apache-snapshots: https://repository.apache.org/snapshots/org/apache/hivemall/hivemall-spark2.1/0.5.1-incubating-SNAPSHOT/hivemall-spark2.1-0.5.1-incubating-SNAPSHOT-sources.jar
[INFO] Downloading from apache.snapshots: https://repository.apache.org/snapshots/org/apache/hivemall/hivemall-spark2.1/0.5.1-incubating-SNAPSHOT/hivemall-spark2.1-0.5.1-incubating-SNAPSHOT-sources.jar
```

Hmm, could we provide a mirror repository in Travis CI?


---


[GitHub] incubator-hivemall issue #168: [HIVEMALL-221] Add cache to reduce Maven buil...

2018-10-24 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/168
  
See what happens. 


---


[GitHub] incubator-hivemall pull request #168: [HIVEMALL-221] Add cache to reduce Mav...

2018-10-24 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/168#discussion_r227797186
  
--- Diff: .travis.yml ---
@@ -35,7 +40,7 @@ notifications:
   email: false
 
 script:
-  - ./bin/run_travis_tests.sh
+  - travis_wait 10 ./bin/run_travis_tests.sh
--- End diff --

Please revert this change because it has no effect.


---


[GitHub] incubator-hivemall pull request #168: [HIVEMALL-221] Add cache to reduce Mav...

2018-10-24 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/168#discussion_r227796760
  
--- Diff: .travis.yml ---
@@ -1,5 +1,10 @@
 sudo: false
 
+cache:
+  timeout: 1500
+  directories:
+  - $HOME/.m2
--- End diff --

Shouldn't this be `$HOME/.m2/repository`?

https://github.com/apache/kafka/blob/trunk/.travis.yml#L52
https://github.com/airlift/drift/blob/master/.travis.yml#L11
https://github.com/mesos/storm/blob/master/.travis.yml#L6


---


[GitHub] incubator-hivemall issue #168: [HIVEMALL-221] Add cache to reduce Maven buil...

2018-10-24 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/168
  
@maropu 

Is this `clean` required?

https://github.com/apache/incubator-hivemall/blob/master/bin/run_travis_tests.sh#L42


---


[GitHub] incubator-hivemall pull request #169: [HIVEMALL-222] Introduce Gradient Clip...

2018-10-24 Thread myui
GitHub user myui opened a pull request:

https://github.com/apache/incubator-hivemall/pull/169

[HIVEMALL-222] Introduce Gradient Clipping to avoid exploding gradient to 
General Classifier/Regressor

## What changes were proposed in this pull request?

Avoid [exploding gradients](http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/readings/L15%20Exploding%20and%20Vanishing%20Gradients.pdf) by gradient clipping (by value)
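
As a minimal sketch of what clipping by value means, the snippet below clamps each gradient component into `[-threshold, threshold]` before it would be used in an update. The class name `GradientClippingSketch`, the method names, and the threshold used in `main` are hypothetical; they are not the names or defaults used in this patch.

```java
import java.util.Arrays;

/** Illustrative only: element-wise gradient clipping by value. */
final class GradientClippingSketch {

    /** Clamps a single gradient into [-threshold, threshold]. */
    static double clip(double gradient, double threshold) {
        if (!(threshold > 0.0)) {
            throw new IllegalArgumentException("threshold must be positive: " + threshold);
        }
        return Math.max(-threshold, Math.min(threshold, gradient));
    }

    /** Returns a clipped copy of the gradient vector; the input is not modified. */
    static double[] clip(double[] gradients, double threshold) {
        double[] clipped = new double[gradients.length];
        for (int i = 0; i < gradients.length; i++) {
            clipped[i] = clip(gradients[i], threshold);
        }
        return clipped;
    }

    public static void main(String[] args) {
        double[] g = {0.3, -250.0, 1e6};
        System.out.println(Arrays.toString(clip(g, 10.0)));
    }
}
```

Clipping by value bounds each component independently, so it can change the direction of the gradient vector. Clipping by norm, which rescales the whole vector, is the usual alternative when the direction should be preserved.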

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-222

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/myui/incubator-hivemall clipping

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/169.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #169


commit 0c10392d2a3c96b40df57e6b406333e0a239b9f9
Author: Makoto Yui 
Date:   2018-10-24T08:14:15Z

Updated for debugging purpose

commit e0dc4b954650c6751d6e37ee5ecf6c9656872b16
Author: Makoto Yui 
Date:   2018-10-24T08:15:03Z

Introduced gradient clipping by value to avoid exploding gradients

commit 7e932e99cfd990bb47ff7acfed44c19678fadc8f
Author: Makoto Yui 
Date:   2018-10-24T08:15:52Z

Added a unit test for gradient clipping




---


[GitHub] incubator-hivemall issue #168: Add cache to reduce Maven build time on Travi...

2018-10-23 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/168
  
This does not seem to be working.

Would `timeout: 1000` help?
https://docs.travis-ci.com/user/caching/#setting-the-timeout

Please add `[HIVEMALL-221]` to the PR title.


---


[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-20 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226845079
  
--- Diff: core/src/main/java/hivemall/fm/Feature.java ---
@@ -383,4 +383,10 @@ public static void l2normalize(@Nonnull final Feature[] features) {
 }
 }
 
+@Override
--- End diff --

See https://medium.com/codelog/overriding-hashcode-method-effective-java-notes-723c1fedf51c

Usually, overriding `equals` requires overriding `hashCode` as well, because 
`hashCode` (together with `equals`) is used for `HashMap` key lookup.
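
The point can be sketched with a small key class. `FeatureKeySketch` and its fields are hypothetical stand-ins for illustration, not Hivemall's `Feature` class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Illustrative key class that keeps equals and hashCode consistent. */
final class FeatureKeySketch {
    private final String name;
    private final int field;

    FeatureKeySketch(String name, int field) {
        this.name = Objects.requireNonNull(name);
        this.field = field;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof FeatureKeySketch)) {
            return false;
        }
        FeatureKeySketch other = (FeatureKeySketch) o;
        return field == other.field && name.equals(other.name);
    }

    @Override
    public int hashCode() {
        // must use the same fields as equals so that equal keys share a bucket
        return Objects.hash(name, field);
    }

    public static void main(String[] args) {
        Map<FeatureKeySketch, Double> weights = new HashMap<>();
        weights.put(new FeatureKeySketch("price", 1), 0.5);
        // lookup with an equal but distinct instance
        System.out.println(weights.get(new FeatureKeySketch("price", 1))); // prints 0.5
    }
}
```

Without the `hashCode` override, the second instance would fall back to the identity-based hash inherited from `Object`, and the lookup would most likely return `null` even though `equals` reports the two keys as equal.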


---


[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-19 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226579427
  
--- Diff: core/src/main/java/hivemall/mf/CofactorModel.java ---
@@ -0,0 +1,715 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.mf;
+
+import hivemall.annotations.VisibleForTesting;
+import hivemall.fm.Feature;
+import hivemall.utils.lang.Preconditions;
+import hivemall.utils.math.MathUtils;
+import it.unimi.dsi.fastutil.objects.Object2DoubleArrayMap;
+import it.unimi.dsi.fastutil.objects.Object2DoubleMap;
+import org.apache.commons.math3.linear.ArrayRealVector;
+import org.apache.commons.math3.linear.Array2DRowRealMatrix;
+import org.apache.commons.math3.linear.RealMatrix;
+import org.apache.commons.math3.linear.RealVector;
+import org.apache.commons.math3.linear.SingularValueDecomposition;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Random;
+
+
+public class CofactorModel {
+
+public enum RankInitScheme {
+random /* default */, gaussian;
+
+
+@Nonnegative
+private float maxInitValue;
+@Nonnegative
+private double initStdDev;
+@Nonnull
+public static CofactorModel.RankInitScheme resolve(@Nullable String opt) {
+if (opt == null) {
+return random;
+} else if ("gaussian".equalsIgnoreCase(opt)) {
+return gaussian;
+} else if ("random".equalsIgnoreCase(opt)) {
+return random;
+}
+return random;
+}
+
+public void setMaxInitValue(float maxInitValue) {
+this.maxInitValue = maxInitValue;
+}
+
+public void setInitStdDev(double initStdDev) {
+this.initStdDev = initStdDev;
+}
+
+
+}
+
+@Nonnegative
+private final int factor;
+
+// rank matrix initialization
+private final RankInitScheme initScheme;
+
+@Nonnull
+private double globalBias;
+
+// storing trainable latent factors and weights
+private final Map theta;
+private final Map beta;
+private final Object2DoubleMap betaBias;
+private final Map gamma;
+private final Object2DoubleMap gammaBias;
+
+private final Random[] randU, randI;
+
+// hyperparameters
+private final float c0, c1;
+private final float lambdaTheta, lambdaBeta, lambdaGamma;
+
+// solve
+private final RealMatrix B;
+private final RealVector A;
+
+// error message strings
+private static final String ARRAY_NOT_SQUARE_ERR = "Array is not square";
+private static final String DIFFERENT_DIMS_ERR = "Matrix, vector or array do not match in size";
+
+public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme initScheme,
+ float c0, float c1, float lambdaTheta, float lambdaBeta, float lambdaGamma) {
+
+// rank init scheme is gaussian
+// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98
+this.factor = factor;
+this.initScheme = initScheme;
+this.globalBias = 0.d;
+this.lambdaTheta = lambdaTheta;
+this.lambdaBeta = lambdaBeta;
+this.lambdaGamma = lambdaGamma;
+
+this.theta = new HashMap<>();
+this.beta = new HashMap<>();
+this.betaBias = new Object2DoubleArrayMap<>();
+this.betaBias.defaultReturnValue(0.d)

[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-19 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226578817
  
--- Diff: core/src/main/java/hivemall/mf/CofactorModel.java ---
@@ -0,0 +1,715 @@

[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-19 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226578559
  
--- Diff: core/src/main/java/hivemall/mf/CofactorModel.java ---
@@ -0,0 +1,715 @@

[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-19 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226578495
  
--- Diff: core/src/main/java/hivemall/mf/CofactorModel.java ---
@@ -0,0 +1,715 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.mf;
+
+import hivemall.annotations.VisibleForTesting;
+import hivemall.fm.Feature;
+import hivemall.utils.lang.Preconditions;
+import hivemall.utils.math.MathUtils;
+import it.unimi.dsi.fastutil.objects.Object2DoubleArrayMap;
+import it.unimi.dsi.fastutil.objects.Object2DoubleMap;
+import org.apache.commons.math3.linear.ArrayRealVector;
+import org.apache.commons.math3.linear.Array2DRowRealMatrix;
+import org.apache.commons.math3.linear.RealMatrix;
+import org.apache.commons.math3.linear.RealVector;
+import org.apache.commons.math3.linear.SingularValueDecomposition;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Random;
+
+
+public class CofactorModel {
+
+public enum RankInitScheme {
+random /* default */, gaussian;
+
--- End diff --

Please remove unnecessary line breaks.


---


[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-19 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226579051
  
--- Diff: core/src/main/java/hivemall/mf/CofactorModel.java ---
@@ -0,0 +1,715 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.mf;
+
+import hivemall.annotations.VisibleForTesting;
+import hivemall.fm.Feature;
+import hivemall.utils.lang.Preconditions;
+import hivemall.utils.math.MathUtils;
+import it.unimi.dsi.fastutil.objects.Object2DoubleArrayMap;
+import it.unimi.dsi.fastutil.objects.Object2DoubleMap;
+import org.apache.commons.math3.linear.ArrayRealVector;
+import org.apache.commons.math3.linear.Array2DRowRealMatrix;
+import org.apache.commons.math3.linear.RealMatrix;
+import org.apache.commons.math3.linear.RealVector;
+import org.apache.commons.math3.linear.SingularValueDecomposition;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Random;
+
+
+public class CofactorModel {
+
+public enum RankInitScheme {
+random /* default */, gaussian;
+
+
+@Nonnegative
+private float maxInitValue;
+@Nonnegative
+private double initStdDev;
+@Nonnull
+public static CofactorModel.RankInitScheme resolve(@Nullable String opt) {
+if (opt == null) {
+return random;
+} else if ("gaussian".equalsIgnoreCase(opt)) {
+return gaussian;
+} else if ("random".equalsIgnoreCase(opt)) {
+return random;
+}
+return random;
+}
+
+public void setMaxInitValue(float maxInitValue) {
+this.maxInitValue = maxInitValue;
+}
+
+public void setInitStdDev(double initStdDev) {
+this.initStdDev = initStdDev;
+}
+
+
+}
+
+@Nonnegative
+private final int factor;
+
+// rank matrix initialization
+private final RankInitScheme initScheme;
+
+@Nonnull
+private double globalBias;
+
+// storing trainable latent factors and weights
+private final Map theta;
+private final Map beta;
+private final Object2DoubleMap betaBias;
+private final Map gamma;
+private final Object2DoubleMap gammaBias;
+
+private final Random[] randU, randI;
+
+// hyperparameters
+private final float c0, c1;
+private final float lambdaTheta, lambdaBeta, lambdaGamma;
+
+// solve
+private final RealMatrix B;
+private final RealVector A;
+
+// error message strings
+private static final String ARRAY_NOT_SQUARE_ERR = "Array is not square";
+private static final String DIFFERENT_DIMS_ERR = "Matrix, vector or array do not match in size";
+
+public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme initScheme,
+ float c0, float c1, float lambdaTheta, float lambdaBeta, float lambdaGamma) {
+
+// rank init scheme is gaussian
+// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98
+this.factor = factor;
+this.initScheme = initScheme;
+this.globalBias = 0.d;
+this.lambdaTheta = lambdaTheta;
+this.lambdaBeta = lambdaBeta;
+this.lambdaGamma = lambdaGamma;
+
+this.theta = new HashMap<>();
+this.beta = new HashMap<>();
+this.betaBias = new Object2DoubleArrayMap<>();
+this.betaBias.defaultReturnValue(0.d)

[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-19 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226578854
  
--- Diff: core/src/main/java/hivemall/mf/CofactorModel.java ---
@@ -0,0 +1,715 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.mf;
+
+import hivemall.annotations.VisibleForTesting;
+import hivemall.fm.Feature;
+import hivemall.utils.lang.Preconditions;
+import hivemall.utils.math.MathUtils;
+import it.unimi.dsi.fastutil.objects.Object2DoubleArrayMap;
+import it.unimi.dsi.fastutil.objects.Object2DoubleMap;
+import org.apache.commons.math3.linear.ArrayRealVector;
+import org.apache.commons.math3.linear.Array2DRowRealMatrix;
+import org.apache.commons.math3.linear.RealMatrix;
+import org.apache.commons.math3.linear.RealVector;
+import org.apache.commons.math3.linear.SingularValueDecomposition;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Random;
+
+
+public class CofactorModel {
+
+public enum RankInitScheme {
+random /* default */, gaussian;
+
+
+@Nonnegative
+private float maxInitValue;
+@Nonnegative
+private double initStdDev;
+@Nonnull
+public static CofactorModel.RankInitScheme resolve(@Nullable String opt) {
+if (opt == null) {
+return random;
+} else if ("gaussian".equalsIgnoreCase(opt)) {
+return gaussian;
+} else if ("random".equalsIgnoreCase(opt)) {
+return random;
+}
+return random;
+}
+
+public void setMaxInitValue(float maxInitValue) {
+this.maxInitValue = maxInitValue;
+}
+
+public void setInitStdDev(double initStdDev) {
+this.initStdDev = initStdDev;
+}
+
+
+}
+
+@Nonnegative
+private final int factor;
+
+// rank matrix initialization
+private final RankInitScheme initScheme;
+
+@Nonnull
+private double globalBias;
+
+// storing trainable latent factors and weights
+private final Map theta;
+private final Map beta;
+private final Object2DoubleMap betaBias;
+private final Map gamma;
+private final Object2DoubleMap gammaBias;
+
+private final Random[] randU, randI;
+
+// hyperparameters
+private final float c0, c1;
+private final float lambdaTheta, lambdaBeta, lambdaGamma;
+
+// solve
+private final RealMatrix B;
+private final RealVector A;
+
+// error message strings
+private static final String ARRAY_NOT_SQUARE_ERR = "Array is not square";
+private static final String DIFFERENT_DIMS_ERR = "Matrix, vector or array do not match in size";
+
+public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme initScheme,
+ float c0, float c1, float lambdaTheta, float lambdaBeta, float lambdaGamma) {
+
+// rank init scheme is gaussian
+// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98
+this.factor = factor;
+this.initScheme = initScheme;
+this.globalBias = 0.d;
+this.lambdaTheta = lambdaTheta;
+this.lambdaBeta = lambdaBeta;
+this.lambdaGamma = lambdaGamma;
+
+this.theta = new HashMap<>();
+this.beta = new HashMap<>();
+this.betaBias = new Object2DoubleArrayMap<>();
+this.betaBias.defaultReturnValue(0.d)

[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-19 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226578153
  
--- Diff: core/src/main/java/hivemall/fm/Feature.java ---
@@ -383,4 +383,10 @@ public static void l2normalize(@Nonnull final Feature[] features) {
 }
 }
 
+@Override
--- End diff --

Why is this `equals` method required? I assume it is not used.
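
For context on the question above: a common reason to override `equals` is to use an object as a key in a hash-based collection, in which case `hashCode` has to be overridden consistently. The sketch below illustrates this with a hypothetical `Key` class (an assumption for illustration, not `hivemall.fm.Feature`); the diff excerpt does not show whether the PR relies on this behavior.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class EqualsSketch {

    // Hypothetical key type for illustration only; not hivemall.fm.Feature.
    static final class Key {
        private final String name;

        Key(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof Key)) {
                return false;
            }
            return Objects.equals(name, ((Key) o).name);
        }

        // Must agree with equals so that equal keys land in the same bucket.
        @Override
        public int hashCode() {
            return Objects.hashCode(name);
        }
    }

    // Stores one entry, then looks it up with a distinct Key instance.
    // Returns null when no equal key is present.
    static Double lookup(String stored, String probe) {
        Map<Key, Double> weights = new HashMap<>();
        weights.put(new Key(stored), 1.0d);
        return weights.get(new Key(probe));
    }

    public static void main(String[] args) {
        System.out.println(lookup("item1", "item1"));
        System.out.println(lookup("item1", "item2"));
    }
}
```

If the class is never used as a map key or compared by value, the override is dead code and can be dropped.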


---


[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-18 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226525857
  
--- Diff: core/src/main/java/hivemall/mf/CofactorizationUDTF.java ---
@@ -0,0 +1,574 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.mf;
+
+import hivemall.UDTFWithOptions;
+import hivemall.common.ConversionState;
+import hivemall.fm.Feature;
+import hivemall.fm.StringFeature;
+import hivemall.utils.hadoop.HiveUtils;
+import hivemall.utils.io.FileUtils;
+import hivemall.utils.io.NioStatefulSegment;
+import hivemall.utils.lang.NumberUtils;
+import hivemall.utils.lang.Primitives;
+import hivemall.utils.lang.SizeOf;
+import org.apache.commons.cli.CommandLine;
+import org.apache.commons.cli.Options;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.serde2.objectinspector.*;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.BooleanObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
+import org.apache.hadoop.mapred.Counters;
+import org.apache.hadoop.mapred.Reporter;
+
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+
+import static hivemall.utils.lang.Primitives.FALSE_BYTE;
+import static hivemall.utils.lang.Primitives.TRUE_BYTE;
+
+public class CofactorizationUDTF extends UDTFWithOptions {
+private static final Log LOG = LogFactory.getLog(CofactorizationUDTF.class);
+
+// Option variables
+// The number of latent factors
+protected int factor;
+// The scaling hyperparameter for zero entries in the rank matrix
+protected float scale_zero;
+// The scaling hyperparameter for non-zero entries in the rank matrix
+protected float scale_nonzero;
+// The preferred size of the miniBatch for training
+protected int batchSize;
+// The initial mean rating
+protected float globalBias;
+// Whether update (and return) the mean rating or not
+protected boolean updateGlobalBias;
+// The number of iterations
+protected int maxIters;
+// Whether to use bias clause
+protected boolean useBiasClause;
+// Whether to use normalization
+protected boolean useL2Norm;
+// regularization hyperparameters
+protected float lambdaTheta;
+protected float lambdaBeta;
+protected float lambdaGamma;
+
+// Initialization strategy of rank matrix
+protected CofactorModel.RankInitScheme rankInit;
+
+// Model itself
+protected CofactorModel model;
+protected int numItems;
+
+// Variable managing status of learning
+
+// The number of processed training examples
+protected long count;
+
+protected ConversionState cvState;
+private ConversionState validationState;
+
+// Input OIs and Context
+protected StringObjectInspector contextOI;
+protected ListObjectInspector featuresOI;
+protected BooleanObjectInspector isItemOI;
+protected ListObjectInspector sppmiOI;
+
+// Used for iterations
+protected NioStatefulSegment fileIO;
+protected ByteBuffer inputBuf;
+private long lastWritePos;
+
+private Feature contextProbe;
+private Feature[] featuresProbe;
+private Feature[] sppmiProbe;
+private boolean isItemProbe;
+private long numValidations;
+private long numTraining;
+
 

[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-18 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226247247
  
--- Diff: core/src/main/java/hivemall/mf/CofactorModel.java ---
@@ -0,0 +1,640 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.mf;
+
+import hivemall.fm.Feature;
+import hivemall.utils.math.MathUtils;
+import hivemall.utils.math.MatrixUtils;
+import it.unimi.dsi.fastutil.objects.Object2DoubleArrayMap;
+import it.unimi.dsi.fastutil.objects.Object2DoubleMap;
+import org.apache.commons.math3.linear.ArrayRealVector;
+import org.apache.commons.math3.linear.Array2DRowRealMatrix;
+import org.apache.commons.math3.linear.RealMatrix;
+import org.apache.commons.math3.linear.RealVector;
+import org.apache.commons.math3.linear.SingularValueDecomposition;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.util.*;
+
+public class CofactorModel {
+
+public enum RankInitScheme {
+random /* default */, gaussian;
+
+@Nonnegative
+protected float maxInitValue;
+@Nonnegative
+protected double initStdDev;
+
+@Nonnull
+public static CofactorModel.RankInitScheme resolve(@Nullable String opt) {
+if (opt == null) {
+return random;
+} else if ("gaussian".equalsIgnoreCase(opt)) {
+return gaussian;
+} else if ("random".equalsIgnoreCase(opt)) {
+return random;
+}
+return random;
+}
+
+public void setMaxInitValue(float maxInitValue) {
+this.maxInitValue = maxInitValue;
+}
+
+public void setInitStdDev(double initStdDev) {
+this.initStdDev = initStdDev;
+}
+
+}
+
+private static final int EXPECTED_SIZE = 136861;
+@Nonnegative
+protected final int factor;
+
+// rank matrix initialization
+protected final RankInitScheme initScheme;
+
+@Nonnull
+private double globalBias;
+
+// storing trainable latent factors and weights
+private Map theta;
+private Map beta;
+private Object2DoubleMap betaBias;
+private Map gamma;
+private Object2DoubleMap gammaBias;
+
+// precomputed identity matrix
+private RealMatrix identity;
+
+protected final Random[] randU, randI;
+
+// hyperparameters
+private final float c0, c1;
+private final float lambdaTheta, lambdaBeta, lambdaGamma;
+
+public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme initScheme,
+ @Nonnull float c0, @Nonnull float c1, float lambdaTheta,
+ float lambdaBeta, float lambdaGamma) {
+
+// rank init scheme is gaussian
+// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98
+this.factor = factor;
+this.initScheme = initScheme;
+this.globalBias = 0.d;
+this.lambdaTheta = lambdaTheta;
+this.lambdaBeta = lambdaBeta;
+this.lambdaGamma = lambdaGamma;
+
+this.theta = new HashMap<>();
+this.beta = new HashMap<>();
+this.betaBias = new Object2DoubleArrayMap<>();
+this.gamma = new HashMap<>();
+this.gammaBias = new Object2DoubleArrayMap<>();
+
+this.randU = newRandoms(factor, 31L);
+this.randI = newRandoms(factor, 41L);
+
+checkHyperparameterC(c0);
+checkHyperparameterC(c1);
+this.c0 = c0;
+this.c1 = c1;
+
+}
+
+private void initFactorVector(String key, Map 
weights) 

[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-18 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226243032
  
--- Diff: core/src/main/java/hivemall/mf/CofactorModel.java ---
@@ -0,0 +1,638 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.mf;
+
+import hivemall.fm.Feature;
+import hivemall.utils.math.MathUtils;
+import hivemall.utils.math.MatrixUtils;
+import org.apache.commons.math3.linear.ArrayRealVector;
+import org.apache.commons.math3.linear.Array2DRowRealMatrix;
+import org.apache.commons.math3.linear.RealMatrix;
+import org.apache.commons.math3.linear.RealVector;
+import org.apache.commons.math3.linear.SingularValueDecomposition;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.util.*;
+
+public class CofactorModel {
+
+public enum RankInitScheme {
+random /* default */, gaussian;
+
+@Nonnegative
+protected float maxInitValue;
+@Nonnegative
+protected double initStdDev;
+
+@Nonnull
+public static CofactorModel.RankInitScheme resolve(@Nullable String opt) {
+if (opt == null) {
+return random;
+} else if ("gaussian".equalsIgnoreCase(opt)) {
+return gaussian;
+} else if ("random".equalsIgnoreCase(opt)) {
+return random;
+}
+return random;
+}
+
+public void setMaxInitValue(float maxInitValue) {
+this.maxInitValue = maxInitValue;
+}
+
+public void setInitStdDev(double initStdDev) {
+this.initStdDev = initStdDev;
+}
+
+}
+
+private static final int EXPECTED_SIZE = 136861;
+@Nonnegative
+protected final int factor;
+
+// rank matrix initialization
+protected final RankInitScheme initScheme;
+
+@Nonnull
+private double globalBias;
+
+// storing trainable latent factors and weights
+private Map theta;
+private Map beta;
+private Map betaBias;
+private Map gamma;
+private Map gammaBias;
+
+// precomputed identity matrix
+private RealMatrix identity;
+
+protected final Random[] randU, randI;
+
+// hyperparameters
+private final float c0, c1;
+private final float lambdaTheta, lambdaBeta, lambdaGamma;
+
+public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme initScheme,
+ @Nonnull float c0, @Nonnull float c1, float lambdaTheta,
+ float lambdaBeta, float lambdaGamma) {
+
+// rank init scheme is gaussian
+// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98
+this.factor = factor;
+this.initScheme = initScheme;
+this.globalBias = 0.d;
+this.lambdaTheta = lambdaTheta;
+this.lambdaBeta = lambdaBeta;
+this.lambdaGamma = lambdaGamma;
+
+this.theta = new HashMap<>();
+this.beta = new HashMap<>();
+this.betaBias = new HashMap<>();
+this.gamma = new HashMap<>();
+this.gammaBias = new HashMap<>();
+
+this.randU = newRandoms(factor, 31L);
+this.randI = newRandoms(factor, 41L);
+
+checkHyperparameterC(c0);
+checkHyperparameterC(c1);
+this.c0 = c0;
+this.c1 = c1;
+
+}
+
+private void initFactorVector(String key, Map weights) {
+if (weights.containsKey(key)) {
+return;
+}
+RealVector v = new ArrayRealVector(factor);
+switch (initScheme) {
+case random:

[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-18 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226241124
  
--- Diff: core/src/main/java/hivemall/mf/CofactorizationUDTF.java ---
@@ -0,0 +1,574 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.mf;
+
+import hivemall.UDTFWithOptions;
+import hivemall.common.ConversionState;
+import hivemall.fm.Feature;
+import hivemall.fm.StringFeature;
+import hivemall.utils.hadoop.HiveUtils;
+import hivemall.utils.io.FileUtils;
+import hivemall.utils.io.NioStatefulSegment;
+import hivemall.utils.lang.NumberUtils;
+import hivemall.utils.lang.Primitives;
+import hivemall.utils.lang.SizeOf;
+import org.apache.commons.cli.CommandLine;
+import org.apache.commons.cli.Options;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.serde2.objectinspector.*;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.BooleanObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
+import org.apache.hadoop.mapred.Counters;
+import org.apache.hadoop.mapred.Reporter;
+
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+
+import static hivemall.utils.lang.Primitives.FALSE_BYTE;
+import static hivemall.utils.lang.Primitives.TRUE_BYTE;
+
+public class CofactorizationUDTF extends UDTFWithOptions {
+private static final Log LOG = LogFactory.getLog(CofactorizationUDTF.class);
+
+// Option variables
+// The number of latent factors
+protected int factor;
+// The scaling hyperparameter for zero entries in the rank matrix
+protected float scale_zero;
+// The scaling hyperparameter for non-zero entries in the rank matrix
+protected float scale_nonzero;
+// The preferred size of the miniBatch for training
+protected int batchSize;
+// The initial mean rating
+protected float globalBias;
+// Whether update (and return) the mean rating or not
+protected boolean updateGlobalBias;
+// The number of iterations
+protected int maxIters;
+// Whether to use bias clause
+protected boolean useBiasClause;
+// Whether to use normalization
+protected boolean useL2Norm;
+// regularization hyperparameters
+protected float lambdaTheta;
+protected float lambdaBeta;
+protected float lambdaGamma;
+
+// Initialization strategy of rank matrix
+protected CofactorModel.RankInitScheme rankInit;
+
+// Model itself
+protected CofactorModel model;
+protected int numItems;
+
+// Variable managing status of learning
+
+// The number of processed training examples
+protected long count;
+
+protected ConversionState cvState;
+private ConversionState validationState;
+
+// Input OIs and Context
+protected StringObjectInspector contextOI;
+protected ListObjectInspector featuresOI;
+protected BooleanObjectInspector isItemOI;
+protected ListObjectInspector sppmiOI;
+
+// Used for iterations
+protected NioStatefulSegment fileIO;
+protected ByteBuffer inputBuf;
+private long lastWritePos;
+
+private Feature contextProbe;
+private Feature[] featuresProbe;
+private Feature[] sppmiProbe;
+private boolean isItemProbe;
+private long numValidations;
+private long numTraining;
+
 

[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-18 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226237654
  
--- Diff: core/src/main/java/hivemall/mf/CofactorModel.java ---
@@ -0,0 +1,629 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.mf;
+
+import hivemall.fm.Feature;
+import hivemall.utils.math.MathUtils;
+import hivemall.utils.math.MatrixUtils;
+import org.apache.commons.math3.linear.ArrayRealVector;
+import org.apache.commons.math3.linear.Array2DRowRealMatrix;
+import org.apache.commons.math3.linear.RealMatrix;
+import org.apache.commons.math3.linear.RealVector;
+import org.apache.commons.math3.linear.SingularValueDecomposition;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.util.*;
+
+public class CofactorModel {
+
+public enum RankInitScheme {
+random /* default */, gaussian;
+
+@Nonnegative
+protected float maxInitValue;
+@Nonnegative
+protected double initStdDev;
+
+@Nonnull
+public static CofactorModel.RankInitScheme resolve(@Nullable String opt) {
+if (opt == null) {
+return random;
+} else if ("gaussian".equalsIgnoreCase(opt)) {
+return gaussian;
+} else if ("random".equalsIgnoreCase(opt)) {
+return random;
+}
+return random;
+}
+
+public void setMaxInitValue(float maxInitValue) {
+this.maxInitValue = maxInitValue;
+}
+
+public void setInitStdDev(double initStdDev) {
+this.initStdDev = initStdDev;
+}
+
+}
+
+private static final int EXPECTED_SIZE = 136861;
+@Nonnegative
+protected final int factor;
+
+// rank matrix initialization
+protected final RankInitScheme initScheme;
+
+@Nonnull
+private double globalBias;
+
+// storing trainable latent factors and weights
+private Map theta;
+private Map beta;
+private Map betaBias;
+private Map gamma;
+private Map gammaBias;
+
+// precomputed identity matrix
+private RealMatrix identity;
+
+protected final Random[] randU, randI;
+
+// hyperparameters
+private final float c0, c1;
+private final float lambdaTheta, lambdaBeta, lambdaGamma;
+
+public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme initScheme,
+ @Nonnull float c0, @Nonnull float c1, float lambdaTheta,
+ float lambdaBeta, float lambdaGamma) {
+
+// rank init scheme is gaussian
+// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98
+this.factor = factor;
+this.initScheme = initScheme;
+this.globalBias = 0.d;
+this.lambdaTheta = lambdaTheta;
+this.lambdaBeta = lambdaBeta;
+this.lambdaGamma = lambdaGamma;
+
+this.theta = new HashMap<>();
+this.beta = new HashMap<>();
+this.betaBias = new HashMap<>();
+this.gamma = new HashMap<>();
+this.gammaBias = new HashMap<>();
+
+this.randU = newRandoms(factor, 31L);
+this.randI = newRandoms(factor, 41L);
+
+checkHyperparameterC(c0);
+checkHyperparameterC(c1);
+this.c0 = c0;
+this.c1 = c1;
+
+}
+
+private void initFactorVector(String key, Map weights) {
+if (weights.containsKey(key)) {
+return;
+}
+RealVector v = new ArrayRealVector(factor);
+switch (initScheme) {
+case random:

[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-18 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226239653
  
--- Diff: core/src/main/java/hivemall/mf/CofactorModel.java ---
@@ -0,0 +1,629 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.mf;
+
+import hivemall.fm.Feature;
+import hivemall.utils.math.MathUtils;
+import hivemall.utils.math.MatrixUtils;
+import org.apache.commons.math3.linear.ArrayRealVector;
+import org.apache.commons.math3.linear.Array2DRowRealMatrix;
+import org.apache.commons.math3.linear.RealMatrix;
+import org.apache.commons.math3.linear.RealVector;
+import org.apache.commons.math3.linear.SingularValueDecomposition;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.util.*;
+
+public class CofactorModel {
+
+public enum RankInitScheme {
+random /* default */, gaussian;
+
+@Nonnegative
+protected float maxInitValue;
+@Nonnegative
+protected double initStdDev;
+
+@Nonnull
+public static CofactorModel.RankInitScheme resolve(@Nullable String opt) {
+if (opt == null) {
+return random;
+} else if ("gaussian".equalsIgnoreCase(opt)) {
+return gaussian;
+} else if ("random".equalsIgnoreCase(opt)) {
+return random;
+}
+return random;
+}
+
+public void setMaxInitValue(float maxInitValue) {
+this.maxInitValue = maxInitValue;
+}
+
+public void setInitStdDev(double initStdDev) {
+this.initStdDev = initStdDev;
+}
+
+}
+
+private static final int EXPECTED_SIZE = 136861;
+@Nonnegative
+protected final int factor;
+
+// rank matrix initialization
+protected final RankInitScheme initScheme;
+
+@Nonnull
+private double globalBias;
+
+// storing trainable latent factors and weights
+private Map theta;
+private Map beta;
+private Map betaBias;
+private Map gamma;
+private Map gammaBias;
+
+// precomputed identity matrix
+private RealMatrix identity;
+
+protected final Random[] randU, randI;
+
+// hyperparameters
+private final float c0, c1;
+private final float lambdaTheta, lambdaBeta, lambdaGamma;
+
+public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme 
initScheme,
+ @Nonnull float c0, @Nonnull float c1, float 
lambdaTheta,
+ float lambdaBeta, float lambdaGamma) {
+
+// rank init scheme is gaussian
+// 
https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98
+this.factor = factor;
+this.initScheme = initScheme;
+this.globalBias = 0.d;
+this.lambdaTheta = lambdaTheta;
+this.lambdaBeta = lambdaBeta;
+this.lambdaGamma = lambdaGamma;
+
+this.theta = new HashMap<>();
+this.beta = new HashMap<>();
+this.betaBias = new HashMap<>();
+this.gamma = new HashMap<>();
+this.gammaBias = new HashMap<>();
+
+this.randU = newRandoms(factor, 31L);
+this.randI = newRandoms(factor, 41L);
+
+checkHyperparameterC(c0);
+checkHyperparameterC(c1);
+this.c0 = c0;
+this.c1 = c1;
+
+}
+
+private void initFactorVector(String key, Map 
weights) {
+if (weights.containsKey(key)) {
+return;
+}
+RealVector v = new ArrayRealVector(factor);
+switch (initScheme) {
+case random:

[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-18 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226239017
  
--- Diff: core/src/main/java/hivemall/mf/CofactorModel.java ---
@@ -0,0 +1,629 @@
+package hivemall.mf;
+
+import hivemall.fm.Feature;
+import hivemall.utils.math.MathUtils;
+import hivemall.utils.math.MatrixUtils;
+import org.apache.commons.math3.linear.ArrayRealVector;
+import org.apache.commons.math3.linear.Array2DRowRealMatrix;
+import org.apache.commons.math3.linear.RealMatrix;
+import org.apache.commons.math3.linear.RealVector;
+import org.apache.commons.math3.linear.SingularValueDecomposition;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.util.*;
+
+public class CofactorModel {
+
+public enum RankInitScheme {
+random /* default */, gaussian;
+
+@Nonnegative
+protected float maxInitValue;
+@Nonnegative
+protected double initStdDev;
+
+@Nonnull
+public static CofactorModel.RankInitScheme resolve(@Nullable 
String opt) {
+if (opt == null) {
+return random;
+} else if ("gaussian".equalsIgnoreCase(opt)) {
+return gaussian;
+} else if ("random".equalsIgnoreCase(opt)) {
+return random;
+}
+return random;
+}
+
+public void setMaxInitValue(float maxInitValue) {
+this.maxInitValue = maxInitValue;
+}
+
+public void setInitStdDev(double initStdDev) {
+this.initStdDev = initStdDev;
+}
+
+}
+
+private static final int EXPECTED_SIZE = 136861;
+@Nonnegative
+protected final int factor;
+
+// rank matrix initialization
+protected final RankInitScheme initScheme;
+
+@Nonnull
+private double globalBias;
+
+// storing trainable latent factors and weights
+private Map theta;
+private Map beta;
+private Map betaBias;
+private Map gamma;
+private Map gammaBias;
+
+// precomputed identity matrix
+private RealMatrix identity;
+
+protected final Random[] randU, randI;
+
+// hyperparameters
+private final float c0, c1;
+private final float lambdaTheta, lambdaBeta, lambdaGamma;
+
+public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme 
initScheme,
+ @Nonnull float c0, @Nonnull float c1, float 
lambdaTheta,
+ float lambdaBeta, float lambdaGamma) {
+
+// rank init scheme is gaussian
+// 
https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98
+this.factor = factor;
+this.initScheme = initScheme;
+this.globalBias = 0.d;
+this.lambdaTheta = lambdaTheta;
+this.lambdaBeta = lambdaBeta;
+this.lambdaGamma = lambdaGamma;
+
+this.theta = new HashMap<>();
+this.beta = new HashMap<>();
+this.betaBias = new HashMap<>();
+this.gamma = new HashMap<>();
+this.gammaBias = new HashMap<>();
+
+this.randU = newRandoms(factor, 31L);
+this.randI = newRandoms(factor, 41L);
+
+checkHyperparameterC(c0);
+checkHyperparameterC(c1);
+this.c0 = c0;
+this.c1 = c1;
+
+}
+
+private void initFactorVector(String key, Map 
weights) {
+if (weights.containsKey(key)) {
+return;
+}
+RealVector v = new ArrayRealVector(factor);
+switch (initScheme) {
+case random:

[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-18 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226204201
  
--- Diff: core/src/main/java/hivemall/mf/CofactorModel.java ---
@@ -0,0 +1,629 @@
+package hivemall.mf;
+
+import hivemall.fm.Feature;
+import hivemall.utils.math.MathUtils;
+import hivemall.utils.math.MatrixUtils;
+import org.apache.commons.math3.linear.ArrayRealVector;
+import org.apache.commons.math3.linear.Array2DRowRealMatrix;
+import org.apache.commons.math3.linear.RealMatrix;
+import org.apache.commons.math3.linear.RealVector;
+import org.apache.commons.math3.linear.SingularValueDecomposition;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.util.*;
+
+public class CofactorModel {
+
+public enum RankInitScheme {
+random /* default */, gaussian;
+
+@Nonnegative
+protected float maxInitValue;
+@Nonnegative
+protected double initStdDev;
+
+@Nonnull
+public static CofactorModel.RankInitScheme resolve(@Nullable 
String opt) {
+if (opt == null) {
+return random;
+} else if ("gaussian".equalsIgnoreCase(opt)) {
+return gaussian;
+} else if ("random".equalsIgnoreCase(opt)) {
+return random;
+}
+return random;
+}
+
+public void setMaxInitValue(float maxInitValue) {
+this.maxInitValue = maxInitValue;
+}
+
+public void setInitStdDev(double initStdDev) {
+this.initStdDev = initStdDev;
+}
+
+}
+
+private static final int EXPECTED_SIZE = 136861;
+@Nonnegative
+protected final int factor;
+
+// rank matrix initialization
+protected final RankInitScheme initScheme;
+
+@Nonnull
+private double globalBias;
+
+// storing trainable latent factors and weights
+private Map theta;
+private Map beta;
+private Map betaBias;
+private Map gamma;
+private Map gammaBias;
+
+// precomputed identity matrix
+private RealMatrix identity;
+
+protected final Random[] randU, randI;
+
+// hyperparameters
+private final float c0, c1;
+private final float lambdaTheta, lambdaBeta, lambdaGamma;
+
+public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme 
initScheme,
+ @Nonnull float c0, @Nonnull float c1, float 
lambdaTheta,
+ float lambdaBeta, float lambdaGamma) {
+
+// rank init scheme is gaussian
+// 
https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98
+this.factor = factor;
+this.initScheme = initScheme;
+this.globalBias = 0.d;
+this.lambdaTheta = lambdaTheta;
+this.lambdaBeta = lambdaBeta;
+this.lambdaGamma = lambdaGamma;
+
+this.theta = new HashMap<>();
+this.beta = new HashMap<>();
+this.betaBias = new HashMap<>();
+this.gamma = new HashMap<>();
+this.gammaBias = new HashMap<>();
+
+this.randU = newRandoms(factor, 31L);
+this.randI = newRandoms(factor, 41L);
+
+checkHyperparameterC(c0);
+checkHyperparameterC(c1);
+this.c0 = c0;
+this.c1 = c1;
+
+}
+
+private void initFactorVector(String key, Map 
weights) {
+if (weights.containsKey(key)) {
+return;
+}
+RealVector v = new ArrayRealVector(factor);
--- End diff --

```
final double[] v =

[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-18 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226202891
  
--- Diff: core/src/main/java/hivemall/mf/CofactorModel.java ---
@@ -0,0 +1,629 @@
+package hivemall.mf;
+
+import hivemall.fm.Feature;
+import hivemall.utils.math.MathUtils;
+import hivemall.utils.math.MatrixUtils;
+import org.apache.commons.math3.linear.ArrayRealVector;
+import org.apache.commons.math3.linear.Array2DRowRealMatrix;
+import org.apache.commons.math3.linear.RealMatrix;
+import org.apache.commons.math3.linear.RealVector;
+import org.apache.commons.math3.linear.SingularValueDecomposition;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.util.*;
+
+public class CofactorModel {
+
+public enum RankInitScheme {
+random /* default */, gaussian;
+
+@Nonnegative
+protected float maxInitValue;
+@Nonnegative
+protected double initStdDev;
+
+@Nonnull
+public static CofactorModel.RankInitScheme resolve(@Nullable 
String opt) {
+if (opt == null) {
+return random;
+} else if ("gaussian".equalsIgnoreCase(opt)) {
+return gaussian;
+} else if ("random".equalsIgnoreCase(opt)) {
+return random;
+}
+return random;
+}
+
+public void setMaxInitValue(float maxInitValue) {
+this.maxInitValue = maxInitValue;
+}
+
+public void setInitStdDev(double initStdDev) {
+this.initStdDev = initStdDev;
+}
+
+}
+
+private static final int EXPECTED_SIZE = 136861;
+@Nonnegative
+protected final int factor;
+
+// rank matrix initialization
+protected final RankInitScheme initScheme;
+
+@Nonnull
+private double globalBias;
+
+// storing trainable latent factors and weights
+private Map theta;
+private Map beta;
+private Map betaBias;
+private Map gamma;
+private Map gammaBias;
+
+// precomputed identity matrix
+private RealMatrix identity;
+
+protected final Random[] randU, randI;
+
+// hyperparameters
+private final float c0, c1;
+private final float lambdaTheta, lambdaBeta, lambdaGamma;
+
+public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme 
initScheme,
+ @Nonnull float c0, @Nonnull float c1, float 
lambdaTheta,
+ float lambdaBeta, float lambdaGamma) {
+
+// rank init scheme is gaussian
+// 
https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98
+this.factor = factor;
+this.initScheme = initScheme;
+this.globalBias = 0.d;
+this.lambdaTheta = lambdaTheta;
+this.lambdaBeta = lambdaBeta;
+this.lambdaGamma = lambdaGamma;
+
+this.theta = new HashMap<>();
+this.beta = new HashMap<>();
+this.betaBias = new HashMap<>();
+this.gamma = new HashMap<>();
+this.gammaBias = new HashMap<>();
+
+this.randU = newRandoms(factor, 31L);
+this.randI = newRandoms(factor, 41L);
+
+checkHyperparameterC(c0);
+checkHyperparameterC(c1);
+this.c0 = c0;
+this.c1 = c1;
+
+}
+
+private void initFactorVector(String key, Map 
weights) {
+if (weights.containsKey(key)) {
+return;
+}
+RealVector v = new ArrayRealVector(factor);
+switch (initScheme) {
+case random:

[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-18 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226198983
  
--- Diff: core/src/main/java/hivemall/mf/FactorizedModel.java ---
@@ -30,25 +30,25 @@
 import javax.annotation.concurrent.NotThreadSafe;
 
 @NotThreadSafe
-public final class FactorizedModel {
+public class FactorizedModel {
--- End diff --

It seems FactorizedModel is not used in Cofactor. 

Is this change required? Revert if not used.


---


[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-18 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226199747
  
--- Diff: core/src/main/java/hivemall/mf/CofactorModel.java ---
@@ -0,0 +1,629 @@
+package hivemall.mf;
+
+import hivemall.fm.Feature;
+import hivemall.utils.math.MathUtils;
+import hivemall.utils.math.MatrixUtils;
+import org.apache.commons.math3.linear.ArrayRealVector;
+import org.apache.commons.math3.linear.Array2DRowRealMatrix;
+import org.apache.commons.math3.linear.RealMatrix;
+import org.apache.commons.math3.linear.RealVector;
+import org.apache.commons.math3.linear.SingularValueDecomposition;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.util.*;
+
+public class CofactorModel {
+
+public enum RankInitScheme {
+random /* default */, gaussian;
+
+@Nonnegative
+protected float maxInitValue;
+@Nonnegative
+protected double initStdDev;
+
+@Nonnull
+public static CofactorModel.RankInitScheme resolve(@Nullable 
String opt) {
+if (opt == null) {
+return random;
+} else if ("gaussian".equalsIgnoreCase(opt)) {
+return gaussian;
+} else if ("random".equalsIgnoreCase(opt)) {
+return random;
+}
+return random;
+}
+
+public void setMaxInitValue(float maxInitValue) {
+this.maxInitValue = maxInitValue;
+}
+
+public void setInitStdDev(double initStdDev) {
+this.initStdDev = initStdDev;
+}
+
+}
+
+private static final int EXPECTED_SIZE = 136861;
+@Nonnegative
+protected final int factor;
+
+// rank matrix initialization
+protected final RankInitScheme initScheme;
+
+@Nonnull
+private double globalBias;
+
+// storing trainable latent factors and weights
+private Map theta;
+private Map beta;
+private Map betaBias;
+private Map gamma;
+private Map gammaBias;
--- End diff --

Please use `Object2DoubleMap gammaBias` instead to reduce memory 
consumption.
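
To illustrate the memory argument: a `Map` with boxed `Double` values allocates one wrapper object per entry, while a primitive-valued map keeps the values in a plain `double[]`. The sketch below is a minimal, hand-rolled stand-in written only for illustration. The names `StringToDoubleMap` and `BiasMapDemo` are hypothetical, and this is not the `Object2DoubleMap` implementation referred to above (which library it comes from is not stated in this thread).

```java
// Hypothetical stand-in for a primitive-valued map; NOT the class named in the review.
final class StringToDoubleMap {
    private String[] keys = new String[16]; // capacity stays a power of two
    private double[] values = new double[16]; // primitive storage, no boxing
    private int size = 0;

    // Returns the stored value, or defaultValue when the key is absent.
    double get(String key, double defaultValue) {
        int i = find(key);
        return keys[i] == null ? defaultValue : values[i];
    }

    // Inserts or overwrites a value without allocating a java.lang.Double.
    void put(String key, double value) {
        if ((size + 1) * 2 > keys.length) {
            resize(); // keep the table at most half full so probing terminates
        }
        int i = find(key);
        if (keys[i] == null) {
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    int size() {
        return size;
    }

    // Linear probing: index of the key, or of the first empty slot.
    private int find(String key) {
        int mask = keys.length - 1;
        int i = key.hashCode() & mask;
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) & mask;
        }
        return i;
    }

    // Doubles the capacity and re-inserts every existing entry.
    private void resize() {
        String[] oldKeys = keys;
        double[] oldValues = values;
        keys = new String[oldKeys.length * 2];
        values = new double[oldValues.length * 2];
        size = 0;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != null) {
                put(oldKeys[j], oldValues[j]);
            }
        }
    }
}

final class BiasMapDemo {
    public static void main(String[] args) {
        StringToDoubleMap betaBias = new StringToDoubleMap();
        betaBias.put("item1", 0.25);
        betaBias.put("item1", 0.5); // overwrite, still one entry
        System.out.println(betaBias.get("item1", 0.0)); // prints 0.5
        System.out.println(betaBias.get("unknown", 0.0)); // prints 0.0
    }
}
```

In practice an existing, tested primitive-collection class is preferable to a hand-written one; the sketch only shows where the saving comes from.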


---


[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor

2018-10-18 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/167#discussion_r226199666
  
--- Diff: core/src/main/java/hivemall/mf/CofactorModel.java ---
@@ -0,0 +1,629 @@
+package hivemall.mf;
+
+import hivemall.fm.Feature;
+import hivemall.utils.math.MathUtils;
+import hivemall.utils.math.MatrixUtils;
+import org.apache.commons.math3.linear.ArrayRealVector;
+import org.apache.commons.math3.linear.Array2DRowRealMatrix;
+import org.apache.commons.math3.linear.RealMatrix;
+import org.apache.commons.math3.linear.RealVector;
+import org.apache.commons.math3.linear.SingularValueDecomposition;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import java.util.*;
+
+public class CofactorModel {
+
+public enum RankInitScheme {
+random /* default */, gaussian;
+
+@Nonnegative
+protected float maxInitValue;
+@Nonnegative
+protected double initStdDev;
+
+@Nonnull
+public static CofactorModel.RankInitScheme resolve(@Nullable 
String opt) {
+if (opt == null) {
+return random;
+} else if ("gaussian".equalsIgnoreCase(opt)) {
+return gaussian;
+} else if ("random".equalsIgnoreCase(opt)) {
+return random;
+}
+return random;
+}
+
+public void setMaxInitValue(float maxInitValue) {
+this.maxInitValue = maxInitValue;
+}
+
+public void setInitStdDev(double initStdDev) {
+this.initStdDev = initStdDev;
+}
+
+}
+
+private static final int EXPECTED_SIZE = 136861;
+@Nonnegative
+protected final int factor;
+
+// rank matrix initialization
+protected final RankInitScheme initScheme;
+
+@Nonnull
+private double globalBias;
+
+// storing trainable latent factors and weights
+private Map theta;
+private Map beta;
+private Map betaBias;
--- End diff --

Please use `Object2DoubleMap betaBias` instead to reduce memory 
consumption.


---


[GitHub] incubator-hivemall pull request #166: [HIVEMALL-219] Fixed LDA bug for singl...

2018-09-18 Thread myui
GitHub user myui opened a pull request:

https://github.com/apache/incubator-hivemall/pull/166

[HIVEMALL-219] Fixed LDA bug for single update and added unit tests

## What changes were proposed in this pull request?

Fixed LDA bug for single update and added unit tests

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-219

## How was this patch tested?

unit tests and manual tests on EMR

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, 
for your commit?
- [x] Did you run system tests on Hive (or Spark)?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/myui/incubator-hivemall HIVEMALL-219-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/166.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #166


commit 202eddd71c00e3889c0a126fe1038df35c1513d9
Author: Makoto Yui 
Date:   2018-09-18T10:36:02Z

Fixed LDA bug for single update and added unit tests




---


[GitHub] incubator-hivemall pull request #165: [HIVEMALL-219][BUGFIX] Fixed NPE in fi...

2018-09-18 Thread myui
GitHub user myui opened a pull request:

https://github.com/apache/incubator-hivemall/pull/165

[HIVEMALL-219][BUGFIX] Fixed NPE in finalizeTraining()

## What changes were proposed in this pull request?

Fixed NPE in finalizeTraining() when there are no training examples

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-219

## How was this patch tested?

to appear 

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, 
for your commit?
- [ ] Did you run system tests on Hive (or Spark)?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/myui/incubator-hivemall HIVEMALL-219

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/165.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #165


commit bc0e14d1d29ba13b173165bca9d9511b19abbc6e
Author: Makoto Yui 
Date:   2018-09-18T09:42:06Z

Fixed NPE in finalizeTraining()




---


[GitHub] incubator-hivemall pull request #164: [HIVEMALL-218] Fixed train_lda NPE whe...

2018-09-07 Thread myui
GitHub user myui opened a pull request:

https://github.com/apache/incubator-hivemall/pull/164

[HIVEMALL-218] Fixed train_lda NPE where input row is null

## What changes were proposed in this pull request?

Fixed NegativeArraySizeException in `train_lda` when the input is NULL

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-218

## How was this patch tested?

manual tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, 
for your commit?
- [x] Did you run system tests on Hive (or Spark)?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/myui/incubator-hivemall HIVEMALL-218

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/164.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #164


commit 67f6f68acad09c7a0e70f9fbdb183116eeec6a1d
Author: Makoto Yui 
Date:   2018-09-07T08:56:43Z

Fixed NegativeArraySizeException where input is NULL

commit d367de34e34d42514c0bb6141fbf31f295e33e50
Author: Makoto Yui 
Date:   2018-09-07T09:15:05Z

Fixed NPE in forward()




---


[GitHub] incubator-hivemall issue #163: [HIVEMALL-196][WIP] Support BM25 scoring

2018-09-06 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/163
  
Please add a unit test and evaluate this function in a Hive environment.


---


[GitHub] incubator-hivemall pull request #163: [HIVEMALL-196][WIP] Support BM25 scori...

2018-09-06 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/163#discussion_r215564184
  
--- Diff: core/src/main/java/hivemall/ftvec/text/OkapiBM25UDF.java ---
@@ -0,0 +1,167 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.ftvec.text;
+
+import hivemall.UDFWithOptions;
+import org.apache.commons.cli.CommandLine;
+import org.apache.commons.cli.Options;
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.Hive;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import hivemall.utils.hadoop.HiveUtils;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import 
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
+import 
org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+import 
org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
+import org.apache.hadoop.io.DoubleWritable;
+
+import javax.annotation.Nonnull;
+import java.util.Arrays;
+
+@Description(name = "okapi_bm25",
+value = "_FUNC_(double tf_word, int dl, double avgdl, int N, int n 
[, const string options]) - Return an Okapi BM25 score in float")
+//TODO: What does stateful mean?
--- End diff --


https://hive.apache.org/javadocs/r1.2.2/api/org/apache/hadoop/hive/ql/udf/UDFType.html#stateful()

So, `stateful = false` is okay. Please remove this TODO comment.


---


[GitHub] incubator-hivemall pull request #162: [HIVEMALL-217] Resolve missing links f...

2018-09-06 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/162#discussion_r215535078
  
--- Diff: docs/gitbook/tips/emr.md ---
@@ -21,15 +21,15 @@
 
 ## Prerequisite
 Learn how to use Hive with Elastic MapReduce (EMR).  

-http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive.html
+https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html
 
 Before launching an EMR job, 
 * create ${s3bucket}/emr/outputs for outputs
 * optionally, create ${s3bucket}/emr/logs for logging
-* put 
[emr_hivemall_bootstrap.sh](https://raw.github.com/myui/hivemall/master/scripts/misc/emr_hivemall_bootstrap.sh)
 on ${s3bucket}/emr/conf
+* put 
[emr_hivemall_bootstrap.sh](https://raw.githubusercontent.com/apache/incubator-hivemall/master/resources/misc/emr_hivemall_bootstrap.sh)
 on ${s3bucket}/emr/conf
 
 Then, lunch an EMR job with hive in an interactive mode.
-I'm usually lunching EMR instances with cheap Spot instances through [CLI 
client](http://aws.amazon.com/developertools/2264) as follows:
+I'm usually lunching EMR instances with cheap Spot instances through [CLI 
client](https://aws.amazon.com/jp/tools/) as follows:
--- End diff --

This should be `https://aws.amazon.com/tools/` (without the `/jp` locale segment).


---


[GitHub] incubator-hivemall pull request #162: [HIVEMALL-217] Resolve missing links f...

2018-09-04 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/162#discussion_r214870585
  
--- Diff: docs/gitbook/tips/hadoop_tuning.md ---
@@ -75,13 +75,13 @@ feature_dimensions (2^24 by the default) * 4 bytes 
(float) * 2 (iff covariance i
 ```
 > 2^24 * 4 bytes * 2 * 1.2 ≈ 161MB
 
-When 
[SpaceEfficientDenseModel](https://github.com/apache/incubator-hivemall/blob/master/src/main/java/hivemall/io/SpaceEfficientDenseModel.java)
 is used, the formula changes as follows:
+When 
[SpaceEfficientDenseModel](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/io/SpaceEfficientDenseModel.java)
 is used, the formula changes as follows:
--- End diff --

`github.com/myui` is deprecated. 

Use 
https://github.com/apache/incubator-hivemall/blob/master/core/src/main/java/hivemall/model/SpaceEfficientDenseModel.java
 instead

Please replace the other occurrences of `github.com/myui` as well.
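
As an aside on the quoted context line, the `≈ 161MB` figure can be reproduced with a few lines of arithmetic. This is only a sketch under assumptions: the class and method names are hypothetical, the doubling for covariance follows the truncated `iff covariance i` fragment in the hunk header, and the `1.2` factor is copied from the quoted line without knowing what it accounts for.

```java
// Hypothetical helper; reproduces "2^24 * 4 bytes * 2 * 1.2" from the quoted doc line.
final class ModelSizeEstimate {
    // Estimated bytes for a dense model with the given number of feature dimensions.
    static double estimateBytes(long featureDimensions, int bytesPerWeight,
            boolean withCovariance, double overheadFactor) {
        long bytesPerFeature = (long) bytesPerWeight * (withCovariance ? 2 : 1);
        return featureDimensions * bytesPerFeature * overheadFactor;
    }

    public static void main(String[] args) {
        // 2^24 dimensions, 4-byte floats, covariance enabled, factor 1.2 as quoted
        double bytes = estimateBytes(1L << 24, 4, true, 1.2);
        // roughly 161 MB when counted in decimal megabytes (10^6 bytes)
        System.out.println(Math.round(bytes / 1e6) + " MB");
    }
}
```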


---


[GitHub] incubator-hivemall pull request #160: [HIVEMALL-163] Add IS_INFINITE, IS_FIN...

2018-09-03 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/160#discussion_r214800712
  
--- Diff: core/src/main/java/hivemall/tools/math/IsInfiniteUDF.java ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.tools.math;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+
+@Description(name = "is_infinite", value = "_FUNC_(x) - Determine if x is 
infinite.")
+public final class IsInfiniteUDF extends UDF {
+public Boolean evaluate(Double num) {
+if (num == null) {
+return null;
+} else {
+return !num.isNaN() && num.isInfinite();
--- End diff --

Is `!num.isNaN() &&` required? 
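
For reference, `Double.isInfinite()` already returns `false` for NaN, so the extra guard looks redundant. A minimal standalone check (the class name `IsInfiniteCheck` is made up for illustration and is not part of the PR; the method mirrors the UDF body with the NaN guard removed):

```java
// Hypothetical standalone copy of the UDF logic, without the NaN guard.
final class IsInfiniteCheck {
    static Boolean evaluate(Double num) {
        if (num == null) {
            return null; // NULL in, NULL out, as in the quoted UDF
        }
        return num.isInfinite(); // true only for +Infinity and -Infinity
    }

    public static void main(String[] args) {
        System.out.println(evaluate(Double.NaN));               // prints false
        System.out.println(evaluate(Double.POSITIVE_INFINITY)); // prints true
        System.out.println(evaluate(Double.NEGATIVE_INFINITY)); // prints true
        System.out.println(evaluate(1.0));                      // prints false
        System.out.println(evaluate(null));                     // prints null
    }
}
```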


---


[GitHub] incubator-hivemall pull request #161: [HIVEMALL-216] Fix Docker image based ...

2018-09-03 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/161#discussion_r214793366
  
--- Diff: docs/gitbook/docker/getting_started.md ---
@@ -17,29 +17,31 @@
   under the License.
 -->
 
+# Getting started with Hivemall on Docker
+
 This page introduces how to run Hivemall on Docker.
 
 
 
 >  Caution
 > This docker image contains a single-node Hadoop enviroment for 
evaluating Hivemall. Not suited for production uses.
 
-# Requirements
+## Requirements
 
  * Docker Engine 1.6+
  * Docker Compose 1.10+
 
-# 1. Build image
+## 1. Build image
--- End diff --

Could you remove `1.` and `2.`?

See what's happening in

http://hivemall.incubator.apache.org/userguide/docker/getting_started.html#1-build-image


---


[GitHub] incubator-hivemall pull request #159: [HIVEMALL-214][DOC] Update userguide f...

2018-08-31 Thread myui
GitHub user myui opened a pull request:

https://github.com/apache/incubator-hivemall/pull/159

[HIVEMALL-214][DOC] Update userguide for General Classifier/Regressor 
example

## What changes were proposed in this pull request?

Refine user guide for generic classifier/regressor and so on.

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-214

## How to use this feature?

See user guide.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/myui/incubator-hivemall HIVEMALL-214

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/159.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #159


commit 6f40c466e21c78238a74f9c2f227df8ae156b3e2
Author: Makoto Yui 
Date:   2018-08-31T07:38:17Z

Added general classifier example using a9a dataset

commit 4963b63ab685aa539c6c0f5f3cd3230215ba4df7
Author: Makoto Yui 
Date:   2018-08-31T07:46:31Z

Added assertions for deprecated contents

commit 472821279d70e4171b7cf391a09bac10c95e28cb
Author: Makoto Yui 
Date:   2018-08-31T08:02:13Z

Capitalized topics and fixed a typo

commit 649e77840ff154bd75cd7c1bfdfc245516b68b0d
Author: Makoto Yui 
Date:   2018-08-31T11:18:50Z

Refined user guide




---


[GitHub] incubator-hivemall issue #158: [HIVEMALL-215] Add step-by-step tutorial on S...

2018-08-30 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/158
  
@chezou Merged. Thank you for your first contribution!


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214236762
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,457 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
--- End diff --

Remove obvious `with Apache Hivemall`


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214222772
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,461 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history as
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_history;
+```
+
+> 5
+
--- End diff --

A general introduction to Apache Hive and HiveQL is not required in 
Hivemall's documentation. The base document was written to introduce Hivemall 
to TD's customers, who might not be aware of the differences between Hive and 
Presto.

You can start with `Apache Hivemall is a ... lines of query as follows:`


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214226384
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,461 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history as
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_history;
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows 
current Hivemall version, for example:
+
+```sql
+select hivemall_version();
+```
+
+> "0.5.1-incubating-SNAPSHOT"
+
+Below we list ML and relevant problems that Hivemall can solve:
+
+- [Binary and multi-class classification](../binaryclass/general.html)
+- [Regression](../regression/general.html)
+- [Recommendation](../recommend/cf.html)
+- [Anomaly detection](../anomaly/lof.html)
+- [Natural language processing](../misc/tokenizer.html)
+- [Clustering](../misc/tokenizer.html) (i.e., topic modeling)
+- [Data sketching](../misc/funcs.html#sketching)
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](../misc/funcs.html#binary-classification) UDF to tackle 
the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+See also more detailed [document for input 
format](../getting_started/input-format.html)).
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214223029
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,461 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history as
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_history;
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows 
current Hivemall version, for example:
+
+```sql
+select hivemall_version();
+```
+
+> "0.5.1-incubating-SNAPSHOT"
+
+Below we list ML and relevant problems that Hivemall can solve:
+
+- [Binary and multi-class classification](../binaryclass/general.html)
+- [Regression](../regression/general.html)
+- [Recommendation](../recommend/cf.html)
+- [Anomaly detection](../anomaly/lof.html)
+- [Natural language processing](../misc/tokenizer.html)
+- [Clustering](../misc/tokenizer.html) (i.e., topic modeling)
+- [Data sketching](../misc/funcs.html#sketching)
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
--- End diff --

Insert something like the following here:

You can create this table as follows:

```sql
create table if not exists purchase_history as ..
```


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214223937
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,461 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history as
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_history;
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows 
current Hivemall version, for example:
+
+```sql
+select hivemall_version();
+```
+
+> "0.5.1-incubating-SNAPSHOT"
+
+Below we list ML and relevant problems that Hivemall can solve:
+
+- [Binary and multi-class classification](../binaryclass/general.html)
+- [Regression](../regression/general.html)
+- [Recommendation](../recommend/cf.html)
+- [Anomaly detection](../anomaly/lof.html)
+- [Natural language processing](../misc/tokenizer.html)
+- [Clustering](../misc/tokenizer.html) (i.e., topic modeling)
+- [Data sketching](../misc/funcs.html#sketching)
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](../misc/funcs.html#binary-classification) UDF to tackle 
the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
--- End diff --

Better to insert the following sentence after the example.

Feature index and feature value are separated by a colon (`:`). When the 
colon is omitted, the value is considered to be `1.0`. So, a categorical 
feature `gender#male` is a [one-hot 
representation](https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science)
 of `index := gender#male` and `value := 1.0`. Note that `#` is not a special 
character.
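
The convention described here, as in the `price:600.0` and `gender#male` examples, can be sketched as follows. This is only an illustration under stated assumptions: `FeatureSketch` is a hypothetical class, not Hivemall's actual parser, and splitting on the first `:` is a choice made for this sketch.

```java
final class FeatureSketch {
    final String index;
    final double value;

    FeatureSketch(String index, double value) {
        this.index = index;
        this.value = value;
    }

    // Splits "index:value" on the first ':' (an assumption of this sketch).
    // When no ':' is present, the whole string is the index and the value is 1.0.
    static FeatureSketch parse(String feature) {
        int pos = feature.indexOf(':');
        if (pos < 0) {
            return new FeatureSketch(feature, 1.0);
        }
        String index = feature.substring(0, pos);
        double value = Double.parseDouble(feature.substring(pos + 1));
        return new FeatureSketch(index, value);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"price:600.0", "gender#male"}) {
            FeatureSketch f = parse(s);
            System.out.println(s + " => index=" + f.index + ", value=" + f.value);
        }
    }
}
```

In this sketch `gender#male` is kept whole as the index with value `1.0`, since `#` receives no special treatment.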


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213890176
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
+- Regression
+- Recommendation
+- Anomaly detection
+- Natural language processing
+- Clustering (i.e., topic modeling)
+- Data sketching
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification)
 UDF to tackle the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213890053
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
--- End diff --

remove `TD`


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213890384
  
--- Diff: docs/gitbook/SUMMARY.md ---
@@ -25,6 +25,7 @@
 * [Installation](getting_started/installation.md)
 * [Install as permanent 
functions](getting_started/permanent-functions.md)
 * [Input Format](getting_started/input-format.md)
+* [Step-by-Step Tutorial on Supervised 
Learning](getting_started/tutorial.md)
--- End diff --

Better moved to `Supervised Learning` or `Regression` section or  with 
renaming.


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213890012
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
--- End diff --

`TD console` should not appear here.


---


[GitHub] incubator-hivemall pull request #157: [HIVEMALL-212] Fix Classifier/Regresso...

2018-08-28 Thread myui
GitHub user myui opened a pull request:

https://github.com/apache/incubator-hivemall/pull/157

[HIVEMALL-212] Fix Classifier/Regressor not to forward zero weighted values

## What changes were proposed in this pull request?

Features with weight = 0.0 need not be saved in the prediction model, and 
omitting them reduces the size of the model. This PR therefore fixes 
Classifier/Regressor so that zero-weighted values are not forwarded.
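
The idea can be illustrated with a small sketch. It assumes the model is a plain map from feature name to non-null weight; `ZeroWeightFilterSketch` and `forwardable` are hypothetical names, and the actual change in this PR operates on Hivemall's own model classes rather than on a `Map`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ZeroWeightFilterSketch {
    // Returns only the entries worth emitting: features whose weight is non-zero.
    static Map<String, Float> forwardable(Map<String, Float> model) {
        Map<String, Float> out = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : model.entrySet()) {
            float w = e.getValue();
            if (w == 0.f) { // also true for -0.0f
                continue;   // a zero weight contributes nothing to the prediction
            }
            out.put(e.getKey(), w);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Float> model = new LinkedHashMap<>();
        model.put("price", 0.25f);
        model.put("gender#male", 0.f);
        model.put("category#book", -1.5f);
        System.out.println(forwardable(model).keySet());
    }
}
```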

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-212

## How was this patch tested?

unit tests and manual tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, 
for your commit?
- [ ] Did you run system tests on Hive (or Spark)?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/myui/incubator-hivemall HIVEMALL-212

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/157.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #157


commit 48aacae519837a7d69c5927cf0de470d29c6ee29
Author: Makoto Yui 
Date:   2018-08-27T09:54:16Z

Fixed not to hold zero weight features

commit 3954a2720502f027ff7f2b5b0cd08e1e77f66017
Author: Makoto Yui 
Date:   2018-08-27T09:54:43Z

Zero division handling

commit de16c54dcb7351ea901f81a3a4263eaef347bc60
Author: Makoto Yui 
Date:   2018-08-28T05:50:25Z

Fixed zero weighted feature handling

commit ddd88d42536dc2f59efdbcc9dfa86aeda3223a2f
Author: Makoto Yui 
Date:   2018-08-28T05:51:41Z

Added final




---


[GitHub] incubator-hivemall issue #156: [HIVEMALL-211][BUGFIX] Fixed Optimizer for re...

2018-08-24 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/156
  
Confirmed that the optimizer is working fine on a9a classification.
https://gist.github.com/myui/a33a06ff3cf7db0e63ba46ec29703e43


---


[GitHub] incubator-hivemall issue #156: [HIVEMALL-211][BUGFIX] Fixed Optimizer for re...

2018-08-24 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/156
  
@takuti fixed in 
https://github.com/apache/incubator-hivemall/pull/156/commits/84d1aeb9ca06fd5e6d83686b183543a1d57b06c8
 FYI


---


[GitHub] incubator-hivemall pull request #156: [HIVEMALL-211][BUGFIX] Fixed Optimizer...

2018-08-23 Thread myui
GitHub user myui opened a pull request:

https://github.com/apache/incubator-hivemall/pull/156

[HIVEMALL-211][BUGFIX] Fixed Optimizer for regularization updates

## What changes were proposed in this pull request?

This PR fixes a bug in the regularization scheme of Optimizer.

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-211

## How was this patch tested?

unit tests, manual tests on EMR

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, 
for your commit?
- [ ] Did you run system tests on Hive (or Spark)?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/myui/incubator-hivemall HIVEMALL-211

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/156.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #156


commit 84d1aeb9ca06fd5e6d83686b183543a1d57b06c8
Author: Makoto Yui 
Date:   2018-08-24T05:54:23Z

Fixed regularization scheme and updated Adagrad rule




---


[GitHub] incubator-hivemall issue #155: [HIVEMALL-201-2] Evaluate, fix and document F...

2018-08-22 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/155
  
@takuti will merge after EMR tests. FYI


---


[GitHub] incubator-hivemall pull request #155: [HIVEMALL-201-2] Evaluate, fix and doc...

2018-08-22 Thread myui
GitHub user myui opened a pull request:

https://github.com/apache/incubator-hivemall/pull/155

[HIVEMALL-201-2] Evaluate, fix and document FFM

## What changes were proposed in this pull request?

Applied some refactoring to #149 
This PR closes #149 

## What type of PR is it?

Hot Fix, Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-201

## How was this patch tested?

unit tests, manual tests

## How to use this feature?

Will be published at: 
http://hivemall.incubator.apache.org/userguide/binaryclass/criteo_ffm.html

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, 
for your commit?
- [x] Did you run system tests on Hive (or Spark)?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/myui/incubator-hivemall HIVEMALL-201-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/155.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #155


commit c4d6855d6286249e150e4c8dcd5413bcde339990
Author: Takuya Kitazawa 
Date:   2018-05-16T08:39:32Z

Use pre-defined constants in option description

commit f7e7e1d49e5fa2e4f4f50d55f85c5cdee3bb69b1
Author: Takuya Kitazawa 
Date:   2018-05-16T08:40:48Z

Fix mismatch between opts.addOption and cl.getOptionValue

commit 929781a982f86851e38d558bb79a239d90c90e76
Author: Takuya Kitazawa 
Date:   2018-05-16T08:41:34Z

Support FFM feature format in `l1_normalize` and `l2_normalize`

commit a1751361f8ae2204cdc6507514945ebaa1ddf179
Author: Takuya Kitazawa 
Date:   2018-05-21T06:02:14Z

Increase `alphaFTRL` in `testSampleEnableNorm` for convergence

commit ff049d776133d1bc0cf7e62d9740f22a3943f593
Author: Takuya Kitazawa 
Date:   2018-05-22T02:16:51Z

Fix typo

commit 35a02451fc4e8a55bbb49b7fede3c545145b7d6e
Author: Takuya Kitazawa 
Date:   2018-05-22T05:22:35Z

Fix bug in forward model

Due to typo, linear weights in model are not correctly forwarded.

commit 9782136e3059df1d334c814c9eb9455e1ec9b573
Author: Takuya Kitazawa 
Date:   2018-05-22T06:39:22Z

Fix order of computing AdaGrad learning rate

* Gradient includes regularization term
* Get sum of squared gradient after adding the latest gradient

See:

https://github.com/guestwalk/libffm/blob/7db5b4f1ad3af7eb5bd0c224b2fa5305e1a715d2/ffm.cpp#L219-L226
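
The two points in this commit message can be sketched as a single-weight update. This is only an illustration, not the Hivemall code: the class `AdaGradStep`, its fields, and the caller-supplied initial accumulator are hypothetical, and the `2 * lambda * w` regularization term follows the form used in the FFM model diff quoted elsewhere in this thread.

```java
/** Illustrative AdaGrad update for a single weight (hypothetical class, not Hivemall's). */
final class AdaGradStep {
    double w;          // current weight
    double sumSqGrad;  // running sum of squared gradients

    AdaGradStep(double w0, double initialSumSqGrad) {
        this.w = w0;
        this.sumSqGrad = initialSumSqGrad;
    }

    /** Applies one update and returns the learning rate that was used. */
    double update(double dloss, double xi, double lambda, double eta0) {
        // 1. The gradient includes the L2 regularization term.
        double grad = dloss * xi + 2.0 * lambda * w;
        // 2. Accumulate the latest squared gradient first ...
        sumSqGrad += grad * grad;
        // 3. ... then derive the learning rate from the updated sum.
        double eta = eta0 / Math.sqrt(sumSqGrad);
        w -= eta * grad;
        return eta;
    }
}
```

With this order, the learning rate of the very first step already depends on the first gradient.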

commit 2366d910581248249a4e69e1110675469a17ea99
Author: Takuya Kitazawa 
Date:   2018-05-22T06:47:03Z

Enable to specify initial learn rate for AdaGrad

commit f1fd20cd508a8473bd0fef037cd708d5c3379c5f
Author: Takuya Kitazawa 
Date:   2018-05-22T08:35:36Z

Make `-max_init_value` more meaningful

In fact, the code sampled random value from [0, max_init_value / k], but
users expect that each element in V is exactly initialized random values
in [0, max_init_value].
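
The difference between the two ranges can be sketched as below. The helper class `UniformInit` and its method names are hypothetical; the arithmetic mirrors the `uniformFill` diff discussed in the review of #149, assuming the array length equals the number of factors `k`.

```java
import java.util.Random;

/** Illustrative comparison of the two initialization ranges (hypothetical helper). */
final class UniformInit {
    /** Old behavior: values fall in [0, maxInitValue / a.length). */
    static void fillScaled(float[] a, Random rand, float maxInitValue) {
        final float basev = maxInitValue / a.length;
        for (int i = 0; i < a.length; i++) {
            a[i] = rand.nextFloat() * basev;
        }
    }

    /** New behavior: values fall in [0, maxInitValue). */
    static void fillFull(float[] a, Random rand, float maxInitValue) {
        for (int i = 0; i < a.length; i++) {
            a[i] = rand.nextFloat() * maxInitValue;
        }
    }
}
```

For example, with `k = 4` and `-max_init_value 0.5`, the old code drew values below 0.125, while the new code draws values below 0.5.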

commit 478f26dab385b3835cdfbe19d40beef47336d92d
Author: Takuya Kitazawa 
Date:   2018-05-23T05:19:17Z

Add `-l2norm` option to FeaturePairsUDTF

Users can configure if feature vector is L2 normalized in a similar way
to `train_ffm`.

commit 3627ca84e857210aa921fd607fed19759d26fba0
Author: Takuya Kitazawa 
Date:   2018-05-23T06:27:02Z

Switch `-disable_wi` option to `-enable_wi`

commit e2c378f5134c67d25047169324c6aa9df62e8b8f
Author: Takuya Kitazawa 
Date:   2018-05-23T07:01:09Z

Fix test broken by change of default learn rate for FFM+AdaGrad

commit 056dfde30437c9bbcfcaf292698ba97dfa67
Author: Takuya Kitazawa 
Date:   2018-05-23T07:27:34Z

FFM applies instance-wise L2 normalization by default

commit 91aed6ecdc5401d972eac534e54246c59fd15ebb
Author: Takuya Kitazawa 
Date:   2018-05-24T00:48:37Z

Increase default number of iterations to rely more on cv_test

commit dca7e5762d664039354d00da8c3ca9adccd5d7c2
Author: Takuya Kitazawa 
Date:   2018-05-24T04:23:24Z

Make default L2 regularization parameter smaller

New default value 0.0001 is same as FTRL and general
regressor/classifier.

0.01 was too large on small data; a model cannot be learnt successfully in
some cases. By contrast, LIBFFM uses a very small value, 0.00002, by
default. This commit sets 0.0001, a value between the two, as a
compromise.

commit f84c960285f04ada21fb346e94ed0b5683d31289
Author: Takuya Kitazawa 
Date:   2018-05-24T04:49:27Z

Increase default learn rate from 0.05 to 0.1

Referred the following implementations.

LIBFFM: 0.2 (with AdaGrad)

https://github.com/guestwalk/libffm/blob/740103e5eb920a4061dd8e977a2ede6d23c6910a/ffm.h#L31

libFM: 0.1

https://github.com/srendle/libfm

[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...

2018-08-20 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/149#discussion_r211470802
  
--- Diff: 
core/src/main/java/hivemall/fm/FieldAwareFactorizationMachineModel.java ---
@@ -123,17 +117,18 @@ void updateWi(final double dloss, @Nonnull final 
Feature x, final long t) {
 }
 
 final double Xi = x.getValue();
-float gradWi = (float) (dloss * Xi);
 
 final Entry theta = getEntryW(x);
 float wi = theta.getW();
 
-final float eta = eta(theta, t, gradWi);
-float nextWi = wi - eta * (gradWi + 2.f * _lambdaW * wi);
+float grad = (float) (dloss * Xi + 2.f * _lambdaW * wi);
--- End diff --

regularization should not be performed here (?)


---


[GitHub] incubator-hivemall issue #139: [HIVEMALL-182][SPARK][WIP] Add an optimizer r...

2018-08-13 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/139
  
@maropu is this PR still WIP?


---


[GitHub] incubator-hivemall issue #154: [HIVEMALL-210][BUGFIX] Fix a bug in lda_predi...

2018-08-05 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/154
  
@takuti thank you for the comments. Reflected your reviews.


---


[GitHub] incubator-hivemall issue #154: [HIVEMALL-210][BUGFIX] Fix a bug in lda_predi...

2018-08-04 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/154
  
@takuti could you review this PR?


---


[GitHub] incubator-hivemall pull request #154: [HIVEMALL-210][BUGFIX] Fix a bug in ld...

2018-08-04 Thread myui
GitHub user myui opened a pull request:

https://github.com/apache/incubator-hivemall/pull/154

[HIVEMALL-210][BUGFIX] Fix a bug in lda_predict/plsa_predict

## What changes were proposed in this pull request?

Fixed a bug in lda_predict/plsa_predict that duplicated term probability is 
[unexpectedly 
replaced](https://github.com/apache/incubator-hivemall/blame/a8a97d6e873d5a8a30b06f92ddc14d1ec95c2738/core/src/main/java/hivemall/topicmodel/LDAPredictUDAF.java#L396)

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-210

## How was this patch tested?

unit tests and manual tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, 
for your commit?
- [ ] Did you run system tests on Hive (or Spark)?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/myui/incubator-hivemall HIVEMALL-210

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/154.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #154


commit 4e38897afc7af92e82635198359103d79b25dc82
Author: Makoto Yui 
Date:   2018-08-04T17:55:59Z

Added sortable KeyValue structs

commit 2ab5bf5cf3862f20e7c5aa096cf8d7c65cde9b50
Author: Makoto Yui 
Date:   2018-08-04T17:56:37Z

Fixed a bug in lda_predict and plsa_predict




---


[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...

2018-07-26 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/149#discussion_r205390805
  
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineModel.java ---
@@ -399,9 +399,8 @@ public void initRandom(int factor, long seed) {
 protected static final void uniformFill(final float[] a, final Random 
rand,
 final float maxInitValue) {
 final int len = a.length;
-final float basev = maxInitValue / len;
 for (int i = 0; i < len; i++) {
-float v = rand.nextFloat() * basev;
+float v = rand.nextFloat() * maxInitValue;
--- End diff --

While this modified `random` initialization is not used for classification 
(and only for regression), your evaluation is only for classification. 

Thus, it's doubtful that this change contributed to improving accuracy.


---


[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...

2018-07-06 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/149#discussion_r200611442
  
--- Diff: 
core/src/main/java/hivemall/fm/FieldAwareFactorizationMachineModel.java ---
@@ -51,11 +50,6 @@
 public FieldAwareFactorizationMachineModel(@Nonnull FFMHyperParameters 
params) {
 super(params);
 this._params = params;
-if (params.useAdaGrad) {
-this._eta0 = 1.0f;
--- End diff --

better to use large default eta0 for adagrad.


---


[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...

2018-07-06 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/149#discussion_r200605210
  
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
@@ -351,19 +370,29 @@ private static void writeBuffer(@Nonnull ByteBuffer 
srcBuf, @Nonnull NioStateful
 srcBuf.clear();
 }
 
-public void train(@Nonnull final Feature[] x, final double y,
-final boolean adaptiveRegularization) throws HiveException {
+protected void checkInputVector(@Nonnull final Feature[] x) throws 
HiveException {
 _model.check(x);
+}
+
+protected void processValidationSample(@Nonnull final Feature[] x, 
final double y)
+throws HiveException {
+if (_adaptiveRegularization) {
+trainLambda(x, y); // adaptive regularization
+}
+if (_earlyStopping) {
+double p = _model.predict(x);
+double loss = _lossFunction.loss(p, y);
+_validationState.incrLoss(loss);
+}
+}
+
+public void train(@Nonnull final Feature[] x, final double y, final 
boolean validation)
+throws HiveException {
+checkInputVector(x);
--- End diff --

Avoid too many virtual method calls.

`_model.check(x);` is enough both for FM and FFM.


---


[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...

2018-07-06 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/149#discussion_r200604967
  
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
@@ -351,19 +370,29 @@ private static void writeBuffer(@Nonnull ByteBuffer 
srcBuf, @Nonnull NioStateful
 srcBuf.clear();
 }
 
-public void train(@Nonnull final Feature[] x, final double y,
-final boolean adaptiveRegularization) throws HiveException {
+protected void checkInputVector(@Nonnull final Feature[] x) throws 
HiveException {
 _model.check(x);
+}
+
+protected void processValidationSample(@Nonnull final Feature[] x, 
final double y)
+throws HiveException {
+if (_adaptiveRegularization) {
+trainLambda(x, y); // adaptive regularization
+}
+if (_earlyStopping) {
--- End diff --

It is better to perform earlyStopping before adaptiveRegularization.


---


[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...

2018-07-06 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/149#discussion_r200590772
  
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
@@ -283,9 +293,16 @@ public void process(Object[] args) throws 
HiveException {
 }
 
 ++_t;
-recordTrain(x, y);
-boolean adaptiveRegularization = (_va_rand != null) && _t >= 
_validationThreshold;
-train(x, y, adaptiveRegularization);
+
+boolean validation = false;
+if ((_va_rand != null) && _t >= _validationThreshold) {
+final float rnd = _va_rand.nextFloat();
+validation = rnd < _validationRatio;
+}
+
+recordTrain(x, y, validation);
+
+train(x, y, validation);
--- End diff --

Validation examples are fixed in this implementation. Also, not using 
non-validation examples for regularization is a bad strategy.


---


[GitHub] incubator-hivemall issue #153: [HIVEMALL-208] Upgrade to Lucene 5.5.5

2018-07-05 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/153
  
@iijima-satoshi LGTM. Merged. Thank you for your contribution!


---


[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...

2018-07-04 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/149#discussion_r200093417
  
--- Diff: 
core/src/main/java/hivemall/fm/FieldAwareFactorizationMachineModel.java ---
@@ -259,9 +255,9 @@ protected final float eta(@Nonnull final Entry theta, 
final long t, final float
 protected final float eta(@Nonnull final Entry theta, @Nonnegative 
final int f, final long t,
 final float grad) {
 if (_useAdaGrad) {
-double gg = theta.getSumOfSquaredGradients(f);
--- End diff --

@takuti This behavior (the one used in libffm) is wrong in a strict sense, and the 
previous code is much better: the initial eta should equal `eta0`, but in 
this implementation it depends on the initial gradient. 
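
To make the point concrete, here is a small sketch comparing the learning rate of the first step under both orders. It assumes an AdaGrad rate of the form `eta0 / sqrt(eps + gg)` with a smoothing term `eps = 1.0`; the class and method names are hypothetical and not taken from Hivemall or libffm.

```java
/** Illustrative comparison of the first AdaGrad step (hypothetical helper). */
final class FirstStepEta {
    /** Learning rate read before the current gradient is accumulated. */
    static double etaBefore(double eta0, double eps, double sumSqGrad) {
        return eta0 / Math.sqrt(eps + sumSqGrad);
    }

    /** Learning rate read after the current gradient is accumulated. */
    static double etaAfter(double eta0, double eps, double sumSqGrad, double grad) {
        return eta0 / Math.sqrt(eps + sumSqGrad + grad * grad);
    }
}
```

Under these assumptions, `etaBefore(eta0, 1.0, 0.0)` is exactly `eta0`, whereas `etaAfter` shrinks as the first gradient grows.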


---


[GitHub] incubator-hivemall issue #153: [HIVEMALL-208] Upgrade to Lucene 5.5.5

2018-06-27 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/153
  
@iijima-satoshi 

Thank you for the contribution. Will merge after testing.

@takuti 

You need to update Lucene version to `5.5.5` in `tokenize_ja_kuromoji`.
https://github.com/takuti/hive-udf-neologd/blob/master/pom.xml#L16


---


[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM

2018-06-21 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
> -lambda 0.0001 (default), -init_v adjusted_random
0.6756640217829124  0.8644404496920104

> -lambda 0.001, -init_v adjusted_random
0.6749224090640931  0.8642914100412997

> -lambda 0.002, -init_v adjusted_random
0.6729486759257253  0.862249033512779

> -lambda 0.01, -init_v adjusted_random
0.6728088660666263  0.8568219312625348

• libfm
```
eta=0.1
init_stdev=0.1
reg0 = 0.0;
regw = 0.0;
regv = 0.0;
```


https://github.com/srendle/libfm/blob/4ba0e0d5646da5d00701d853d19fbbe9b236cfd7/src/libfm/libfm.cpp#L87

https://github.com/srendle/libfm/blob/30b9c799c41d043f31565cbf827bf41d0dc3e2ab/src/fm_core/fm_model.h#L73

• libffm
```
eta = 0.2; // learning rate
lambda = 0.00002; // regularization parameter
nr_iters = 15;
k = 4; // number of latent factors
```


https://github.com/srendle/libfm/blob/4ba0e0d5646da5d00701d853d19fbbe9b236cfd7/src/libfm/libfm.cpp#L84


---


[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM

2018-06-19 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
revising in 
https://github.com/myui/incubator-hivemall/commits/HIVEMALL-201-2


---


[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM

2018-06-19 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
@takuti with the modified default hyperparameters of FM, the performance of 
FM gets worse.

Before 
> 0.6736798239047873 (mae) 0.858938110314545 (rmse)

After
> 0.6837803085633278 (mae) 0.876690912076831 (rmse)

http://hivemall.incubator.apache.org/userguide/recommend/movielens_fm.html


---


[GitHub] incubator-hivemall pull request #151: Relocated org.codehaus.jackson to hive...

2018-06-10 Thread myui
GitHub user myui opened a pull request:

https://github.com/apache/incubator-hivemall/pull/151

Relocated org.codehaus.jackson to hivemall.codehause.jackson in 
hivemall-all.jar

## What changes were proposed in this pull request?

Relocated `org.codehaus.jackson` to `hivemall.codehause.jackson` in 
hivemall-all.jar because Jackson can be missing in some Hadoop/Hive environments.

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-203

## How was this patch tested?

manual tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, 
for your commit?
- [ ] Did you run system tests on Hive (or Spark)?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/myui/incubator-hivemall relocate_jackson

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/151.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #151


commit a07350f2e8e69a6fcd494df714f1108476b97bc8
Author: Makoto Yui 
Date:   2018-06-10T10:00:30Z

Relocated org.codehaus.jackson to hivemall.codehause.jackson in 
hivemall-all.jar




---


[GitHub] incubator-hivemall issue #135: [WIP][HIVEMALL-145] Merge Brickhouse function...

2018-06-06 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/135
  
I'm going to merge this PR to master. If you find any problem, please 
comment here.


---


[GitHub] incubator-hivemall issue #135: [WIP][HIVEMALL-145] Merge Brickhouse function...

2018-06-05 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/135
  
For K-minimum Values (KMV) and Sketch related code, I'll create another 
JIRA ticket.

For other UDFs, we accept incoming PRs.

https://docs.google.com/spreadsheets/d/1gtFNcTvPR9OZAsbobj2D9d37tOx4nAoSlib9CLdEDQg/edit#gid=0


---


[GitHub] incubator-hivemall issue #135: [WIP][HIVEMALL-145] Merge Brickhouse function...

2018-06-05 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/135
  
@jeromebanks I'm considering merging this PR. Could you review it if possible?


---


[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM

2018-06-04 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
We need to keep the default hyperparameters of FM as they are for backward 
compatibility. I'll take care of it on merging.


---


[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM

2018-05-30 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
@takuti Sure.


---


[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

2018-05-29 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/149#discussion_r191645232
  
--- Diff: 
core/src/test/java/hivemall/fm/FieldAwareFactorizationMachineUDTFTest.java ---
@@ -256,6 +256,19 @@ public void testEarlyStopping() throws HiveException, 
IOException {
 cumulativeLoss > udtf._validationState.getCumulativeLoss());
 }
 
+@Test(expected = IllegalArgumentException.class)
+public void testUnsupportedAdaptiveRegularizationOption() throws 
Exception {
+
TestUtils.testGenericUDTFSerialization(FieldAwareFactorizationMachineUDTF.class,
+new ObjectInspector[] {
+ObjectInspectorFactory.getStandardListObjectInspector(
+
PrimitiveObjectInspectorFactory.javaStringObjectInspector),
+
PrimitiveObjectInspectorFactory.javaDoubleObjectInspector,
+ObjectInspectorUtils.getConstantObjectInspector(
+
PrimitiveObjectInspectorFactory.javaStringObjectInspector,
+"-seed 43 -adaptive_regularization")},
+new Object[][] {{Arrays.asList("0:1:-2", "1:2:-1"), 1.0}});
--- End diff --

Better to compare accuracy against the default regularization. In general, 
it should be better than the default one.


---


[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

2018-05-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/149#discussion_r191309443
  
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineModel.java ---
@@ -92,6 +92,14 @@ protected float getW(int i) {
 
 protected abstract void setW(@Nonnull Feature x, float nextWi);
 
+protected void setW(int i, float nextWi) {
--- End diff --

No need to have `protected void setW(int i, float nextWi)` and `protected 
void setW(@Nonnull String j, float nextWi)` in FactorizationMachineModel.




---


[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

2018-05-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/149#discussion_r191308798
  
--- Diff: core/src/main/java/hivemall/fm/FMArrayModel.java ---
@@ -80,6 +80,11 @@ public float getW(@Nonnull final Feature x) {
 @Override
 protected void setW(@Nonnull Feature x, float nextWi) {
 int i = x.getFeatureIndex();
+setW(i, nextWi);
--- End diff --

better to avoid method call.


---


[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

2018-05-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/149#discussion_r191309836
  
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineModel.java ---
@@ -92,6 +92,14 @@ protected float getW(int i) {
 
 protected abstract void setW(@Nonnull Feature x, float nextWi);
 
+protected void setW(int i, float nextWi) {
--- End diff --

`setW(int i, float nextWi)` is no longer used once caching is avoided in early 
stopping.


---


[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

2018-05-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/149#discussion_r191298514
  
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
@@ -379,23 +379,28 @@ protected void checkInputVector(@Nonnull final 
Feature[] x) throws HiveException
 _model.check(x);
 }
 
+protected void processValidationSample(@Nonnull final Feature[] x, 
final double y)
+throws HiveException {
+if (_adaptiveRegularization) {
+trainLambda(x, y); // adaptive regularization
--- End diff --

`FFM fully ignores adaptive regularization option` is expected behavior.
AdaptiveRegularization has not been tested with FFM and/or FTRL.


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-28 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
This kind of behavior can happen often, and libffm's early stopping 
strategy is too aggressive.

```
   7  0.43239  0.46952
   8  0.42362  0.46999
   9  0.41394  0.45088 
```


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-28 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
```
iter   tr_logloss   va_logloss
   1  0.49738  0.48776
   2  0.47383  0.47995
   3  0.46366  0.47480
   4  0.45561  0.47231
   5  0.44810  0.47034
   6  0.44037  0.47003
   7  0.43239  0.46952
   8  0.42362  0.46999 <- ffm stops once va_logloss has increased,
                          but va_logloss might decrease in the next iteration
   9  0.41394  0.47088 <- once 
```

In the 8th iteration, mark the state as `ready to stop once va_logloss increases`. 
If va_logloss decreases in the 9th iteration, then continue iterating (set 
not ready to finish).
If va_logloss increases in the 9th iteration, then emit the current model 
in the 9th iteration.
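
A minimal sketch of the two-step rule described above, assuming one validation loss per iteration. The class `TwoStepEarlyStopping` and its methods are hypothetical names for illustration, not the Hivemall implementation.

```java
/** Illustrative two-step early stopping (hypothetical class, not Hivemall's). */
final class TwoStepEarlyStopping {
    private double prevLoss = Double.POSITIVE_INFINITY;
    private boolean readyToStop = false;

    /** Feeds the validation loss of one iteration; returns true when training should stop. */
    boolean shouldStop(double vaLoss) {
        boolean increased = vaLoss > prevLoss;
        prevLoss = vaLoss;
        if (!increased) {
            readyToStop = false; // loss went down again: keep iterating
            return false;
        }
        if (readyToStop) {
            return true; // second consecutive increase: emit the current model
        }
        readyToStop = true; // first increase: only arm the stop flag
        return false;
    }

    /** Returns the 1-based iteration at which training stops, or -1 if it never does. */
    static int stopIteration(double[] vaLosses) {
        TwoStepEarlyStopping es = new TwoStepEarlyStopping();
        for (int i = 0; i < vaLosses.length; i++) {
            if (es.shouldStop(vaLosses[i])) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

Fed with the `va_logloss` column of the table above, this sketch arms the flag in the 8th iteration and stops in the 9th; if the 9th value were lower than the 8th, it would keep iterating.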


---


[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

2018-05-25 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/149#discussion_r190843344
  
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
@@ -352,9 +352,13 @@ private static void writeBuffer(@Nonnull ByteBuffer 
srcBuf, @Nonnull NioStateful
 srcBuf.clear();
 }
 
+protected void checkInputVector(@Nonnull final Feature[] x) throws 
HiveException {
+_model.check(x);
+}
+
 public void train(@Nonnull final Feature[] x, final double y,
 final boolean adaptiveRegularization) throws HiveException {
-_model.check(x);
+checkInputVector(x);
 
 try {
 if (adaptiveRegularization) {
--- End diff --

I think there is no need to share `train` if `adaptiveRegularization` is 
always off for FFM and `early_stopping` is always off for FM. Otherwise, the logic in 
train becomes complex.


---


[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

2018-05-25 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/149#discussion_r190842171
  
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
@@ -563,6 +580,10 @@ protected void runTrainingIteration(int iterations) 
throws HiveException {
 inputBuf.flip();
 
 for (int iter = 2; iter <= iterations; iter++) {
+if (earlyStopValidation) {
--- End diff --

Better to avoid repeating `if (earlyStopValidation) {`.

`_validateState` can always be non-null as long as `if (earlyStopValidation && 
_validateState.isLossIncreased())` is never true.


---


[GitHub] incubator-hivemall issue #150: update conv.awk location

2018-05-24 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/150
  
Merged. Thanks.


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-23 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
It might be better to reconsider `eta0` when enabling `l2norm` by 
default and enlarging `max_init_size`. In my experience with FM, the random init 
range should be small when the average feature dimension is large (gradients will be 
large).

I think `1.0` is too aggressive for the default though. `0.2` or `0.5`? 
Better to research other implementations.


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-22 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
@takuti so then, it is better to enable L2 normalization by default and to provide 
`-disable_l2norm` to disable it. My concern is that L2 
normalization performed worse for small datasets with an adequate learning rate 
in `[0.1,1.0]`. 

FieldAwareFactorizationMachineUDTFTest contains several tests. It's better 
to verify that accuracy does not degrade with the new default options, i.e., with L2 
normalization enabled.

---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-22 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
Also, it's better to revise default `-iters` from 1 to 10 (at least 10 
iterations with early stopping).


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-22 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
BTW, it might be better to implement `early stopping` using validation data.
https://github.com/guestwalk/libffm

We can use an approach similar to the `_validationRatio` used in 
`FactorizationMachineUDTF` instead of preparing a validation dataset.
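
A rough sketch of such ratio-based sampling is shown below. The class `ValidationSampler` and its fields are hypothetical; the threshold and ratio checks follow the `process` diff quoted in the later review of #149, but this is not the actual `FactorizationMachineUDTF` code.

```java
import java.util.Random;

/** Illustrative ratio-based hold-out of validation samples (hypothetical helper). */
final class ValidationSampler {
    private final Random rand;
    private final float validationRatio;
    private final long validationThreshold;
    private long t = 0L;

    ValidationSampler(long seed, float validationRatio, long validationThreshold) {
        this.rand = new Random(seed);
        this.validationRatio = validationRatio;
        this.validationThreshold = validationThreshold;
    }

    /** Returns true if the next example should be treated as a validation sample. */
    boolean nextIsValidation() {
        ++t;
        if (t < validationThreshold) {
            return false; // use the first examples for training only
        }
        return rand.nextFloat() < validationRatio;
    }
}
```

Each incoming example is then routed either to training or to the validation loss, so no separate validation table has to be prepared.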


---

