[GitHub] incubator-hivemall pull request #175: [WIP][HIVEMALL-230] Revise Optimizer I...
GitHub user myui opened a pull request: https://github.com/apache/incubator-hivemall/pull/175 [WIP][HIVEMALL-230] Revise Optimizer Implementation ## What changes were proposed in this pull request? Revise Optimizer implementation. 1. Revise default hyperparameters of AdaDelta and Adam. 2. Support AdamW, AdamHD, Eve, and YellowFin optimizer. * Fixing Weight Decay Regularization in Adam https://openreview.net/forum?id=rk6qdGgCZ * On the Convergence of Adam and Beyond https://openreview.net/forum?id=ryQu7f-RZ * AdamHD (Adam with Hypergradient descent) https://arxiv.org/pdf/1703.04782.pdf * Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates https://arxiv.org/abs/1611.01505 * YellowFin and the Art of Momentum Tuning https://arxiv.org/abs/1706.03471 ## What type of PR is it? Improvement, Feature ## What is the Jira issue? https://issues.apache.org/jira/browse/HIVEMALL-230 ## How was this patch tested? unit tests, emr (to appear) ## How to use this feature? to appear ## Checklist - [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit? - [ ] Did you run system tests on Hive (or Spark)? 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/myui/incubator-hivemall adam_test Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hivemall/pull/175.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #175 commit 5168cf06bf03c38f005d435a4415ce8cb8140891 Author: Makoto Yui Date: 2018-12-03T07:04:29Z Added ongoing unit test files commit ed1b6302183a687a3584fe62ce5fa92b26c828ad Author: Makoto Yui Date: 2018-12-04T09:41:42Z Fixed to show ETA in debug log commit 5c9d63f9fc184f05eed28f03986c6269c4ea6e93 Author: Makoto Yui Date: 2018-12-04T09:42:02Z Added unit tests commit 243f4b40899b960f4942c75f89c0c4c94974b03b Author: Makoto Yui Date: 2018-12-05T09:48:17Z Added comments commit ae29e9a669dcd311b154615e19900ec4b01fd4d8 Author: Makoto Yui Date: 2018-12-06T07:08:48Z Refactored commit c25ce02db537570c6ed75db74d9a3783b316c694 Author: Makoto Yui Date: 2018-12-06T07:10:05Z Added square() method commit 71671d10138aa54c0485809b6126753a54dbe3e8 Author: Makoto Yui Date: 2018-12-06T07:10:42Z Added helper methods commit 6f4edbbaaac37884533132dea00c81f36da45e50 Author: Makoto Yui Date: 2018-12-06T07:22:51Z Refactored ADAM implementation commit e61f22afaa46bdf705c2760cebaa601929a77608 Author: Makoto Yui Date: 2018-12-06T08:52:08Z Added logging message commit 22c3f7c132fc01528c93c6e15d40a2b70f1771c0 Author: Makoto Yui Date: 2018-12-06T08:53:01Z Improved -eta option to take eta0 for Fixed ETA estimator commit e9b9b1420c3b573b5cbe15e4340d862251fac81d Author: Makoto Yui Date: 2018-12-06T08:53:28Z Added unit test commit 7c6e4a1da5eaeb99c02a9a83f1519d5274131037 Author: Makoto Yui Date: 2018-12-06T09:06:16Z Made eta default hyper-parameter flexible for each optimizer commit a92293906d43c25ce47032644774723a0cf713d9 Author: Makoto Yui Date: 2018-12-06T09:36:26Z Changed the default hyperparameter of 
AdaDelta commit 1494ea298497a846650b2d9f6799add77105ae77 Author: Makoto Yui Date: 2018-12-07T05:03:21Z Reduced the size of test data commit 79197a84ca4d840ab3150730d5e6d4a5ad96e719 Author: Makoto Yui Date: 2018-12-07T05:39:13Z Improved -help option handling commit 4fdcf6c84ec81c174f5e107038660b1200b1a9a5 Author: Makoto Yui Date: 2018-12-07T05:48:07Z Added assertions commit e1c7a68df679a65f496268bd4acc286b19d0a964 Author: Makoto Yui Date: 2018-12-07T07:39:58Z Fixed AdaDelta eta to 1.0 commit b8e5698ecd7e7d2758ef85a338c053f5bbcc663d Author: Makoto Yui Date: 2018-12-07T09:13:48Z Supported -amsgrad in Adam commit aa512c3b71039f97c2ac08b598fcb11f1cfc4d80 Author: Makoto Yui Date: 2018-12-07T09:59:59Z Supported -decay option in ADAM optimizer commit 19bd276ff9867ba93f42c241feb9aa5aafd0836c Author: Makoto Yui Date: 2018-12-07T10:15:24Z Revise the default eta0/alpha value commit 19fa61145e8be18c3f86988905b35f171e1ee50e Author: Makoto Yui Date: 2018-12-10T08:37:05Z Revised ADAM hyperparameter treatment ---
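For readers unfamiliar with the options named in the commits above (`-amsgrad`, `-decay`), the following is a minimal sketch of a single-weight Adam step with an optional AMSGrad maximum and decoupled (AdamW-style) weight decay. It is not the code from this pull request: the class name `AdamStepSketch` and its constructor parameters are hypothetical, the beta and epsilon values are the defaults from the original Adam paper rather than Hivemall's revised defaults, and the AMSGrad variant shown (maximum of the raw second moment, then bias correction) is only one common formulation.

```java
// Hypothetical sketch of a scalar Adam step; NOT the implementation in this pull request.
final class AdamStepSketch {
    private final double alpha;   // step size (learning rate)
    private final double beta1;   // decay rate of the first-moment estimate
    private final double beta2;   // decay rate of the second-moment estimate
    private final double eps;     // small constant for numerical stability
    private final double decay;   // decoupled weight decay coefficient (0 disables it)
    private final boolean amsgrad;

    private double m;     // first-moment estimate
    private double v;     // second-moment estimate
    private double vMax;  // running maximum of v, used only when amsgrad is true
    private long t;       // number of updates applied so far

    AdamStepSketch(double alpha, double decay, boolean amsgrad) {
        this.alpha = alpha;
        this.beta1 = 0.9d;   // Adam paper default (assumption: not Hivemall's default)
        this.beta2 = 0.999d; // Adam paper default (assumption: not Hivemall's default)
        this.eps = 1e-8d;
        this.decay = decay;
        this.amsgrad = amsgrad;
    }

    /** Returns the new weight after one update with the given gradient. */
    double update(double weight, double gradient) {
        t++;
        m = beta1 * m + (1.d - beta1) * gradient;
        v = beta2 * v + (1.d - beta2) * gradient * gradient;

        double second = v;
        if (amsgrad) {
            // AMSGrad: never let the second-moment term shrink between steps
            vMax = Math.max(vMax, v);
            second = vMax;
        }

        // bias-corrected moment estimates
        double mHat = m / (1.d - Math.pow(beta1, t));
        double vHat = second / (1.d - Math.pow(beta2, t));

        double step = alpha * mHat / (Math.sqrt(vHat) + eps);
        // decoupled weight decay: applied to the weight directly, not added to the gradient
        return weight - step - alpha * decay * weight;
    }
}
```

With `decay` set to 0 and `amsgrad` set to false, this reduces to the plain Adam update; on the first step the bias correction makes the update magnitude approximately `alpha` regardless of the gradient scale.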
[GitHub] incubator-hivemall pull request #173: [HIVEMALL-227][DOC] Removed md5 and re...
GitHub user myui opened a pull request: https://github.com/apache/incubator-hivemall/pull/173 [HIVEMALL-227][DOC] Removed md5 and replace sha1 with sha512 following new ASF policy ## What changes were proposed in this pull request? Removed md5 and replace sha1 with sha512 following new ASF policy ## What type of PR is it? Documentation ## What is the Jira issue? https://issues.apache.org/jira/browse/HIVEMALL-227 You can merge this pull request into a Git repository by running: $ git pull https://github.com/myui/incubator-hivemall HIVEMALL-227 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hivemall/pull/173.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #173 commit 583eb9991cf8db730d46b431b1cb80ebaeb293a8 Author: Makoto Yui Date: 2018-11-15T09:18:39Z Removed md5 and replace sha1 with sha512 following new ASF policy ---
[GitHub] incubator-hivemall issue #171: [SPARK][HOTFIX] Fix the existing test failure...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/171 Merged. Thanks! ---
[GitHub] incubator-hivemall issue #172: Fix typo
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/172 Merged, thanks! ---
[GitHub] incubator-hivemall pull request #171: [SPARK][HOTFIX][WIP] Fix existing test...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/171#discussion_r233324312 --- Diff: spark/spark-2.3/src/test/scala/org/apache/spark/sql/hive/XGBoostSuite.scala --- @@ -77,6 +77,7 @@ final class XGBoostSuite extends VectorQueryTest { val model = hiveContext.sparkSession.read.format("libxgboost").load(tempDir) val predict = model.join(mllibTestDf) .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model") --- End diff -- BTW, could you paste the stack trace of the exception? ---
[GitHub] incubator-hivemall pull request #171: [SPARK][HOTFIX][WIP] Fix existing test...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/171#discussion_r233288186 --- Diff: spark/pom.xml --- @@ -52,6 +52,12 @@ hivemall-core ${project.version} compile + + + io.netty + netty-all + --- End diff -- ah... I see. ---
[GitHub] incubator-hivemall pull request #171: [SPARK][HOTFIX][WIP] Fix existing test...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/171#discussion_r233287740 --- Diff: spark/spark-2.3/src/main/scala/org/apache/spark/sql/hive/HivemallOps.scala --- @@ -1935,18 +1935,6 @@ object HivemallOps { ) } - /** - * @see [[hivemall.tools.array.SubarrayUDF]] - * @group tools.array - */ - def subarray(original: Column, fromIndex: Column, toIndex: Column): Column = withExpr { -planHiveUDF( - "hivemall.tools.array.SubarrayUDF", - "subarray", - original :: fromIndex :: toIndex :: Nil -) - } --- End diff -- Isn't it easy to replace SubarrayUDF with ArraySliceUDF? ``` def subarray(original: Column, fromIndex: Column, length: Column): Column = withExpr { planHiveUDF( "hivemall.tools.array.ArraySliceUDF", ``` ---
[GitHub] incubator-hivemall pull request #171: [SPARK][HOTFIX][WIP] Fix existing test...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/171#discussion_r233287092 --- Diff: spark/spark-2.3/src/test/scala/org/apache/spark/sql/hive/XGBoostSuite.scala --- @@ -77,6 +77,7 @@ final class XGBoostSuite extends VectorQueryTest { val model = hiveContext.sparkSession.read.format("libxgboost").load(tempDir) val predict = model.join(mllibTestDf) .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model") --- End diff -- Let's disable xgboost for spark-2.3. ---
[GitHub] incubator-hivemall pull request #170: [WIP][HIVEMALL-223] Add -kv_map and -v...
GitHub user myui opened a pull request: https://github.com/apache/incubator-hivemall/pull/170 [WIP][HIVEMALL-223] Add -kv_map and -vk_map option to to_ordered_list UDAF ## What changes were proposed in this pull request? Add `-kv_map` and `-vk_map` option to `to_ordered_list` UDAF. ## What type of PR is it? Improvement ## What is the Jira issue? https://issues.apache.org/jira/browse/HIVEMALL-223 ## How was this patch tested? unit tests and manual tests on EMR ## How to use this feature? Will be described in http://hivemall.incubator.apache.org/userguide/misc/generic_funcs.html#array ## Checklist - [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit? - [ ] Did you run system tests on Hive (or Spark)? You can merge this pull request into a Git repository by running: $ git pull https://github.com/myui/incubator-hivemall HIVEMALL-223 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hivemall/pull/170.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #170 commit 26f361ce7b355410772577f0754f4bb5537ababf Author: Makoto Yui Date: 2018-11-12T04:19:37Z Added -kv_map and -vk_map option commit 39ee911cb12e63f924229e962bbb00247297f75d Author: Makoto Yui Date: 2018-11-12T04:20:13Z Added WIP unit tests for -kv_map/vk_map option of to_ordered_list UDAF ---
[GitHub] incubator-hivemall issue #163: [HIVEMALL-196] Support BM25 scoring
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/163 @jaxony Merged with some modifications. Thank you for your first contribution to Apache Hivemall! ---
[GitHub] incubator-hivemall issue #168: [HIVEMALL-221] Add cache to reduce Maven buil...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/168 We might need to configure an ASF mirror to avoid timeouts when connecting to the default ASF repository. https://maven.apache.org/guides/mini/guide-mirror-settings.html https://code.i-harness.com/ja/q/c326f0 ---
[GitHub] incubator-hivemall issue #168: [HIVEMALL-221] Add cache to reduce Maven buil...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/168 ``` [WARNING] Could not transfer metadata org.apache.hivemall:hivemall-spark2.1:0.5.1-incubating-SNAPSHOT/maven-metadata.xml from/to apache.snapshots (https://repository.apache.org/snapshots): Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] failed: Connection timed out (Connection timed out) [WARNING] Failure to transfer org.apache.hivemall:hivemall-spark2.1:0.5.1-incubating-SNAPSHOT/maven-metadata.xml from https://repository.apache.org/snapshots/ was cached in the local repository, resolution will not be reattempted until the update interval of apache-snapshots has elapsed or updates are forced. Original error: Could not transfer metadata org.apache.hivemall:hivemall-spark2.1:0.5.1-incubating-SNAPSHOT/maven-metadata.xml from/to apache-snapshots (https://repository.apache.org/snapshots/): Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] failed: Connection timed out (Connection timed out) [WARNING] Failure to transfer org.apache.hivemall:hivemall-spark2.1:0.5.1-incubating-SNAPSHOT/maven-metadata.xml from https://repository.apache.org/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of apache.snapshots has elapsed or updates are forced. 
Original error: Could not transfer metadata org.apache.hivemall:hivemall-spark2.1:0.5.1-incubating-SNAPSHOT/maven-metadata.xml from/to apache.snapshots (https://repository.apache.org/snapshots): Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] failed: Connection timed out (Connection timed out) [INFO] Downloading from apache-snapshots: https://repository.apache.org/snapshots/org/apache/hivemall/hivemall-spark2.1/0.5.1-incubating-SNAPSHOT/hivemall-spark2.1-0.5.1-incubating-SNAPSHOT-sources.jar [INFO] Downloading from apache.snapshots: https://repository.apache.org/snapshots/org/apache/hivemall/hivemall-spark2.1/0.5.1-incubating-SNAPSHOT/hivemall-spark2.1-0.5.1-incubating-SNAPSHOT-sources.jar ``` Hmm, could we provide a mirror repository in Travis CI? ---
[GitHub] incubator-hivemall issue #168: [HIVEMALL-221] Add cache to reduce Maven buil...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/168 See what happens. ---
[GitHub] incubator-hivemall pull request #168: [HIVEMALL-221] Add cache to reduce Mav...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/168#discussion_r227797186 --- Diff: .travis.yml --- @@ -35,7 +40,7 @@ notifications: email: false script: - - ./bin/run_travis_tests.sh + - travis_wait 10 ./bin/run_travis_tests.sh --- End diff -- Please revert this change because it has no effect. ---
[GitHub] incubator-hivemall pull request #168: [HIVEMALL-221] Add cache to reduce Mav...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/168#discussion_r227796760 --- Diff: .travis.yml --- @@ -1,5 +1,10 @@ sudo: false +cache: + timeout: 1500 + directories: + - $HOME/.m2 --- End diff -- Shouldn't this be `$HOME/.m2/repository`? See: https://github.com/apache/kafka/blob/trunk/.travis.yml#L52 https://github.com/airlift/drift/blob/master/.travis.yml#L11 https://github.com/mesos/storm/blob/master/.travis.yml#L6 ---
[GitHub] incubator-hivemall issue #168: [HIVEMALL-221] Add cache to reduce Maven buil...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/168 @maropu Is this `clean` required? https://github.com/apache/incubator-hivemall/blob/master/bin/run_travis_tests.sh#L42 ---
[GitHub] incubator-hivemall pull request #169: [HIVEMALL-222] Introduce Gradient Clip...
GitHub user myui opened a pull request: https://github.com/apache/incubator-hivemall/pull/169 [HIVEMALL-222] Introduce Gradient Clipping to avoid exploding gradient to General Classifier/Regressor ## What changes were proposed in this pull request? Avoid [exploding gradients](http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/readings/L15%20Exploding%20and%20Vanishing%20Gradients.pdf) by gradient clipping (by value) ## What type of PR is it? Improvement ## What is the Jira issue? https://issues.apache.org/jira/browse/HIVEMALL-222 ## How was this patch tested? unit tests ## Checklist (Please remove this section if not needed; check `x` for YES, blank for NO) - [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit? - [ ] Did you run system tests on Hive (or Spark)? You can merge this pull request into a Git repository by running: $ git pull https://github.com/myui/incubator-hivemall clipping Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hivemall/pull/169.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #169 commit 0c10392d2a3c96b40df57e6b406333e0a239b9f9 Author: Makoto Yui Date: 2018-10-24T08:14:15Z Updated for debugging purpose commit e0dc4b954650c6751d6e37ee5ecf6c9656872b16 Author: Makoto Yui Date: 2018-10-24T08:15:03Z Introduced gradient clipping by value to avoid exploding gradients commit 7e932e99cfd990bb47ff7acfed44c19678fadc8f Author: Makoto Yui Date: 2018-10-24T08:15:52Z Added a unit test for gradient clipping ---
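As a rough illustration of what "gradient clipping by value" means in the pull request above, the sketch below clamps each gradient component into the interval `[-threshold, threshold]` before it is used in an update. This is a minimal sketch and not the code from the pull request: the class name `GradientClippingSketch`, the method names, and the threshold value in `main` are all hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of clipping by value; NOT Hivemall's actual implementation.
final class GradientClippingSketch {
    private GradientClippingSketch() {}

    /** Clamps a single gradient value into [-threshold, threshold]. */
    static double clip(double gradient, double threshold) {
        if (!(threshold > 0.d)) {
            throw new IllegalArgumentException("threshold must be positive: " + threshold);
        }
        return Math.max(-threshold, Math.min(threshold, gradient));
    }

    /** Clips every component of the gradient array in place. */
    static void clipInPlace(double[] gradients, double threshold) {
        for (int i = 0; i < gradients.length; i++) {
            gradients[i] = clip(gradients[i], threshold);
        }
    }

    public static void main(String[] args) {
        double[] g = {0.5d, -250.d, 42.d};
        clipInPlace(g, 1.d); // threshold of 1.0 is an arbitrary example value
        System.out.println(Arrays.toString(g)); // prints [0.5, -1.0, 1.0]
    }
}
```

Clipping by value bounds each component independently, so it can change the direction of the gradient vector; clipping by norm, which rescales the whole vector, is the usual alternative when direction must be preserved.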
[GitHub] incubator-hivemall issue #168: Add cache to reduce Maven build time on Travi...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/168 This does not seem to be working. Perhaps `timeout: 1000` would help? https://docs.travis-ci.com/user/caching/#setting-the-timeout Please add `[HIVEMALL-221]` to the PR title. ---
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226845079 --- Diff: core/src/main/java/hivemall/fm/Feature.java --- @@ -383,4 +383,10 @@ public static void l2normalize(@Nonnull final Feature[] features) { } } +@Override --- End diff -- See https://medium.com/codelog/overriding-hashcode-method-effective-java-notes-723c1fedf51c Usually, overriding `equals` requires overriding `hashCode` as well, because both `hashCode` and `equals` are used for HashMap key lookup. ---
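The point of the review comment above can be illustrated with a small sketch. The class `FeatureKeySketch` and its fields are hypothetical and stand in for any class used as a `HashMap` key; it is not `hivemall.fm.Feature`. Because `equals` and `hashCode` are overridden together and derived from the same fields, a lookup with an equal but distinct instance finds the stored entry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical key class for illustration only; it is NOT hivemall.fm.Feature.
final class FeatureKeySketch {
    private final String name;
    private final int field;

    FeatureKeySketch(String name, int field) {
        this.name = Objects.requireNonNull(name);
        this.field = field;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof FeatureKeySketch)) {
            return false;
        }
        FeatureKeySketch other = (FeatureKeySketch) o;
        return field == other.field && name.equals(other.name);
    }

    // Must be consistent with equals: equal objects return equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(name, field);
    }

    public static void main(String[] args) {
        Map<FeatureKeySketch, Double> weights = new HashMap<>();
        weights.put(new FeatureKeySketch("price", 1), 0.5d);
        // A distinct but equal instance lands in the same bucket and matches via equals.
        System.out.println(weights.get(new FeatureKeySketch("price", 1))); // prints 0.5
    }
}
```

If `hashCode` were left at the `Object` default, the two equal instances would generally hash to different buckets and the lookup would return `null`.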
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226579427 --- Diff: core/src/main/java/hivemall/mf/CofactorModel.java --- @@ -0,0 +1,715 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.mf; + +import hivemall.annotations.VisibleForTesting; +import hivemall.fm.Feature; +import hivemall.utils.lang.Preconditions; +import hivemall.utils.math.MathUtils; +import it.unimi.dsi.fastutil.objects.Object2DoubleArrayMap; +import it.unimi.dsi.fastutil.objects.Object2DoubleMap; +import org.apache.commons.math3.linear.ArrayRealVector; +import org.apache.commons.math3.linear.Array2DRowRealMatrix; +import org.apache.commons.math3.linear.RealMatrix; +import org.apache.commons.math3.linear.RealVector; +import org.apache.commons.math3.linear.SingularValueDecomposition; +import org.apache.hadoop.hive.ql.metadata.HiveException; + +import javax.annotation.Nonnegative; +import javax.annotation.Nonnull; +import javax.annotation.Nullable; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Random; + + +public class CofactorModel { + +public enum RankInitScheme { +random /* default */, gaussian; + + +@Nonnegative +private float maxInitValue; +@Nonnegative +private double initStdDev; +@Nonnull +public static CofactorModel.RankInitScheme resolve(@Nullable String opt) { +if (opt == null) { +return random; +} else if ("gaussian".equalsIgnoreCase(opt)) { +return gaussian; +} else if ("random".equalsIgnoreCase(opt)) { +return random; +} +return random; +} + +public void setMaxInitValue(float maxInitValue) { +this.maxInitValue = maxInitValue; +} + +public void setInitStdDev(double initStdDev) { +this.initStdDev = initStdDev; +} + + +} + +@Nonnegative +private final int factor; + +// rank matrix initialization +private final RankInitScheme initScheme; + +@Nonnull +private double globalBias; + +// storing trainable latent factors and weights +private final Map theta; +private final Map beta; +private final Object2DoubleMap betaBias; +private final Map gamma; +private final Object2DoubleMap gammaBias; + +private final Random[] randU, randI; + +// hyperparameters +private final float c0, c1; 
+private final float lambdaTheta, lambdaBeta, lambdaGamma; + +// solve +private final RealMatrix B; +private final RealVector A; + +// error message strings +private static final String ARRAY_NOT_SQUARE_ERR = "Array is not square"; +private static final String DIFFERENT_DIMS_ERR = "Matrix, vector or array do not match in size"; + +public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme initScheme, + float c0, float c1, float lambdaTheta, float lambdaBeta, float lambdaGamma) { + +// rank init scheme is gaussian +// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98 +this.factor = factor; +this.initScheme = initScheme; +this.globalBias = 0.d; +this.lambdaTheta = lambdaTheta; +this.lambdaBeta = lambdaBeta; +this.lambdaGamma = lambdaGamma; + +this.theta = new HashMap<>(); +this.beta = new HashMap<>(); +this.betaBias = new Object2DoubleArrayMap<>(); +this.betaBias.defaultReturnValue(0.d)
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226578495 --- Diff: core/src/main/java/hivemall/mf/CofactorModel.java --- @@ -0,0 +1,715 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.mf; + +import hivemall.annotations.VisibleForTesting; +import hivemall.fm.Feature; +import hivemall.utils.lang.Preconditions; +import hivemall.utils.math.MathUtils; +import it.unimi.dsi.fastutil.objects.Object2DoubleArrayMap; +import it.unimi.dsi.fastutil.objects.Object2DoubleMap; +import org.apache.commons.math3.linear.ArrayRealVector; +import org.apache.commons.math3.linear.Array2DRowRealMatrix; +import org.apache.commons.math3.linear.RealMatrix; +import org.apache.commons.math3.linear.RealVector; +import org.apache.commons.math3.linear.SingularValueDecomposition; +import org.apache.hadoop.hive.ql.metadata.HiveException; + +import javax.annotation.Nonnegative; +import javax.annotation.Nonnull; +import javax.annotation.Nullable; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Random; + + +public class CofactorModel { + +public enum RankInitScheme { +random /* default */, gaussian; + --- End diff -- Please remove unnecessary line breaks. ---
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226579051 --- Diff: core/src/main/java/hivemall/mf/CofactorModel.java --- @@ -0,0 +1,715 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.mf; + +import hivemall.annotations.VisibleForTesting; +import hivemall.fm.Feature; +import hivemall.utils.lang.Preconditions; +import hivemall.utils.math.MathUtils; +import it.unimi.dsi.fastutil.objects.Object2DoubleArrayMap; +import it.unimi.dsi.fastutil.objects.Object2DoubleMap; +import org.apache.commons.math3.linear.ArrayRealVector; +import org.apache.commons.math3.linear.Array2DRowRealMatrix; +import org.apache.commons.math3.linear.RealMatrix; +import org.apache.commons.math3.linear.RealVector; +import org.apache.commons.math3.linear.SingularValueDecomposition; +import org.apache.hadoop.hive.ql.metadata.HiveException; + +import javax.annotation.Nonnegative; +import javax.annotation.Nonnull; +import javax.annotation.Nullable; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Random; + + +public class CofactorModel { + +public enum RankInitScheme { +random /* default */, gaussian; + + +@Nonnegative +private float maxInitValue; +@Nonnegative +private double initStdDev; +@Nonnull +public static CofactorModel.RankInitScheme resolve(@Nullable String opt) { +if (opt == null) { +return random; +} else if ("gaussian".equalsIgnoreCase(opt)) { +return gaussian; +} else if ("random".equalsIgnoreCase(opt)) { +return random; +} +return random; +} + +public void setMaxInitValue(float maxInitValue) { +this.maxInitValue = maxInitValue; +} + +public void setInitStdDev(double initStdDev) { +this.initStdDev = initStdDev; +} + + +} + +@Nonnegative +private final int factor; + +// rank matrix initialization +private final RankInitScheme initScheme; + +@Nonnull +private double globalBias; + +// storing trainable latent factors and weights +private final Map theta; +private final Map beta; +private final Object2DoubleMap betaBias; +private final Map gamma; +private final Object2DoubleMap gammaBias; + +private final Random[] randU, randI; + +// hyperparameters +private final float c0, c1; 
+private final float lambdaTheta, lambdaBeta, lambdaGamma; + +// solve +private final RealMatrix B; +private final RealVector A; + +// error message strings +private static final String ARRAY_NOT_SQUARE_ERR = "Array is not square"; +private static final String DIFFERENT_DIMS_ERR = "Matrix, vector or array do not match in size"; + +public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme initScheme, + float c0, float c1, float lambdaTheta, float lambdaBeta, float lambdaGamma) { + +// rank init scheme is gaussian +// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98 +this.factor = factor; +this.initScheme = initScheme; +this.globalBias = 0.d; +this.lambdaTheta = lambdaTheta; +this.lambdaBeta = lambdaBeta; +this.lambdaGamma = lambdaGamma; + +this.theta = new HashMap<>(); +this.beta = new HashMap<>(); +this.betaBias = new Object2DoubleArrayMap<>(); +this.betaBias.defaultReturnValue(0.d)
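For readers following the quoted diff: the `resolve` method above returns `random` for a null or unrecognized option and `gaussian` only for a case-insensitive match. A minimal, self-contained sketch of that lookup follows. The wrapper class `RankInitSketch` is hypothetical and the enum here is a stand-in, not the PR's `CofactorModel.RankInitScheme`; the branches are collapsed on the assumption that every non-`gaussian` input should fall back to the default.

```java
public class RankInitSketch {

    /** Stand-in for the enum quoted in the diff; only the option lookup is shown. */
    enum RankInitScheme {
        random /* default */, gaussian;

        /** Returns gaussian on a case-insensitive match, otherwise the default (random). */
        static RankInitScheme resolve(String opt) {
            // String.equalsIgnoreCase(null) is false, so null falls through to the default.
            if ("gaussian".equalsIgnoreCase(opt)) {
                return gaussian;
            }
            return random;
        }
    }

    public static void main(String[] args) {
        System.out.println(RankInitScheme.resolve("Gaussian"));
        System.out.println(RankInitScheme.resolve(null));
        System.out.println(RankInitScheme.resolve("unknown"));
    }
}
```

Whether an unrecognized option should silently fall back or be rejected is a design choice for the PR author; the sketch only mirrors the fallback shown in the diff.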
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226578854 --- Diff: core/src/main/java/hivemall/mf/CofactorModel.java --- @@ -0,0 +1,715 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.mf; + +import hivemall.annotations.VisibleForTesting; +import hivemall.fm.Feature; +import hivemall.utils.lang.Preconditions; +import hivemall.utils.math.MathUtils; +import it.unimi.dsi.fastutil.objects.Object2DoubleArrayMap; +import it.unimi.dsi.fastutil.objects.Object2DoubleMap; +import org.apache.commons.math3.linear.ArrayRealVector; +import org.apache.commons.math3.linear.Array2DRowRealMatrix; +import org.apache.commons.math3.linear.RealMatrix; +import org.apache.commons.math3.linear.RealVector; +import org.apache.commons.math3.linear.SingularValueDecomposition; +import org.apache.hadoop.hive.ql.metadata.HiveException; + +import javax.annotation.Nonnegative; +import javax.annotation.Nonnull; +import javax.annotation.Nullable; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Random; + + +public class CofactorModel { + +public enum RankInitScheme { +random /* default */, gaussian; + + +@Nonnegative +private float maxInitValue; +@Nonnegative +private double initStdDev; +@Nonnull +public static CofactorModel.RankInitScheme resolve(@Nullable String opt) { +if (opt == null) { +return random; +} else if ("gaussian".equalsIgnoreCase(opt)) { +return gaussian; +} else if ("random".equalsIgnoreCase(opt)) { +return random; +} +return random; +} + +public void setMaxInitValue(float maxInitValue) { +this.maxInitValue = maxInitValue; +} + +public void setInitStdDev(double initStdDev) { +this.initStdDev = initStdDev; +} + + +} + +@Nonnegative +private final int factor; + +// rank matrix initialization +private final RankInitScheme initScheme; + +@Nonnull +private double globalBias; + +// storing trainable latent factors and weights +private final Map theta; +private final Map beta; +private final Object2DoubleMap betaBias; +private final Map gamma; +private final Object2DoubleMap gammaBias; + +private final Random[] randU, randI; + +// hyperparameters +private final float c0, c1; 
+private final float lambdaTheta, lambdaBeta, lambdaGamma; + +// solve +private final RealMatrix B; +private final RealVector A; + +// error message strings +private static final String ARRAY_NOT_SQUARE_ERR = "Array is not square"; +private static final String DIFFERENT_DIMS_ERR = "Matrix, vector or array do not match in size"; + +public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme initScheme, + float c0, float c1, float lambdaTheta, float lambdaBeta, float lambdaGamma) { + +// rank init scheme is gaussian +// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98 +this.factor = factor; +this.initScheme = initScheme; +this.globalBias = 0.d; +this.lambdaTheta = lambdaTheta; +this.lambdaBeta = lambdaBeta; +this.lambdaGamma = lambdaGamma; + +this.theta = new HashMap<>(); +this.beta = new HashMap<>(); +this.betaBias = new Object2DoubleArrayMap<>(); +this.betaBias.defaultReturnValue(0.d)
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226578153 --- Diff: core/src/main/java/hivemall/fm/Feature.java --- @@ -383,4 +383,10 @@ public static void l2normalize(@Nonnull final Feature[] features) { } } +@Override --- End diff -- Why is this `equals` method required? I assume it is not used. ---
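As general Java background for the question above (a sketch, not the PR's code): overriding `equals` only has an effect where instances are compared by value, for example as elements of a `HashSet` or keys of a `HashMap`, and in that case `hashCode` must be overridden consistently with it. The class `Key` below is hypothetical and is not `hivemall.fm.Feature`; whether the PR actually needs value equality on `Feature` is not stated in this thread.

```java
import java.util.HashSet;
import java.util.Set;

public class EqualsSketch {

    /** Hypothetical value class used only for illustration. */
    static final class Key {
        private final String name;

        Key(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof Key)) {
                return false;
            }
            return name.equals(((Key) o).name);
        }

        // Must agree with equals: equal objects return equal hash codes.
        @Override
        public int hashCode() {
            return name.hashCode();
        }
    }

    /** Looks up a distinct but equal instance in a hash-based set. */
    static boolean foundInSet() {
        Set<Key> seen = new HashSet<>();
        seen.add(new Key("item1"));
        return seen.contains(new Key("item1"));
    }

    public static void main(String[] args) {
        System.out.println(foundInSet());
    }
}
```

If nothing compares instances by value or stores them in hash-based collections, the override is dead code and can be dropped, which is what the review comment suggests checking.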
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226525857 --- Diff: core/src/main/java/hivemall/mf/CofactorizationUDTF.java --- @@ -0,0 +1,574 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.mf; + +import hivemall.UDTFWithOptions; +import hivemall.common.ConversionState; +import hivemall.fm.Feature; +import hivemall.fm.StringFeature; +import hivemall.utils.hadoop.HiveUtils; +import hivemall.utils.io.FileUtils; +import hivemall.utils.io.NioStatefulSegment; +import hivemall.utils.lang.NumberUtils; +import hivemall.utils.lang.Primitives; +import hivemall.utils.lang.SizeOf; +import org.apache.commons.cli.CommandLine; +import org.apache.commons.cli.Options; +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.hive.ql.exec.UDFArgumentException; +import org.apache.hadoop.hive.ql.metadata.HiveException; +import org.apache.hadoop.hive.serde2.objectinspector.*; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.BooleanObjectInspector; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector; +import org.apache.hadoop.mapred.Counters; +import org.apache.hadoop.mapred.Reporter; + +import javax.annotation.Nonnull; +import javax.annotation.Nullable; +import java.io.File; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.ArrayList; +import java.util.List; + +import static hivemall.utils.lang.Primitives.FALSE_BYTE; +import static hivemall.utils.lang.Primitives.TRUE_BYTE; + +public class CofactorizationUDTF extends UDTFWithOptions { +private static final Log LOG = LogFactory.getLog(CofactorizationUDTF.class); + +// Option variables +// The number of latent factors +protected int factor; +// The scaling hyperparameter for zero entries in the rank matrix +protected float scale_zero; +// The scaling hyperparameter for non-zero entries in the rank matrix +protected float scale_nonzero; +// The preferred size of the miniBatch for training +protected int batchSize; +// The initial mean rating +protected float globalBias; +// 
Whether update (and return) the mean rating or not +protected boolean updateGlobalBias; +// The number of iterations +protected int maxIters; +// Whether to use bias clause +protected boolean useBiasClause; +// Whether to use normalization +protected boolean useL2Norm; +// regularization hyperparameters +protected float lambdaTheta; +protected float lambdaBeta; +protected float lambdaGamma; + +// Initialization strategy of rank matrix +protected CofactorModel.RankInitScheme rankInit; + +// Model itself +protected CofactorModel model; +protected int numItems; + +// Variable managing status of learning + +// The number of processed training examples +protected long count; + +protected ConversionState cvState; +private ConversionState validationState; + +// Input OIs and Context +protected StringObjectInspector contextOI; +protected ListObjectInspector featuresOI; +protected BooleanObjectInspector isItemOI; +protected ListObjectInspector sppmiOI; + +// Used for iterations +protected NioStatefulSegment fileIO; +protected ByteBuffer inputBuf; +private long lastWritePos; + +private Feature contextProbe; +private Feature[] featuresProbe; +private Feature[] sppmiProbe; +private boolean isItemProbe; +private long numValidations; +private long numTraining; +
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226247247 --- Diff: core/src/main/java/hivemall/mf/CofactorModel.java --- @@ -0,0 +1,640 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.mf; + +import hivemall.fm.Feature; +import hivemall.utils.math.MathUtils; +import hivemall.utils.math.MatrixUtils; +import it.unimi.dsi.fastutil.objects.Object2DoubleArrayMap; +import it.unimi.dsi.fastutil.objects.Object2DoubleMap; +import org.apache.commons.math3.linear.ArrayRealVector; +import org.apache.commons.math3.linear.Array2DRowRealMatrix; +import org.apache.commons.math3.linear.RealMatrix; +import org.apache.commons.math3.linear.RealVector; +import org.apache.commons.math3.linear.SingularValueDecomposition; + +import javax.annotation.Nonnegative; +import javax.annotation.Nonnull; +import javax.annotation.Nullable; +import java.util.*; + +public class CofactorModel { + +public enum RankInitScheme { +random /* default */, gaussian; + +@Nonnegative +protected float maxInitValue; +@Nonnegative +protected double initStdDev; + +@Nonnull +public static CofactorModel.RankInitScheme resolve(@Nullable String opt) { +if (opt == null) { +return random; +} else if ("gaussian".equalsIgnoreCase(opt)) { +return gaussian; +} else if ("random".equalsIgnoreCase(opt)) { +return random; +} +return random; +} + +public void setMaxInitValue(float maxInitValue) { +this.maxInitValue = maxInitValue; +} + +public void setInitStdDev(double initStdDev) { +this.initStdDev = initStdDev; +} + +} + +private static final int EXPECTED_SIZE = 136861; +@Nonnegative +protected final int factor; + +// rank matrix initialization +protected final RankInitScheme initScheme; + +@Nonnull +private double globalBias; + +// storing trainable latent factors and weights +private Map theta; +private Map beta; +private Object2DoubleMap betaBias; +private Map gamma; +private Object2DoubleMap gammaBias; + +// precomputed identity matrix +private RealMatrix identity; + +protected final Random[] randU, randI; + +// hyperparameters +private final float c0, c1; +private final float lambdaTheta, lambdaBeta, lambdaGamma; + +public CofactorModel(@Nonnegative int factor, @Nonnull 
RankInitScheme initScheme, + @Nonnull float c0, @Nonnull float c1, float lambdaTheta, + float lambdaBeta, float lambdaGamma) { + +// rank init scheme is gaussian +// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98 +this.factor = factor; +this.initScheme = initScheme; +this.globalBias = 0.d; +this.lambdaTheta = lambdaTheta; +this.lambdaBeta = lambdaBeta; +this.lambdaGamma = lambdaGamma; + +this.theta = new HashMap<>(); +this.beta = new HashMap<>(); +this.betaBias = new Object2DoubleArrayMap<>(); +this.gamma = new HashMap<>(); +this.gammaBias = new Object2DoubleArrayMap<>(); + +this.randU = newRandoms(factor, 31L); +this.randI = newRandoms(factor, 41L); + +checkHyperparameterC(c0); +checkHyperparameterC(c1); +this.c0 = c0; +this.c1 = c1; + +} + +private void initFactorVector(String key, Map weights)
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226243032 --- Diff: core/src/main/java/hivemall/mf/CofactorModel.java --- @@ -0,0 +1,638 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.mf; + +import hivemall.fm.Feature; +import hivemall.utils.math.MathUtils; +import hivemall.utils.math.MatrixUtils; +import org.apache.commons.math3.linear.ArrayRealVector; +import org.apache.commons.math3.linear.Array2DRowRealMatrix; +import org.apache.commons.math3.linear.RealMatrix; +import org.apache.commons.math3.linear.RealVector; +import org.apache.commons.math3.linear.SingularValueDecomposition; + +import javax.annotation.Nonnegative; +import javax.annotation.Nonnull; +import javax.annotation.Nullable; +import java.util.*; + +public class CofactorModel { + +public enum RankInitScheme { +random /* default */, gaussian; + +@Nonnegative +protected float maxInitValue; +@Nonnegative +protected double initStdDev; + +@Nonnull +public static CofactorModel.RankInitScheme resolve(@Nullable String opt) { +if (opt == null) { +return random; +} else if ("gaussian".equalsIgnoreCase(opt)) { +return gaussian; +} else if ("random".equalsIgnoreCase(opt)) { +return random; +} +return random; +} + +public void setMaxInitValue(float maxInitValue) { +this.maxInitValue = maxInitValue; +} + +public void setInitStdDev(double initStdDev) { +this.initStdDev = initStdDev; +} + +} + +private static final int EXPECTED_SIZE = 136861; +@Nonnegative +protected final int factor; + +// rank matrix initialization +protected final RankInitScheme initScheme; + +@Nonnull +private double globalBias; + +// storing trainable latent factors and weights +private Map theta; +private Map beta; +private Map betaBias; +private Map gamma; +private Map gammaBias; + +// precomputed identity matrix +private RealMatrix identity; + +protected final Random[] randU, randI; + +// hyperparameters +private final float c0, c1; +private final float lambdaTheta, lambdaBeta, lambdaGamma; + +public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme initScheme, + @Nonnull float c0, @Nonnull float c1, float lambdaTheta, + float lambdaBeta, float lambdaGamma) { + +// rank init scheme is 
gaussian +// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98 +this.factor = factor; +this.initScheme = initScheme; +this.globalBias = 0.d; +this.lambdaTheta = lambdaTheta; +this.lambdaBeta = lambdaBeta; +this.lambdaGamma = lambdaGamma; + +this.theta = new HashMap<>(); +this.beta = new HashMap<>(); +this.betaBias = new HashMap<>(); +this.gamma = new HashMap<>(); +this.gammaBias = new HashMap<>(); + +this.randU = newRandoms(factor, 31L); +this.randI = newRandoms(factor, 41L); + +checkHyperparameterC(c0); +checkHyperparameterC(c1); +this.c0 = c0; +this.c1 = c1; + +} + +private void initFactorVector(String key, Map weights) { +if (weights.containsKey(key)) { +return; +} +RealVector v = new ArrayRealVector(factor); +switch (initScheme) { +case random:
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226241124 --- Diff: core/src/main/java/hivemall/mf/CofactorizationUDTF.java --- @@ -0,0 +1,574 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.mf; + +import hivemall.UDTFWithOptions; +import hivemall.common.ConversionState; +import hivemall.fm.Feature; +import hivemall.fm.StringFeature; +import hivemall.utils.hadoop.HiveUtils; +import hivemall.utils.io.FileUtils; +import hivemall.utils.io.NioStatefulSegment; +import hivemall.utils.lang.NumberUtils; +import hivemall.utils.lang.Primitives; +import hivemall.utils.lang.SizeOf; +import org.apache.commons.cli.CommandLine; +import org.apache.commons.cli.Options; +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.hive.ql.exec.UDFArgumentException; +import org.apache.hadoop.hive.ql.metadata.HiveException; +import org.apache.hadoop.hive.serde2.objectinspector.*; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.BooleanObjectInspector; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector; +import org.apache.hadoop.mapred.Counters; +import org.apache.hadoop.mapred.Reporter; + +import javax.annotation.Nonnull; +import javax.annotation.Nullable; +import java.io.File; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.ArrayList; +import java.util.List; + +import static hivemall.utils.lang.Primitives.FALSE_BYTE; +import static hivemall.utils.lang.Primitives.TRUE_BYTE; + +public class CofactorizationUDTF extends UDTFWithOptions { +private static final Log LOG = LogFactory.getLog(CofactorizationUDTF.class); + +// Option variables +// The number of latent factors +protected int factor; +// The scaling hyperparameter for zero entries in the rank matrix +protected float scale_zero; +// The scaling hyperparameter for non-zero entries in the rank matrix +protected float scale_nonzero; +// The preferred size of the miniBatch for training +protected int batchSize; +// The initial mean rating +protected float globalBias; +// 
Whether update (and return) the mean rating or not +protected boolean updateGlobalBias; +// The number of iterations +protected int maxIters; +// Whether to use bias clause +protected boolean useBiasClause; +// Whether to use normalization +protected boolean useL2Norm; +// regularization hyperparameters +protected float lambdaTheta; +protected float lambdaBeta; +protected float lambdaGamma; + +// Initialization strategy of rank matrix +protected CofactorModel.RankInitScheme rankInit; + +// Model itself +protected CofactorModel model; +protected int numItems; + +// Variable managing status of learning + +// The number of processed training examples +protected long count; + +protected ConversionState cvState; +private ConversionState validationState; + +// Input OIs and Context +protected StringObjectInspector contextOI; +protected ListObjectInspector featuresOI; +protected BooleanObjectInspector isItemOI; +protected ListObjectInspector sppmiOI; + +// Used for iterations +protected NioStatefulSegment fileIO; +protected ByteBuffer inputBuf; +private long lastWritePos; + +private Feature contextProbe; +private Feature[] featuresProbe; +private Feature[] sppmiProbe; +private boolean isItemProbe; +private long numValidations; +private long numTraining; +
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226237654 --- Diff: core/src/main/java/hivemall/mf/CofactorModel.java --- @@ -0,0 +1,629 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.mf; + +import hivemall.fm.Feature; +import hivemall.utils.math.MathUtils; +import hivemall.utils.math.MatrixUtils; +import org.apache.commons.math3.linear.ArrayRealVector; +import org.apache.commons.math3.linear.Array2DRowRealMatrix; +import org.apache.commons.math3.linear.RealMatrix; +import org.apache.commons.math3.linear.RealVector; +import org.apache.commons.math3.linear.SingularValueDecomposition; + +import javax.annotation.Nonnegative; +import javax.annotation.Nonnull; +import javax.annotation.Nullable; +import java.util.*; + +public class CofactorModel { + +public enum RankInitScheme { +random /* default */, gaussian; + +@Nonnegative +protected float maxInitValue; +@Nonnegative +protected double initStdDev; + +@Nonnull +public static CofactorModel.RankInitScheme resolve(@Nullable String opt) { +if (opt == null) { +return random; +} else if ("gaussian".equalsIgnoreCase(opt)) { +return gaussian; +} else if ("random".equalsIgnoreCase(opt)) { +return random; +} +return random; +} + +public void setMaxInitValue(float maxInitValue) { +this.maxInitValue = maxInitValue; +} + +public void setInitStdDev(double initStdDev) { +this.initStdDev = initStdDev; +} + +} + +private static final int EXPECTED_SIZE = 136861; +@Nonnegative +protected final int factor; + +// rank matrix initialization +protected final RankInitScheme initScheme; + +@Nonnull +private double globalBias; + +// storing trainable latent factors and weights +private Map theta; +private Map beta; +private Map betaBias; +private Map gamma; +private Map gammaBias; + +// precomputed identity matrix +private RealMatrix identity; + +protected final Random[] randU, randI; + +// hyperparameters +private final float c0, c1; +private final float lambdaTheta, lambdaBeta, lambdaGamma; + +public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme initScheme, + @Nonnull float c0, @Nonnull float c1, float lambdaTheta, + float lambdaBeta, float lambdaGamma) { + +// rank init scheme is 
gaussian +// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98 +this.factor = factor; +this.initScheme = initScheme; +this.globalBias = 0.d; +this.lambdaTheta = lambdaTheta; +this.lambdaBeta = lambdaBeta; +this.lambdaGamma = lambdaGamma; + +this.theta = new HashMap<>(); +this.beta = new HashMap<>(); +this.betaBias = new HashMap<>(); +this.gamma = new HashMap<>(); +this.gammaBias = new HashMap<>(); + +this.randU = newRandoms(factor, 31L); +this.randI = newRandoms(factor, 41L); + +checkHyperparameterC(c0); +checkHyperparameterC(c1); +this.c0 = c0; +this.c1 = c1; + +} + +private void initFactorVector(String key, Map weights) { +if (weights.containsKey(key)) { +return; +} +RealVector v = new ArrayRealVector(factor); +switch (initScheme) { +case random:
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226239653 --- Diff: core/src/main/java/hivemall/mf/CofactorModel.java --- @@ -0,0 +1,629 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.mf; + +import hivemall.fm.Feature; +import hivemall.utils.math.MathUtils; +import hivemall.utils.math.MatrixUtils; +import org.apache.commons.math3.linear.ArrayRealVector; +import org.apache.commons.math3.linear.Array2DRowRealMatrix; +import org.apache.commons.math3.linear.RealMatrix; +import org.apache.commons.math3.linear.RealVector; +import org.apache.commons.math3.linear.SingularValueDecomposition; + +import javax.annotation.Nonnegative; +import javax.annotation.Nonnull; +import javax.annotation.Nullable; +import java.util.*; + +public class CofactorModel { + +public enum RankInitScheme { +random /* default */, gaussian; + +@Nonnegative +protected float maxInitValue; +@Nonnegative +protected double initStdDev; + +@Nonnull +public static CofactorModel.RankInitScheme resolve(@Nullable String opt) { +if (opt == null) { +return random; +} else if ("gaussian".equalsIgnoreCase(opt)) { +return gaussian; +} else if ("random".equalsIgnoreCase(opt)) { +return random; +} +return random; +} + +public void setMaxInitValue(float maxInitValue) { +this.maxInitValue = maxInitValue; +} + +public void setInitStdDev(double initStdDev) { +this.initStdDev = initStdDev; +} + +} + +private static final int EXPECTED_SIZE = 136861; +@Nonnegative +protected final int factor; + +// rank matrix initialization +protected final RankInitScheme initScheme; + +@Nonnull +private double globalBias; + +// storing trainable latent factors and weights +private Map theta; +private Map beta; +private Map betaBias; +private Map gamma; +private Map gammaBias; + +// precomputed identity matrix +private RealMatrix identity; + +protected final Random[] randU, randI; + +// hyperparameters +private final float c0, c1; +private final float lambdaTheta, lambdaBeta, lambdaGamma; + +public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme initScheme, + @Nonnull float c0, @Nonnull float c1, float lambdaTheta, + float lambdaBeta, float lambdaGamma) { + +// rank init scheme is 
gaussian +// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98 +this.factor = factor; +this.initScheme = initScheme; +this.globalBias = 0.d; +this.lambdaTheta = lambdaTheta; +this.lambdaBeta = lambdaBeta; +this.lambdaGamma = lambdaGamma; + +this.theta = new HashMap<>(); +this.beta = new HashMap<>(); +this.betaBias = new HashMap<>(); +this.gamma = new HashMap<>(); +this.gammaBias = new HashMap<>(); + +this.randU = newRandoms(factor, 31L); +this.randI = newRandoms(factor, 41L); + +checkHyperparameterC(c0); +checkHyperparameterC(c1); +this.c0 = c0; +this.c1 = c1; + +} + +private void initFactorVector(String key, Map weights) { +if (weights.containsKey(key)) { +return; +} +RealVector v = new ArrayRealVector(factor); +switch (initScheme) { +case random:
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226239017 --- Diff: core/src/main/java/hivemall/mf/CofactorModel.java --- @@ -0,0 +1,629 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.mf; + +import hivemall.fm.Feature; +import hivemall.utils.math.MathUtils; +import hivemall.utils.math.MatrixUtils; +import org.apache.commons.math3.linear.ArrayRealVector; +import org.apache.commons.math3.linear.Array2DRowRealMatrix; +import org.apache.commons.math3.linear.RealMatrix; +import org.apache.commons.math3.linear.RealVector; +import org.apache.commons.math3.linear.SingularValueDecomposition; + +import javax.annotation.Nonnegative; +import javax.annotation.Nonnull; +import javax.annotation.Nullable; +import java.util.*; + +public class CofactorModel { + +public enum RankInitScheme { +random /* default */, gaussian; + +@Nonnegative +protected float maxInitValue; +@Nonnegative +protected double initStdDev; + +@Nonnull +public static CofactorModel.RankInitScheme resolve(@Nullable String opt) { +if (opt == null) { +return random; +} else if ("gaussian".equalsIgnoreCase(opt)) { +return gaussian; +} else if ("random".equalsIgnoreCase(opt)) { +return random; +} +return random; +} + +public void setMaxInitValue(float maxInitValue) { +this.maxInitValue = maxInitValue; +} + +public void setInitStdDev(double initStdDev) { +this.initStdDev = initStdDev; +} + +} + +private static final int EXPECTED_SIZE = 136861; +@Nonnegative +protected final int factor; + +// rank matrix initialization +protected final RankInitScheme initScheme; + +@Nonnull +private double globalBias; + +// storing trainable latent factors and weights +private Map theta; +private Map beta; +private Map betaBias; +private Map gamma; +private Map gammaBias; + +// precomputed identity matrix +private RealMatrix identity; + +protected final Random[] randU, randI; + +// hyperparameters +private final float c0, c1; +private final float lambdaTheta, lambdaBeta, lambdaGamma; + +public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme initScheme, + @Nonnull float c0, @Nonnull float c1, float lambdaTheta, + float lambdaBeta, float lambdaGamma) { + +// rank init scheme is 
gaussian +// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98 +this.factor = factor; +this.initScheme = initScheme; +this.globalBias = 0.d; +this.lambdaTheta = lambdaTheta; +this.lambdaBeta = lambdaBeta; +this.lambdaGamma = lambdaGamma; + +this.theta = new HashMap<>(); +this.beta = new HashMap<>(); +this.betaBias = new HashMap<>(); +this.gamma = new HashMap<>(); +this.gammaBias = new HashMap<>(); + +this.randU = newRandoms(factor, 31L); +this.randI = newRandoms(factor, 41L); + +checkHyperparameterC(c0); +checkHyperparameterC(c1); +this.c0 = c0; +this.c1 = c1; + +} + +private void initFactorVector(String key, Map weights) { +if (weights.containsKey(key)) { +return; +} +RealVector v = new ArrayRealVector(factor); +switch (initScheme) { +case random:
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226204201 --- Diff: core/src/main/java/hivemall/mf/CofactorModel.java --- @@ -0,0 +1,629 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.mf; + +import hivemall.fm.Feature; +import hivemall.utils.math.MathUtils; +import hivemall.utils.math.MatrixUtils; +import org.apache.commons.math3.linear.ArrayRealVector; +import org.apache.commons.math3.linear.Array2DRowRealMatrix; +import org.apache.commons.math3.linear.RealMatrix; +import org.apache.commons.math3.linear.RealVector; +import org.apache.commons.math3.linear.SingularValueDecomposition; + +import javax.annotation.Nonnegative; +import javax.annotation.Nonnull; +import javax.annotation.Nullable; +import java.util.*; + +public class CofactorModel { + +public enum RankInitScheme { +random /* default */, gaussian; + +@Nonnegative +protected float maxInitValue; +@Nonnegative +protected double initStdDev; + +@Nonnull +public static CofactorModel.RankInitScheme resolve(@Nullable String opt) { +if (opt == null) { +return random; +} else if ("gaussian".equalsIgnoreCase(opt)) { +return gaussian; +} else if ("random".equalsIgnoreCase(opt)) { +return random; +} +return random; +} + +public void setMaxInitValue(float maxInitValue) { +this.maxInitValue = maxInitValue; +} + +public void setInitStdDev(double initStdDev) { +this.initStdDev = initStdDev; +} + +} + +private static final int EXPECTED_SIZE = 136861; +@Nonnegative +protected final int factor; + +// rank matrix initialization +protected final RankInitScheme initScheme; + +@Nonnull +private double globalBias; + +// storing trainable latent factors and weights +private Map theta; +private Map beta; +private Map betaBias; +private Map gamma; +private Map gammaBias; + +// precomputed identity matrix +private RealMatrix identity; + +protected final Random[] randU, randI; + +// hyperparameters +private final float c0, c1; +private final float lambdaTheta, lambdaBeta, lambdaGamma; + +public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme initScheme, + @Nonnull float c0, @Nonnull float c1, float lambdaTheta, + float lambdaBeta, float lambdaGamma) { + +// rank init scheme is 
gaussian +// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98 +this.factor = factor; +this.initScheme = initScheme; +this.globalBias = 0.d; +this.lambdaTheta = lambdaTheta; +this.lambdaBeta = lambdaBeta; +this.lambdaGamma = lambdaGamma; + +this.theta = new HashMap<>(); +this.beta = new HashMap<>(); +this.betaBias = new HashMap<>(); +this.gamma = new HashMap<>(); +this.gammaBias = new HashMap<>(); + +this.randU = newRandoms(factor, 31L); +this.randI = newRandoms(factor, 41L); + +checkHyperparameterC(c0); +checkHyperparameterC(c1); +this.c0 = c0; +this.c1 = c1; + +} + +private void initFactorVector(String key, Map weights) { +if (weights.containsKey(key)) { +return; +} +RealVector v = new ArrayRealVector(factor); --- End diff -- ``` final double[] v =
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226202891 --- Diff: core/src/main/java/hivemall/mf/CofactorModel.java --- @@ -0,0 +1,629 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.mf; + +import hivemall.fm.Feature; +import hivemall.utils.math.MathUtils; +import hivemall.utils.math.MatrixUtils; +import org.apache.commons.math3.linear.ArrayRealVector; +import org.apache.commons.math3.linear.Array2DRowRealMatrix; +import org.apache.commons.math3.linear.RealMatrix; +import org.apache.commons.math3.linear.RealVector; +import org.apache.commons.math3.linear.SingularValueDecomposition; + +import javax.annotation.Nonnegative; +import javax.annotation.Nonnull; +import javax.annotation.Nullable; +import java.util.*; + +public class CofactorModel { + +public enum RankInitScheme { +random /* default */, gaussian; + +@Nonnegative +protected float maxInitValue; +@Nonnegative +protected double initStdDev; + +@Nonnull +public static CofactorModel.RankInitScheme resolve(@Nullable String opt) { +if (opt == null) { +return random; +} else if ("gaussian".equalsIgnoreCase(opt)) { +return gaussian; +} else if ("random".equalsIgnoreCase(opt)) { +return random; +} +return random; +} + +public void setMaxInitValue(float maxInitValue) { +this.maxInitValue = maxInitValue; +} + +public void setInitStdDev(double initStdDev) { +this.initStdDev = initStdDev; +} + +} + +private static final int EXPECTED_SIZE = 136861; +@Nonnegative +protected final int factor; + +// rank matrix initialization +protected final RankInitScheme initScheme; + +@Nonnull +private double globalBias; + +// storing trainable latent factors and weights +private Map theta; +private Map beta; +private Map betaBias; +private Map gamma; +private Map gammaBias; + +// precomputed identity matrix +private RealMatrix identity; + +protected final Random[] randU, randI; + +// hyperparameters +private final float c0, c1; +private final float lambdaTheta, lambdaBeta, lambdaGamma; + +public CofactorModel(@Nonnegative int factor, @Nonnull RankInitScheme initScheme, + @Nonnull float c0, @Nonnull float c1, float lambdaTheta, + float lambdaBeta, float lambdaGamma) { + +// rank init scheme is 
gaussian +// https://github.com/dawenl/cofactor/blob/master/src/cofacto.py#L98 +this.factor = factor; +this.initScheme = initScheme; +this.globalBias = 0.d; +this.lambdaTheta = lambdaTheta; +this.lambdaBeta = lambdaBeta; +this.lambdaGamma = lambdaGamma; + +this.theta = new HashMap<>(); +this.beta = new HashMap<>(); +this.betaBias = new HashMap<>(); +this.gamma = new HashMap<>(); +this.gammaBias = new HashMap<>(); + +this.randU = newRandoms(factor, 31L); +this.randI = newRandoms(factor, 41L); + +checkHyperparameterC(c0); +checkHyperparameterC(c1); +this.c0 = c0; +this.c1 = c1; + +} + +private void initFactorVector(String key, Map weights) { +if (weights.containsKey(key)) { +return; +} +RealVector v = new ArrayRealVector(factor); +switch (initScheme) { +case random:
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226198983 --- Diff: core/src/main/java/hivemall/mf/FactorizedModel.java --- @@ -30,25 +30,25 @@ import javax.annotation.concurrent.NotThreadSafe; @NotThreadSafe -public final class FactorizedModel { +public class FactorizedModel { --- End diff -- It seems FactorizedModel is not used in Cofactor. Is this change required? Revert if not used. ---
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226199747 --- Diff: core/src/main/java/hivemall/mf/CofactorModel.java --- @@ -0,0 +1,629 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.mf; + +import hivemall.fm.Feature; +import hivemall.utils.math.MathUtils; +import hivemall.utils.math.MatrixUtils; +import org.apache.commons.math3.linear.ArrayRealVector; +import org.apache.commons.math3.linear.Array2DRowRealMatrix; +import org.apache.commons.math3.linear.RealMatrix; +import org.apache.commons.math3.linear.RealVector; +import org.apache.commons.math3.linear.SingularValueDecomposition; + +import javax.annotation.Nonnegative; +import javax.annotation.Nonnull; +import javax.annotation.Nullable; +import java.util.*; + +public class CofactorModel { + +public enum RankInitScheme { +random /* default */, gaussian; + +@Nonnegative +protected float maxInitValue; +@Nonnegative +protected double initStdDev; + +@Nonnull +public static CofactorModel.RankInitScheme resolve(@Nullable String opt) { +if (opt == null) { +return random; +} else if ("gaussian".equalsIgnoreCase(opt)) { +return gaussian; +} else if ("random".equalsIgnoreCase(opt)) { +return random; +} +return random; +} + +public void setMaxInitValue(float maxInitValue) { +this.maxInitValue = maxInitValue; +} + +public void setInitStdDev(double initStdDev) { +this.initStdDev = initStdDev; +} + +} + +private static final int EXPECTED_SIZE = 136861; +@Nonnegative +protected final int factor; + +// rank matrix initialization +protected final RankInitScheme initScheme; + +@Nonnull +private double globalBias; + +// storing trainable latent factors and weights +private Map theta; +private Map beta; +private Map betaBias; +private Map gamma; +private Map gammaBias; --- End diff -- Please use `Object2DoubleMap gammaBias` instead to reduce memory consumption. ---
[GitHub] incubator-hivemall pull request #167: [HIVEMALL-220] Implement Cofactor
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/167#discussion_r226199666 --- Diff: core/src/main/java/hivemall/mf/CofactorModel.java --- @@ -0,0 +1,629 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.mf; + +import hivemall.fm.Feature; +import hivemall.utils.math.MathUtils; +import hivemall.utils.math.MatrixUtils; +import org.apache.commons.math3.linear.ArrayRealVector; +import org.apache.commons.math3.linear.Array2DRowRealMatrix; +import org.apache.commons.math3.linear.RealMatrix; +import org.apache.commons.math3.linear.RealVector; +import org.apache.commons.math3.linear.SingularValueDecomposition; + +import javax.annotation.Nonnegative; +import javax.annotation.Nonnull; +import javax.annotation.Nullable; +import java.util.*; + +public class CofactorModel { + +public enum RankInitScheme { +random /* default */, gaussian; + +@Nonnegative +protected float maxInitValue; +@Nonnegative +protected double initStdDev; + +@Nonnull +public static CofactorModel.RankInitScheme resolve(@Nullable String opt) { +if (opt == null) { +return random; +} else if ("gaussian".equalsIgnoreCase(opt)) { +return gaussian; +} else if ("random".equalsIgnoreCase(opt)) { +return random; +} +return random; +} + +public void setMaxInitValue(float maxInitValue) { +this.maxInitValue = maxInitValue; +} + +public void setInitStdDev(double initStdDev) { +this.initStdDev = initStdDev; +} + +} + +private static final int EXPECTED_SIZE = 136861; +@Nonnegative +protected final int factor; + +// rank matrix initialization +protected final RankInitScheme initScheme; + +@Nonnull +private double globalBias; + +// storing trainable latent factors and weights +private Map theta; +private Map beta; +private Map betaBias; --- End diff -- Please use `Object2DoubleMap betaBias` instead to reduce memory consumption. ---
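The memory argument behind the two `Object2DoubleMap` requests can be illustrated with a small sketch. A map from keys to boxed `Double` values allocates one wrapper object per entry, whereas a map backed by a primitive `double[]` stores the values inline. The class below is a hypothetical, minimal open-addressing map written only for illustration; it is not Hivemall's `Object2DoubleMap`, and its name and methods are assumptions.

```java
/** Hypothetical sketch of a String-to-double map with primitive value storage. */
final class PrimitiveDoubleMapSketch {
    // Parallel arrays: values live in a double[] so no Double objects are allocated.
    private String[] keys = new String[16];
    private double[] values = new double[16];
    private int size = 0;

    /** Inserts or overwrites the value for the given key. */
    void put(String key, double value) {
        // Keep the load factor at or below 0.5 so linear probing always finds a slot.
        if ((size + 1) * 2 > keys.length) {
            resize();
        }
        int i = indexOf(key);
        if (keys[i] == null) {
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    /** Returns the stored value, or defaultValue when the key is absent. */
    double get(String key, double defaultValue) {
        int i = indexOf(key);
        return keys[i] == null ? defaultValue : values[i];
    }

    int size() {
        return size;
    }

    // Linear probing over a power-of-two table; returns the key's slot or the first free slot.
    private int indexOf(String key) {
        int mask = keys.length - 1;
        int i = (key.hashCode() & 0x7fffffff) & mask;
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) & mask;
        }
        return i;
    }

    // Doubles the table size and reinserts every existing entry.
    private void resize() {
        String[] oldKeys = keys;
        double[] oldValues = values;
        keys = new String[oldKeys.length * 2];
        values = new double[oldValues.length * 2];
        size = 0;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != null) {
                put(oldKeys[j], oldValues[j]);
            }
        }
    }
}
```

With a layout like this, each bias costs one `double` slot plus the key reference, instead of an additional heap-allocated `Double` per entry as in a `HashMap` with boxed values.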
[GitHub] incubator-hivemall pull request #166: [HIVEMALL-219] Fixed LDA bug for singl...
GitHub user myui opened a pull request: https://github.com/apache/incubator-hivemall/pull/166 [HIVEMALL-219] Fixed LDA bug for single update and added unit tests ## What changes were proposed in this pull request? Fixed LDA bug for single update and added unit tests ## What type of PR is it? Bug Fix ## What is the Jira issue? https://issues.apache.org/jira/browse/HIVEMALL-219 ## How was this patch tested? unit tests and manual tests on EMR ## Checklist (Please remove this section if not needed; check `x` for YES, blank for NO) - [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit? - [x] Did you run system tests on Hive (or Spark)? You can merge this pull request into a Git repository by running: $ git pull https://github.com/myui/incubator-hivemall HIVEMALL-219-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hivemall/pull/166.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #166 commit 202eddd71c00e3889c0a126fe1038df35c1513d9 Author: Makoto Yui Date: 2018-09-18T10:36:02Z Fixed LDA bug for single update and added unit tests ---
[GitHub] incubator-hivemall pull request #165: [HIVEMALL-219][BUGFIX] Fixed NPE in fi...
GitHub user myui opened a pull request: https://github.com/apache/incubator-hivemall/pull/165 [HIVEMALL-219][BUGFIX] Fixed NPE in finalizeTraining() ## What changes were proposed in this pull request? Fixed NPE in finalizeTraining() when there are no training examples ## What type of PR is it? Bug Fix ## What is the Jira issue? https://issues.apache.org/jira/browse/HIVEMALL-219 ## How was this patch tested? to appear ## Checklist (Please remove this section if not needed; check `x` for YES, blank for NO) - [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit? - [ ] Did you run system tests on Hive (or Spark)? You can merge this pull request into a Git repository by running: $ git pull https://github.com/myui/incubator-hivemall HIVEMALL-219 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hivemall/pull/165.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #165 commit bc0e14d1d29ba13b173165bca9d9511b19abbc6e Author: Makoto Yui Date: 2018-09-18T09:42:06Z Fixed NPE in finalizeTraining() ---
[GitHub] incubator-hivemall pull request #164: [HIVEMALL-218] Fixed train_lda NPE whe...
GitHub user myui opened a pull request: https://github.com/apache/incubator-hivemall/pull/164 [HIVEMALL-218] Fixed train_lda NPE where input row is null ## What changes were proposed in this pull request? Fixed NegativeArraySizeException where input is NULL of `train_lda` ## What type of PR is it? Bug Fix ## What is the Jira issue? https://issues.apache.org/jira/browse/HIVEMALL-218 ## How was this patch tested? manual tests ## Checklist - [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit? - [x] Did you run system tests on Hive (or Spark)? You can merge this pull request into a Git repository by running: $ git pull https://github.com/myui/incubator-hivemall HIVEMALL-218 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hivemall/pull/164.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #164 commit 67f6f68acad09c7a0e70f9fbdb183116eeec6a1d Author: Makoto Yui Date: 2018-09-07T08:56:43Z Fixed NegativeArraySizeException where input is NULL commit d367de34e34d42514c0bb6141fbf31f295e33e50 Author: Makoto Yui Date: 2018-09-07T09:15:05Z Fixed NPE in forward() ---
[GitHub] incubator-hivemall issue #163: [HIVEMALL-196][WIP] Support BM25 scoring
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/163 Please add a unit test and evaluate this function in a Hive environment. ---
[GitHub] incubator-hivemall pull request #163: [HIVEMALL-196][WIP] Support BM25 scori...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/163#discussion_r215564184 --- Diff: core/src/main/java/hivemall/ftvec/text/OkapiBM25UDF.java --- @@ -0,0 +1,167 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package hivemall.ftvec.text; + +import hivemall.UDFWithOptions; +import org.apache.commons.cli.CommandLine; +import org.apache.commons.cli.Options; +import org.apache.hadoop.hive.ql.exec.Description; +import org.apache.hadoop.hive.ql.exec.UDFArgumentException; +import org.apache.hadoop.hive.ql.metadata.Hive; +import org.apache.hadoop.hive.ql.metadata.HiveException; +import hivemall.utils.hadoop.HiveUtils; +import org.apache.hadoop.hive.ql.udf.UDFType; +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector; +import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory; +import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils; +import org.apache.hadoop.io.DoubleWritable; + +import javax.annotation.Nonnull; +import java.util.Arrays; + +@Description(name = "okapi_bm25", +value = "_FUNC_(double tf_word, int dl, double avgdl, int N, int n [, const string options]) - Return an Okapi BM25 score in float") +//TODO: What does stateful mean? --- End diff -- https://hive.apache.org/javadocs/r1.2.2/api/org/apache/hadoop/hive/ql/udf/UDFType.html#stateful() So `stateful = false` is okay. Please remove this comment. ---
[GitHub] incubator-hivemall pull request #162: [HIVEMALL-217] Resolve missing links f...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/162#discussion_r215535078 --- Diff: docs/gitbook/tips/emr.md --- @@ -21,15 +21,15 @@ ## Prerequisite Learn how to use Hive with Elastic MapReduce (EMR). -http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive.html +https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html Before launching an EMR job, * create ${s3bucket}/emr/outputs for outputs * optionally, create ${s3bucket}/emr/logs for logging -* put [emr_hivemall_bootstrap.sh](https://raw.github.com/myui/hivemall/master/scripts/misc/emr_hivemall_bootstrap.sh) on ${s3bucket}/emr/conf +* put [emr_hivemall_bootstrap.sh](https://raw.githubusercontent.com/apache/incubator-hivemall/master/resources/misc/emr_hivemall_bootstrap.sh) on ${s3bucket}/emr/conf Then, lunch an EMR job with hive in an interactive mode. -I'm usually lunching EMR instances with cheap Spot instances through [CLI client](http://aws.amazon.com/developertools/2264) as follows: +I'm usually lunching EMR instances with cheap Spot instances through [CLI client](https://aws.amazon.com/jp/tools/) as follows: --- End diff -- should be `https://aws.amazon.com/tools/` ---
[GitHub] incubator-hivemall pull request #162: [HIVEMALL-217] Resolve missing links f...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/162#discussion_r214870585 --- Diff: docs/gitbook/tips/hadoop_tuning.md --- @@ -75,13 +75,13 @@ feature_dimensions (2^24 by the default) * 4 bytes (float) * 2 (iff covariance i ``` > 2^24 * 4 bytes * 2 * 1.2 ≈ 161MB -When [SpaceEfficientDenseModel](https://github.com/apache/incubator-hivemall/blob/master/src/main/java/hivemall/io/SpaceEfficientDenseModel.java) is used, the formula changes as follows: +When [SpaceEfficientDenseModel](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/io/SpaceEfficientDenseModel.java) is used, the formula changes as follows: --- End diff -- `github.com/myui` is deprecated. Use https://github.com/apache/incubator-hivemall/blob/master/core/src/main/java/hivemall/model/SpaceEfficientDenseModel.java instead, and replace the other occurrences of `github.com/myui` as well. ---
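The 161MB figure quoted in the hunk above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the factors shown there: 2^24 feature dimensions, 4 bytes per float weight, a factor of 2 when covariance is kept, and the 1.2 multiplier, which is treated here as a generic overhead factor because the excerpt does not name it. The class and method names are hypothetical and not part of Hivemall.

```java
/** Hypothetical helper that reproduces the dense-model memory estimate from the quoted doc. */
final class DenseModelMemoryEstimate {

    /**
     * @param featureDimensions number of feature slots (2^24 in the quoted example)
     * @param bytesPerWeight    size of one weight in bytes (4 for a float)
     * @param usesCovariance    whether a second array of the same size is kept
     * @param overheadFactor    multiplier from the quoted formula (1.2); its meaning is an assumption
     * @return estimated size in bytes
     */
    static double estimateBytes(long featureDimensions, int bytesPerWeight,
            boolean usesCovariance, double overheadFactor) {
        int arrays = usesCovariance ? 2 : 1;
        return featureDimensions * (double) bytesPerWeight * arrays * overheadFactor;
    }

    public static void main(String[] args) {
        double bytes = estimateBytes(1L << 24, 4, true, 1.2);
        // 16,777,216 * 4 * 2 * 1.2 = 161,061,273.6 bytes, i.e. roughly 161 MB (10^6 bytes)
        System.out.printf("%.1f MB%n", bytes / 1e6);
    }
}
```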
[GitHub] incubator-hivemall pull request #160: [HIVEMALL-163] Add IS_INFINITE, IS_FIN...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/160#discussion_r214800712 --- Diff: core/src/main/java/hivemall/tools/math/IsInfiniteUDF.java --- @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package hivemall.tools.math; + +import org.apache.hadoop.hive.ql.exec.Description; +import org.apache.hadoop.hive.ql.exec.UDF; + +@Description(name = "is_infinite", value = "_FUNC_(x) - Determine if x is infinite.") +public final class IsInfiniteUDF extends UDF { +public Boolean evaluate(Double num) { +if (num == null) { +return null; +} else { +return !num.isNaN() && num.isInfinite(); --- End diff -- Is `!num.isNaN() &&` required? ---
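On the question above: `Double.isInfinite()` is true only for positive or negative infinity, so it already returns false for NaN and the extra guard does not change the result. A minimal sketch comparing the two variants follows; the class and method names are hypothetical and exist only for this comparison.

```java
import java.util.Objects;

/** Hypothetical comparison of the evaluate() body with and without the NaN guard. */
final class IsInfiniteCheck {

    // Variant quoted in the diff: explicit NaN guard before isInfinite().
    static Boolean withNaNGuard(Double num) {
        if (num == null) {
            return null;
        }
        return !num.isNaN() && num.isInfinite();
    }

    // Simplified variant: isInfinite() alone, since NaN is neither +Infinity nor -Infinity.
    static Boolean withoutNaNGuard(Double num) {
        if (num == null) {
            return null;
        }
        return num.isInfinite();
    }

    public static void main(String[] args) {
        Double[] inputs = {null, 0.0, 1.5, Double.NaN, Double.POSITIVE_INFINITY,
                Double.NEGATIVE_INFINITY, Double.MAX_VALUE};
        for (Double x : inputs) {
            boolean same = Objects.equals(withNaNGuard(x), withoutNaNGuard(x));
            System.out.println(x + " -> guard=" + withNaNGuard(x) + ", noGuard="
                    + withoutNaNGuard(x) + ", same=" + same);
        }
    }
}
```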
[GitHub] incubator-hivemall pull request #161: [HIVEMALL-216] Fix Docker image based ...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/161#discussion_r214793366 --- Diff: docs/gitbook/docker/getting_started.md --- @@ -17,29 +17,31 @@ under the License. --> +# Getting started with Hivemall on Docker + This page introduces how to run Hivemall on Docker. > Caution > This docker image contains a single-node Hadoop enviroment for evaluating Hivemall. Not suited for production uses. -# Requirements +## Requirements * Docker Engine 1.6+ * Docker Compose 1.10+ -# 1. Build image +## 1. Build image --- End diff -- Could you remove `1.` and `2.`? See what's happening in http://hivemall.incubator.apache.org/userguide/docker/getting_started.html#1-build-image ---
[GitHub] incubator-hivemall pull request #159: [HIVEMALL-214][DOC] Update userguide f...
GitHub user myui opened a pull request: https://github.com/apache/incubator-hivemall/pull/159 [HIVEMALL-214][DOC] Update userguide for General Classifier/Regressor example ## What changes were proposed in this pull request? Refine user guide for generic classifier/regressor and so on. ## What type of PR is it? Documentation ## What is the Jira issue? https://issues.apache.org/jira/browse/HIVEMALL-214 ## How to use this feature? See user guide. You can merge this pull request into a Git repository by running: $ git pull https://github.com/myui/incubator-hivemall HIVEMALL-214 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hivemall/pull/159.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #159 commit 6f40c466e21c78238a74f9c2f227df8ae156b3e2 Author: Makoto Yui Date: 2018-08-31T07:38:17Z Added general classifier example using a9a dataset commit 4963b63ab685aa539c6c0f5f3cd3230215ba4df7 Author: Makoto Yui Date: 2018-08-31T07:46:31Z Added assertions for deprecated contents commit 472821279d70e4171b7cf391a09bac10c95e28cb Author: Makoto Yui Date: 2018-08-31T08:02:13Z Capitalized topics and fixed a typo commit 649e77840ff154bd75cd7c1bfdfc245516b68b0d Author: Makoto Yui Date: 2018-08-31T11:18:50Z Refined user guide ---
[GitHub] incubator-hivemall issue #158: [HIVEMALL-215] Add step-by-step tutorial on S...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/158 @chezou Merged. Thank you for your first contribution! ---
[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/158#discussion_r214236762 --- Diff: docs/gitbook/supervised_learning/tutorial.md --- @@ -0,0 +1,457 @@ + + +# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall --- End diff -- Remove obvious `with Apache Hivemall` ---
[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/158#discussion_r214222772 --- Diff: docs/gitbook/supervised_learning/tutorial.md --- @@ -0,0 +1,461 @@ + + +# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall + + + +## What is Hivemall? + +[Apache Hive](https://hive.apache.org/) is a data warehousing solution that enables us to process large-scale data in the form of SQL easily. Assume that you have a table named `purchase_history` which can be artificially created as: + +```sql +create table if not exists purchase_history as +select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, "book" as category, 1 as label +union all +select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as price, "sports" as category, 0 as label +union all +select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as price, "entertainment" as category, 0 as label +union all +select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, "food" as category, 0 as label +union all +select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as price, "electronics" as category, 1 as label +; +``` + +The syntax of Hive queries, namely **HiveQL**, is very similar to SQL: + +```sql +select count(1) from purchase_history; +``` + +> 5 + --- End diff -- A general introduction to Apache Hive and HiveQL is not required in Hivemall's documentation. The base document is for introducing Hivemall to TD's customers who might not be aware of the differences between Hive and Presto. You can start with `Apache Hivemall is a ... lines of query as follows:` ---
[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/158#discussion_r214226384 --- Diff: docs/gitbook/supervised_learning/tutorial.md --- @@ -0,0 +1,461 @@ + + +# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall + + + +## What is Hivemall? + +[Apache Hive](https://hive.apache.org/) is a data warehousing solution that enables us to process large-scale data in the form of SQL easily. Assume that you have a table named `purchase_history` which can be artificially created as: + +```sql +create table if not exists purchase_history as +select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, "book" as category, 1 as label +union all +select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as price, "sports" as category, 0 as label +union all +select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as price, "entertainment" as category, 0 as label +union all +select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, "food" as category, 0 as label +union all +select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as price, "electronics" as category, 1 as label +; +``` + +The syntax of Hive queries, namely **HiveQL**, is very similar to SQL: + +```sql +select count(1) from purchase_history; +``` + +> 5 + +[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a collection of user-defined functions (UDFs) for HiveQL which is strongly optimized for machine learning (ML) and data science. 
To give an example, you can efficiently build a logistic regression model with the stochastic gradient descent (SGD) optimization by issuing the following ~10 lines of query: + +```sql +SELECT + train_classifier( +features, +label, +'-loss_function logloss -optimizer SGD' + ) as (feature, weight) +FROM + training +; +``` + + +Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows current Hivemall version, for example: + +```sql +select hivemall_version(); +``` + +> "0.5.1-incubating-SNAPSHOT" + +Below we list ML and relevant problems that Hivemall can solve: + +- [Binary and multi-class classification](../binaryclass/general.html) +- [Regression](../regression/general.html) +- [Recommendation](../recommend/cf.html) +- [Anomaly detection](../anomaly/lof.html) +- [Natural language processing](../misc/tokenizer.html) +- [Clustering](../misc/tokenizer.html) (i.e., topic modeling) +- [Data sketching](../misc/funcs.html#sketching) +- Evaluation + +Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) would be helpful to understand more about an overview of Hivemall. + +This tutorial explains the basic usage of Hivemall with examples of supervised learning of simple regressor and binary classifier. + +## Binary classification + +Imagine a scenario that we like to build a binary classifier from the mock `purchase_history` data and predict unforeseen purchases to conduct a new campaign effectively: + +| day\_of\_week | gender | price | category | label | +|:---:|:---:|:---:|:---:|:---| +|Saturday | male | 600 | book | 1 | +|Friday | female | 4800 | sports | 0 | +|Friday | other | 18000 | entertainment | 0 | +|Thursday | male | 200 | food | 0 | +|Wednesday | female | 1000 | electronics | 1 | + +Use Hivemall [`train_classifier()`](../misc/funcs.html#binary-classification) UDF to tackle the problem as follows. + +### Step 1. 
Feature representation + +First of all, we have to convert the records into pairs of the feature vector and corresponding target value. Here, Hivemall requires you to represent input features in a specific format. + +To be more precise, Hivemall represents single feature in a concatenation of **index** (i.e., **name**) and its **value**: + +- Quantitative feature: `:` + - e.g., `price:600.0` +- Categorical feature: `#` + - e.g., `gender#male` + +Each of those features is a string value in Hive, and "feature vector" means an array of string values like: + +``` +["price:600.0", "day of week#Saturday", "gender#male", "category#book"] +``` + +See also more detailed [document for input format](../getting_started/input-format.html)). + +Therefore, what we first need to do is to convert the records into an array of feature strings, and Hivemall functions [
[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/158#discussion_r214223029 --- Diff: docs/gitbook/supervised_learning/tutorial.md --- @@ -0,0 +1,461 @@ + + +# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall + + + +## What is Hivemall? + +[Apache Hive](https://hive.apache.org/) is a data warehousing solution that enables us to process large-scale data in the form of SQL easily. Assume that you have a table named `purchase_history` which can be artificially created as: + +```sql +create table if not exists purchase_history as +select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, "book" as category, 1 as label +union all +select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as price, "sports" as category, 0 as label +union all +select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as price, "entertainment" as category, 0 as label +union all +select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, "food" as category, 0 as label +union all +select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as price, "electronics" as category, 1 as label +; +``` + +The syntax of Hive queries, namely **HiveQL**, is very similar to SQL: + +```sql +select count(1) from purchase_history; +``` + +> 5 + +[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a collection of user-defined functions (UDFs) for HiveQL which is strongly optimized for machine learning (ML) and data science. 
To give an example, you can efficiently build a logistic regression model with the stochastic gradient descent (SGD) optimization by issuing the following ~10 lines of query: + +```sql +SELECT + train_classifier( +features, +label, +'-loss_function logloss -optimizer SGD' + ) as (feature, weight) +FROM + training +; +``` + + +Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows current Hivemall version, for example: + +```sql +select hivemall_version(); +``` + +> "0.5.1-incubating-SNAPSHOT" + +Below we list ML and relevant problems that Hivemall can solve: + +- [Binary and multi-class classification](../binaryclass/general.html) +- [Regression](../regression/general.html) +- [Recommendation](../recommend/cf.html) +- [Anomaly detection](../anomaly/lof.html) +- [Natural language processing](../misc/tokenizer.html) +- [Clustering](../misc/tokenizer.html) (i.e., topic modeling) +- [Data sketching](../misc/funcs.html#sketching) +- Evaluation + +Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) would be helpful to understand more about an overview of Hivemall. + +This tutorial explains the basic usage of Hivemall with examples of supervised learning of simple regressor and binary classifier. + +## Binary classification + +Imagine a scenario that we like to build a binary classifier from the mock `purchase_history` data and predict unforeseen purchases to conduct a new campaign effectively: + +| day\_of\_week | gender | price | category | label | +|:---:|:---:|:---:|:---:|:---| +|Saturday | male | 600 | book | 1 | +|Friday | female | 4800 | sports | 0 | +|Friday | other | 18000 | entertainment | 0 | +|Thursday | male | 200 | food | 0 | +|Wednesday | female | 1000 | electronics | 1 | + --- End diff -- Insert here something like.. You can create this table as follows: ```sql create table if not exists purchase_history as .. ``` ---
[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/158#discussion_r214223937 --- Diff: docs/gitbook/supervised_learning/tutorial.md --- @@ -0,0 +1,461 @@ + + +# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall + + + +## What is Hivemall? + +[Apache Hive](https://hive.apache.org/) is a data warehousing solution that enables us to process large-scale data in the form of SQL easily. Assume that you have a table named `purchase_history` which can be artificially created as: + +```sql +create table if not exists purchase_history as +select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, "book" as category, 1 as label +union all +select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as price, "sports" as category, 0 as label +union all +select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as price, "entertainment" as category, 0 as label +union all +select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, "food" as category, 0 as label +union all +select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as price, "electronics" as category, 1 as label +; +``` + +The syntax of Hive queries, namely **HiveQL**, is very similar to SQL: + +```sql +select count(1) from purchase_history; +``` + +> 5 + +[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a collection of user-defined functions (UDFs) for HiveQL which is strongly optimized for machine learning (ML) and data science. 
To give an example, you can efficiently build a logistic regression model with the stochastic gradient descent (SGD) optimization by issuing the following ~10 lines of query: + +```sql +SELECT + train_classifier( +features, +label, +'-loss_function logloss -optimizer SGD' + ) as (feature, weight) +FROM + training +; +``` + + +Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows current Hivemall version, for example: + +```sql +select hivemall_version(); +``` + +> "0.5.1-incubating-SNAPSHOT" + +Below we list ML and relevant problems that Hivemall can solve: + +- [Binary and multi-class classification](../binaryclass/general.html) +- [Regression](../regression/general.html) +- [Recommendation](../recommend/cf.html) +- [Anomaly detection](../anomaly/lof.html) +- [Natural language processing](../misc/tokenizer.html) +- [Clustering](../misc/tokenizer.html) (i.e., topic modeling) +- [Data sketching](../misc/funcs.html#sketching) +- Evaluation + +Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) would be helpful to understand more about an overview of Hivemall. + +This tutorial explains the basic usage of Hivemall with examples of supervised learning of simple regressor and binary classifier. + +## Binary classification + +Imagine a scenario that we like to build a binary classifier from the mock `purchase_history` data and predict unforeseen purchases to conduct a new campaign effectively: + +| day\_of\_week | gender | price | category | label | +|:---:|:---:|:---:|:---:|:---| +|Saturday | male | 600 | book | 1 | +|Friday | female | 4800 | sports | 0 | +|Friday | other | 18000 | entertainment | 0 | +|Thursday | male | 200 | food | 0 | +|Wednesday | female | 1000 | electronics | 1 | + +Use Hivemall [`train_classifier()`](../misc/funcs.html#binary-classification) UDF to tackle the problem as follows. + +### Step 1. 
Feature representation + +First of all, we have to convert the records into pairs of the feature vector and corresponding target value. Here, Hivemall requires you to represent input features in a specific format. + +To be more precise, Hivemall represents single feature in a concatenation of **index** (i.e., **name**) and its **value**: + +- Quantitative feature: `:` + - e.g., `price:600.0` +- Categorical feature: `#` + - e.g., `gender#male` + --- End diff -- Better to insert the following sentences after the example. Feature index and feature value are separated by a colon. When the colon and the value are omitted, the value is considered to be `1.0`. So, a categorical feature `gender#male` is a [one-hot representation](https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science) of `index := gender#male` and `value := 1.0`. Note that `#` is not a special character. ---
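The index/value rule discussed in this review comment can be illustrated with a small parser. This is only a sketch under stated assumptions, not Hivemall's own parsing code: the class name `FeatureStringSketch` and the method `parse` are hypothetical, and splitting at the last `:` is an assumption made for the example. It shows `price:600.0` yielding the value 600.0 and `gender#male` (no `:`) defaulting to 1.0.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical sketch; not part of Hivemall.
final class FeatureStringSketch {

    /** Splits "index:value"; when no ':' is present the value defaults to 1.0. */
    public static Map.Entry<String, Double> parse(String feature) {
        int pos = feature.lastIndexOf(':'); // assumption: split at the last colon
        if (pos < 0) {
            // e.g. "gender#male": '#' is an ordinary character, the whole string is the index
            return new SimpleEntry<>(feature, 1.0d);
        }
        String index = feature.substring(0, pos);
        double value = Double.parseDouble(feature.substring(pos + 1));
        return new SimpleEntry<>(index, value);
    }

    public static void main(String[] args) {
        for (String f : new String[] {"price:600.0", "gender#male"}) {
            Map.Entry<String, Double> e = parse(f);
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Treating a missing value as 1.0 is what makes a categorical feature string behave like a one-hot entry without any extra encoding step.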
[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/158#discussion_r213890176 --- Diff: docs/gitbook/getting_started/tutorial.md --- @@ -0,0 +1,493 @@ + + +# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall + + + +## What is Hivemall? + +[Apache Hive](https://hive.apache.org/) is a data warehousing solution that enables us to process large-scale data in the form of SQL easily. Assume that you have a table named `purchase_history` which can be artificially created as: + +```sql +create table if not exists purchase_history +(id bigint, day_of_week string, price int, category string, label int) +; +``` + + +```sql +insert overwrite table purchase_history +select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, "book" as category, 1 as label +union all +select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as price, "sports" as category, 0 as label +union all +select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as price, "entertainment" as category, 0 as label +union all +select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, "food" as category, 0 as label +union all +select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as price, "electronics" as category, 1 as label +; +``` + +The syntax of Hive queries, namely **HiveQL**, is very similar to SQL: + +```sql +select count(1) from purchase_log +``` + +> 5 + +[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a collection of user-defined functions (UDFs) for HiveQL which is strongly optimized for machine learning (ML) and data science. 
To give an example, you can efficiently build a logistic regression model with the stochastic gradient descent (SGD) optimization by issuing the following ~10 lines of query: + +```sql +SELECT + train_classifier( +features, +label, +'-loss_function logloss -optimizer SGD' + ) as (feature, weight) +FROM + training +; +``` + + +On the TD console, Hivemall function [`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others) shows current Hivemall version that is available on TD, for example: + +```sql +select hivemall_version() +``` + +> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018) + +Below we list ML and relevant problems that Hivemall and TD can solve: + +- Binary and multi-class classification +- Regression +- Recommendation +- Anomaly detection +- Natural language processing +- Clustering (i.e., topic modeling) +- Data sketching +- Evaluation + +Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) would be helpful to understand more about an overview of Hivemall. + +This tutorial explains the basic usage of Hivemall with examples of supervised learning of simple regressor and binary classifier. + +## Binary classification + +Imagine a scenario that we like to build a binary classifier from the mock `purchase_history` data and predict unforeseen purchases to conduct a new campaign effectively: + +| day\_of\_week | gender | price | category | label | +|:---:|:---:|:---:|:---:|:---| +|Saturday | male | 600 | book | 1 | +|Friday | female | 4800 | sports | 0 | +|Friday | other | 18000 | entertainment | 0 | +|Thursday | male | 200 | food | 0 | +|Wednesday | female | 1000 | electronics | 1 | + +Use Hivemall [`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification) UDF to tackle the problem as follows. + +### Step 1. Feature representation + +First of all, we have to convert the records into pairs of the feature vector and corresponding target value. 
Here, Hivemall requires you to represent input features in a specific format. + +To be more precise, Hivemall represents single feature in a concatenation of **index** (i.e., **name**) and its **value**: + +- Quantitative feature: `:` + - e.g., `price:600.0` +- Categorical feature: `#` + - e.g., `gender#male` + +Each of those features is a string value in Hive, and "feature vector" means an array of string values like: + +``` +["price:600.0", "day of week#Saturday", "gender#male", "category#book"] +``` + +Therefore, what we first need to do is to convert the records into an array of feature strings
[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/158#discussion_r213890053 --- Diff: docs/gitbook/getting_started/tutorial.md --- @@ -0,0 +1,493 @@ + + +# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall + + + +## What is Hivemall? + +[Apache Hive](https://hive.apache.org/) is a data warehousing solution that enables us to process large-scale data in the form of SQL easily. Assume that you have a table named `purchase_history` which can be artificially created as: + +```sql +create table if not exists purchase_history +(id bigint, day_of_week string, price int, category string, label int) +; +``` + + +```sql +insert overwrite table purchase_history +select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, "book" as category, 1 as label +union all +select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as price, "sports" as category, 0 as label +union all +select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as price, "entertainment" as category, 0 as label +union all +select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, "food" as category, 0 as label +union all +select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as price, "electronics" as category, 1 as label +; +``` + +The syntax of Hive queries, namely **HiveQL**, is very similar to SQL: + +```sql +select count(1) from purchase_log +``` + +> 5 + +[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a collection of user-defined functions (UDFs) for HiveQL which is strongly optimized for machine learning (ML) and data science. 
To give an example, you can efficiently build a logistic regression model with the stochastic gradient descent (SGD) optimization by issuing the following ~10 lines of query: + +```sql +SELECT + train_classifier( +features, +label, +'-loss_function logloss -optimizer SGD' + ) as (feature, weight) +FROM + training +; +``` + + +On the TD console, Hivemall function [`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others) shows current Hivemall version that is available on TD, for example: + +```sql +select hivemall_version() +``` + +> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018) + +Below we list ML and relevant problems that Hivemall and TD can solve: --- End diff -- remove `TD` ---
[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/158#discussion_r213890384 --- Diff: docs/gitbook/SUMMARY.md --- @@ -25,6 +25,7 @@ * [Installation](getting_started/installation.md) * [Install as permanent functions](getting_started/permanent-functions.md) * [Input Format](getting_started/input-format.md) +* [Step-by-Step Tutorial on Supervised Learning](getting_started/tutorial.md) --- End diff -- Better moved to `Supervised Learning` or `Regression` section or with renaming. ---
[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/158#discussion_r213890012 --- Diff: docs/gitbook/getting_started/tutorial.md --- @@ -0,0 +1,493 @@ + + +# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall + + + +## What is Hivemall? + +[Apache Hive](https://hive.apache.org/) is a data warehousing solution that enables us to process large-scale data in the form of SQL easily. Assume that you have a table named `purchase_history` which can be artificially created as: + +```sql +create table if not exists purchase_history +(id bigint, day_of_week string, price int, category string, label int) +; +``` + + +```sql +insert overwrite table purchase_history +select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, "book" as category, 1 as label +union all +select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as price, "sports" as category, 0 as label +union all +select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as price, "entertainment" as category, 0 as label +union all +select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, "food" as category, 0 as label +union all +select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as price, "electronics" as category, 1 as label +; +``` + +The syntax of Hive queries, namely **HiveQL**, is very similar to SQL: + +```sql +select count(1) from purchase_log +``` + +> 5 + +[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a collection of user-defined functions (UDFs) for HiveQL which is strongly optimized for machine learning (ML) and data science. 
To give an example, you can efficiently build a logistic regression model with the stochastic gradient descent (SGD) optimization by issuing the following ~10 lines of query: + +```sql +SELECT + train_classifier( +features, +label, +'-loss_function logloss -optimizer SGD' + ) as (feature, weight) +FROM + training +; +``` + + +On the TD console, Hivemall function [`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others) shows current Hivemall version that is available on TD, for example: --- End diff -- `TD console` should not appear here. ---
[GitHub] incubator-hivemall pull request #157: [HIVEMALL-212] Fix Classifier/Regresso...
GitHub user myui opened a pull request: https://github.com/apache/incubator-hivemall/pull/157 [HIVEMALL-212] Fix Classifier/Regressor not to forward zero weighted values ## What changes were proposed in this pull request? Features with weight = 0.0 need not be saved in the prediction model, and leaving them out reduces the size of the prediction model. So, this PR fixes Classifier/Regressor not to forward zero-weighted values. ## What type of PR is it? Improvement ## What is the Jira issue? https://issues.apache.org/jira/browse/HIVEMALL-212 ## How was this patch tested? unit tests and manual tests ## Checklist (Please remove this section if not needed; check `x` for YES, blank for NO) - [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit? - [ ] Did you run system tests on Hive (or Spark)? You can merge this pull request into a Git repository by running: $ git pull https://github.com/myui/incubator-hivemall HIVEMALL-212 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hivemall/pull/157.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #157 commit 48aacae519837a7d69c5927cf0de470d29c6ee29 Author: Makoto Yui Date: 2018-08-27T09:54:16Z Fixed not to hold zero weight features commit 3954a2720502f027ff7f2b5b0cd08e1e77f66017 Author: Makoto Yui Date: 2018-08-27T09:54:43Z Zero division handling commit de16c54dcb7351ea901f81a3a4263eaef347bc60 Author: Makoto Yui Date: 2018-08-28T05:50:25Z Fixed zero weighted feature handling commit ddd88d42536dc2f59efdbcc9dfa86aeda3223a2f Author: Makoto Yui Date: 2018-08-28T05:51:41Z Added final ---
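The idea behind this PR, skipping features whose weight is exactly 0.0 when the model is emitted, can be sketched as follows. This is a minimal illustration, not the code in the patch: the class `ZeroWeightFilterSketch`, the method `nonZeroWeights`, and the use of a plain `Map` as the model are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the actual patch works on Hivemall's own model classes.
final class ZeroWeightFilterSketch {

    /** Returns only the (feature, weight) pairs that are worth emitting. */
    public static Map<String, Float> nonZeroWeights(Map<String, Float> model) {
        Map<String, Float> out = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : model.entrySet()) {
            float w = e.getValue();
            if (w == 0.f) {
                continue; // a zero weight adds nothing to a linear prediction
            }
            out.put(e.getKey(), w);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Float> model = new LinkedHashMap<>();
        model.put("price", 0.5f);
        model.put("gender#male", 0.f);
        System.out.println(nonZeroWeights(model).keySet());
    }
}
```

Because a missing feature and a zero-weighted feature contribute the same amount to a linear score, dropping the latter shrinks the model table without changing predictions.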
[GitHub] incubator-hivemall issue #156: [HIVEMALL-211][BUGFIX] Fixed Optimizer for re...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/156 confirmed optimizer is working fine using a9a classification. https://gist.github.com/myui/a33a06ff3cf7db0e63ba46ec29703e43 ---
[GitHub] incubator-hivemall issue #156: [HIVEMALL-211][BUGFIX] Fixed Optimizer for re...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/156 @takuti fixed in https://github.com/apache/incubator-hivemall/pull/156/commits/84d1aeb9ca06fd5e6d83686b183543a1d57b06c8 FYI ---
[GitHub] incubator-hivemall pull request #156: [HIVEMALL-211][BUGFIX] Fixed Optimizer...
GitHub user myui opened a pull request: https://github.com/apache/incubator-hivemall/pull/156 [HIVEMALL-211][BUGFIX] Fixed Optimizer for regularization updates ## What changes were proposed in this pull request? This PR fixes a bug in the regularization scheme of Optimizer. ## What type of PR is it? Bug Fix ## What is the Jira issue? https://issues.apache.org/jira/browse/HIVEMALL-211 ## How was this patch tested? unit tests, manual tests on EMR ## Checklist (Please remove this section if not needed; check `x` for YES, blank for NO) - [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit? - [ ] Did you run system tests on Hive (or Spark)? You can merge this pull request into a Git repository by running: $ git pull https://github.com/myui/incubator-hivemall HIVEMALL-211 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hivemall/pull/156.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #156 commit 84d1aeb9ca06fd5e6d83686b183543a1d57b06c8 Author: Makoto Yui Date: 2018-08-24T05:54:23Z Fixed regularization scheme and updated Adagrad rule ---
[GitHub] incubator-hivemall issue #155: [HIVEMALL-201-2] Evaluate, fix and document F...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/155 @takuti will merge after EMR tests. FYI ---
[GitHub] incubator-hivemall pull request #155: [HIVEMALL-201-2] Evaluate, fix and doc...
GitHub user myui opened a pull request: https://github.com/apache/incubator-hivemall/pull/155 [HIVEMALL-201-2] Evaluate, fix and document FFM ## What changes were proposed in this pull request? Applied some refactoring to #149 This PR closes #149 ## What type of PR is it? Hot Fix, Refactoring ## What is the Jira issue? https://issues.apache.org/jira/browse/HIVEMALL-201 ## How was this patch tested? unit tests, manual tests ## How to use this feature? Will be published at: http://hivemall.incubator.apache.org/userguide/binaryclass/criteo_ffm.html ## Checklist (Please remove this section if not needed; check `x` for YES, blank for NO) - [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit? - [x] Did you run system tests on Hive (or Spark)? You can merge this pull request into a Git repository by running: $ git pull https://github.com/myui/incubator-hivemall HIVEMALL-201-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hivemall/pull/155.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #155 commit c4d6855d6286249e150e4c8dcd5413bcde339990 Author: Takuya Kitazawa Date: 2018-05-16T08:39:32Z Use pre-defined constants in option description commit f7e7e1d49e5fa2e4f4f50d55f85c5cdee3bb69b1 Author: Takuya Kitazawa Date: 2018-05-16T08:40:48Z Fix mismatch between opts.addOption and cl.getOptionValue commit 929781a982f86851e38d558bb79a239d90c90e76 Author: Takuya Kitazawa Date: 2018-05-16T08:41:34Z Support FFM feature format in `l1_normalize` and `l2_normalize` commit a1751361f8ae2204cdc6507514945ebaa1ddf179 Author: Takuya Kitazawa Date: 2018-05-21T06:02:14Z Increase `alphaFTRL` in `testSampleEnableNorm` for convergence commit ff049d776133d1bc0cf7e62d9740f22a3943f593 Author: Takuya Kitazawa Date: 2018-05-22T02:16:51Z Fix typo commit 35a02451fc4e8a55bbb49b7fede3c545145b7d6e Author: Takuya Kitazawa 
Date: 2018-05-22T05:22:35Z Fix bug in forward model Due to typo, linear weights in model are not correctly forwarded. commit 9782136e3059df1d334c814c9eb9455e1ec9b573 Author: Takuya Kitazawa Date: 2018-05-22T06:39:22Z Fix order of computing AdaGrad learning rate * Gradient includes regularization term * Get sum of squared gradient after adding the latest gradient See: https://github.com/guestwalk/libffm/blob/7db5b4f1ad3af7eb5bd0c224b2fa5305e1a715d2/ffm.cpp#L219-L226 commit 2366d910581248249a4e69e1110675469a17ea99 Author: Takuya Kitazawa Date: 2018-05-22T06:47:03Z Enable to specify initial learn rate for AdaGrad commit f1fd20cd508a8473bd0fef037cd708d5c3379c5f Author: Takuya Kitazawa Date: 2018-05-22T08:35:36Z Make `-max_init_value` more meaningful In fact, the code sampled random value from [0, max_init_value / k], but users expect that each element in V is exactly initialized random values in [0, max_init_value]. commit 478f26dab385b3835cdfbe19d40beef47336d92d Author: Takuya Kitazawa Date: 2018-05-23T05:19:17Z Add `-l2norm` option to FeaturePairsUDTF Users can configure if feature vector is L2 normalized in a similar way to `train_ffm`. commit 3627ca84e857210aa921fd607fed19759d26fba0 Author: Takuya Kitazawa Date: 2018-05-23T06:27:02Z Switch `-disable_wi` option to `-enable_wi` commit e2c378f5134c67d25047169324c6aa9df62e8b8f Author: Takuya Kitazawa Date: 2018-05-23T07:01:09Z Fix test broken by change of default learn rate for FFM+AdaGrad commit 056dfde30437c9bbcfcaf292698ba97dfa67 Author: Takuya Kitazawa Date: 2018-05-23T07:27:34Z FFM applies instance-wise L2 normalization by default commit 91aed6ecdc5401d972eac534e54246c59fd15ebb Author: Takuya Kitazawa Date: 2018-05-24T00:48:37Z Increase default number of iterations to rely more on cv_test commit dca7e5762d664039354d00da8c3ca9adccd5d7c2 Author: Takuya Kitazawa Date: 2018-05-24T04:23:24Z Make default L2 regularization parameter smaller New default value 0.0001 is same as FTRL and general regressor/classifier. 
0.01 was too large for small data; in some cases a model could not be learned successfully. By contrast, LIBFFM uses a very small value, 0.00002, by default. This commit sets 0.0001, a value between these two, as a compromise. commit f84c960285f04ada21fb346e94ed0b5683d31289 Author: Takuya Kitazawa Date: 2018-05-24T04:49:27Z Increase default learn rate from 0.05 to 0.1 Referred to the following implementations. LIBFFM: 0.2 (with AdaGrad) https://github.com/guestwalk/libffm/blob/740103e5eb920a4061dd8e977a2ede6d23c6910a/ffm.h#L31 libFM: 0.1 https://github.com/srendle/libfm
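The ordering described in the "Fix order of computing AdaGrad learning rate" commit above (the gradient includes the regularization term, and the squared gradient is accumulated before the learning rate is derived) can be sketched for a single weight as below. This is an illustrative sketch, not the FFM model code: the class `AdaGradOrderSketch`, its fields, and the zero-gradient guard are assumptions, and the `2 * lambda * w` regularization term is modeled on the diff quoted elsewhere in this thread.

```java
// Hypothetical single-weight sketch; not the Hivemall implementation.
final class AdaGradOrderSketch {
    float w;          // current weight
    float sumSqGrad;  // running sum of squared gradients

    AdaGradOrderSketch(float w0) {
        this.w = w0;
    }

    void update(float dloss, float xi, float lambda, float eta0) {
        // 1. the gradient includes the L2 regularization term
        float grad = dloss * xi + 2.f * lambda * w;
        if (grad == 0.f) {
            return; // assumption: skip to avoid dividing by a zero accumulator
        }
        // 2. add the latest squared gradient to the accumulator first
        sumSqGrad += grad * grad;
        // 3. only then derive the per-weight learning rate and take the step
        float eta = eta0 / (float) Math.sqrt(sumSqGrad);
        w -= eta * grad;
    }

    public static void main(String[] args) {
        AdaGradOrderSketch s = new AdaGradOrderSketch(0.5f);
        s.update(1.f, 2.f, 0.1f, 0.1f);
        System.out.println("w=" + s.w + " sumSqGrad=" + s.sumSqGrad);
    }
}
```

With this ordering, the very first step has magnitude `eta0`, which is one reason the initial learning rate for AdaGrad is discussed separately in this thread.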
[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/149#discussion_r211470802 --- Diff: core/src/main/java/hivemall/fm/FieldAwareFactorizationMachineModel.java --- @@ -123,17 +117,18 @@ void updateWi(final double dloss, @Nonnull final Feature x, final long t) { } final double Xi = x.getValue(); -float gradWi = (float) (dloss * Xi); final Entry theta = getEntryW(x); float wi = theta.getW(); -final float eta = eta(theta, t, gradWi); -float nextWi = wi - eta * (gradWi + 2.f * _lambdaW * wi); +float grad = (float) (dloss * Xi + 2.f * _lambdaW * wi); --- End diff -- regularization should not be performed here (?) ---
[GitHub] incubator-hivemall issue #139: [HIVEMALL-182][SPARK][WIP] Add an optimizer r...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/139 @maropu is this PR still WIP? ---
[GitHub] incubator-hivemall issue #154: [HIVEMALL-210][BUGFIX] Fix a bug in lda_predi...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/154 @takuti thank you for the comments. Reflected your reviews. ---
[GitHub] incubator-hivemall issue #154: [HIVEMALL-210][BUGFIX] Fix a bug in lda_predi...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/154 @takuti could you review this PR? ---
[GitHub] incubator-hivemall pull request #154: [HIVEMALL-210][BUGFIX] Fix a bug in ld...
GitHub user myui opened a pull request: https://github.com/apache/incubator-hivemall/pull/154 [HIVEMALL-210][BUGFIX] Fix a bug in lda_predict/plsa_predict ## What changes were proposed in this pull request? Fixed a bug in lda_predict/plsa_predict where the probability of a duplicated term is [unexpectedly replaced](https://github.com/apache/incubator-hivemall/blame/a8a97d6e873d5a8a30b06f92ddc14d1ec95c2738/core/src/main/java/hivemall/topicmodel/LDAPredictUDAF.java#L396) ## What type of PR is it? Bug Fix ## What is the Jira issue? https://issues.apache.org/jira/browse/HIVEMALL-210 ## How was this patch tested? unit tests and manual tests ## Checklist - [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit? - [ ] Did you run system tests on Hive (or Spark)? You can merge this pull request into a Git repository by running: $ git pull https://github.com/myui/incubator-hivemall HIVEMALL-210 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hivemall/pull/154.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #154 commit 4e38897afc7af92e82635198359103d79b25dc82 Author: Makoto Yui Date: 2018-08-04T17:55:59Z Added sortable KeyValue structs commit 2ab5bf5cf3862f20e7c5aa096cf8d7c65cde9b50 Author: Makoto Yui Date: 2018-08-04T17:56:37Z Fixed a bug in lda_predict and plsa_predict ---
[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/149#discussion_r205390805 --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineModel.java --- @@ -399,9 +399,8 @@ public void initRandom(int factor, long seed) { protected static final void uniformFill(final float[] a, final Random rand, final float maxInitValue) { final int len = a.length; -final float basev = maxInitValue / len; for (int i = 0; i < len; i++) { -float v = rand.nextFloat() * basev; +float v = rand.nextFloat() * maxInitValue; --- End diff -- This modified `random` initialization is used only for regression, not for classification, while your evaluation covers only classification. Thus, it is doubtful that this change contributed to the improved accuracy. ---
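To make the effect of the diff above concrete, here is a minimal sketch of the two initialization variants side by side. The class name `UniformFillSketch` and the method name `uniformFillScaled` are hypothetical, and the assignment `a[i] = v` is an assumption because the diff is cut off before that line.

```java
import java.util.Random;

final class UniformFillSketch {

    // Variant removed by the diff: values are drawn from [0, maxInitValue / len).
    static void uniformFillScaled(final float[] a, final Random rand, final float maxInitValue) {
        final int len = a.length;
        final float basev = maxInitValue / len;
        for (int i = 0; i < len; i++) {
            a[i] = rand.nextFloat() * basev; // assumed assignment, not shown in the diff
        }
    }

    // Variant added by the diff: values are drawn from [0, maxInitValue).
    static void uniformFill(final float[] a, final Random rand, final float maxInitValue) {
        final int len = a.length;
        for (int i = 0; i < len; i++) {
            a[i] = rand.nextFloat() * maxInitValue; // assumed assignment, not shown in the diff
        }
    }
}
```

With the same seed, the second variant yields values `len` times larger than the first, so the initial latent factors grow with the number of factors unless `maxInitValue` is lowered accordingly.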
[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/149#discussion_r200611442 --- Diff: core/src/main/java/hivemall/fm/FieldAwareFactorizationMachineModel.java --- @@ -51,11 +50,6 @@ public FieldAwareFactorizationMachineModel(@Nonnull FFMHyperParameters params) { super(params); this._params = params; -if (params.useAdaGrad) { -this._eta0 = 1.0f; --- End diff -- better to use large default eta0 for adagrad. ---
[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/149#discussion_r200605210 --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java --- @@ -351,19 +370,29 @@ private static void writeBuffer(@Nonnull ByteBuffer srcBuf, @Nonnull NioStateful srcBuf.clear(); } -public void train(@Nonnull final Feature[] x, final double y, -final boolean adaptiveRegularization) throws HiveException { +protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException { _model.check(x); +} + +protected void processValidationSample(@Nonnull final Feature[] x, final double y) +throws HiveException { +if (_adaptiveRegularization) { +trainLambda(x, y); // adaptive regularization +} +if (_earlyStopping) { +double p = _model.predict(x); +double loss = _lossFunction.loss(p, y); +_validationState.incrLoss(loss); +} +} + +public void train(@Nonnull final Feature[] x, final double y, final boolean validation) +throws HiveException { +checkInputVector(x); --- End diff -- Avoid too many virtual method calls. `_model.check(x);` is enough for both FM and FFM. ---
[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/149#discussion_r200604967 --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java --- @@ -351,19 +370,29 @@ private static void writeBuffer(@Nonnull ByteBuffer srcBuf, @Nonnull NioStateful srcBuf.clear(); } -public void train(@Nonnull final Feature[] x, final double y, -final boolean adaptiveRegularization) throws HiveException { +protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException { _model.check(x); +} + +protected void processValidationSample(@Nonnull final Feature[] x, final double y) +throws HiveException { +if (_adaptiveRegularization) { +trainLambda(x, y); // adaptive regularization +} +if (_earlyStopping) { --- End diff -- It is better to perform earlyStopping before adaptiveRegularization. ---
[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/149#discussion_r200590772 --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java --- @@ -283,9 +293,16 @@ public void process(Object[] args) throws HiveException { } ++_t; -recordTrain(x, y); -boolean adaptiveRegularization = (_va_rand != null) && _t >= _validationThreshold; -train(x, y, adaptiveRegularization); + +boolean validation = false; +if ((_va_rand != null) && _t >= _validationThreshold) { +final float rnd = _va_rand.nextFloat(); +validation = rnd < _validationRatio; +} + +recordTrain(x, y, validation); + +train(x, y, validation); --- End diff -- Validation examples are fixed in this implementation. Also, not using non-validation examples for regularization is a bad strategy. ---
[GitHub] incubator-hivemall issue #153: [HIVEMALL-208] Upgrade to Lucene 5.5.5
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/153 @iijima-satoshi LGTM. Merged. Thank you for your contribution! ---
[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/149#discussion_r200093417 --- Diff: core/src/main/java/hivemall/fm/FieldAwareFactorizationMachineModel.java --- @@ -259,9 +255,9 @@ protected final float eta(@Nonnull final Entry theta, final long t, final float protected final float eta(@Nonnull final Entry theta, @Nonnegative final int f, final long t, final float grad) { if (_useAdaGrad) { -double gg = theta.getSumOfSquaredGradients(f); --- End diff -- @takuti This behavior (the one used in libffm) is wrong in a strict sense, and the previous code is much better: the initial eta should equal `eta0`, but in this implementation it depends on the initial gradient. ---
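The point about the initial eta can be illustrated with a small sketch. It is not the Hivemall or libffm code, whose exact formulas are not shown in this thread; it assumes an AdaGrad-style accumulator of squared gradients that starts at 1, and the class and method names are hypothetical.

```java
final class AdaGradEtaSketch {

    // Scale by the gradients accumulated so far, before adding the current one.
    // With an accumulator that starts at 1 (an assumption), the first step uses exactly eta0.
    static double etaBeforeUpdate(final double eta0, final double sumSqGrad) {
        return eta0 / Math.sqrt(sumSqGrad);
    }

    // Add the current squared gradient first, then scale.
    // The very first learning rate now depends on the first gradient.
    static double etaAfterUpdate(final double eta0, final double sumSqGrad, final double grad) {
        return eta0 / Math.sqrt(sumSqGrad + grad * grad);
    }
}
```

Under these assumptions, `etaBeforeUpdate(0.1, 1.0)` equals `eta0`, while `etaAfterUpdate(0.1, 1.0, 3.0)` is already about a third of `eta0` on the first update.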
[GitHub] incubator-hivemall issue #153: [HIVEMALL-208] Upgrade to Lucene 5.5.5
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/153 @iijima-satoshi Thank you for the contribution. Will merge after testing. @takuti You need to update the Lucene version to `5.5.5` in `tokenize_ja_kuromoji`. https://github.com/takuti/hive-udf-neologd/blob/master/pom.xml#L16 ---
[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/149 > -lambda 0.0001 (default), -init_v adjusted_random 0.6756640217829124 0.8644404496920104 > -lambda 0.001, -init_v adjusted_random 0.6749224090640931 0.8642914100412997 > -lambda 0.002, -init_v adjusted_random 0.6729486759257253 0.862249033512779 > -lambda 0.01, -init_v adjusted_random 0.6728088660666263 0.8568219312625348 * libfm ``` eta=0.1 init_stdev=0.1 reg0 = 0.0; regw = 0.0; regv = 0.0; ``` https://github.com/srendle/libfm/blob/4ba0e0d5646da5d00701d853d19fbbe9b236cfd7/src/libfm/libfm.cpp#L87 https://github.com/srendle/libfm/blob/30b9c799c41d043f31565cbf827bf41d0dc3e2ab/src/fm_core/fm_model.h#L73 * libffm ``` eta = 0.1; // learning rate lambda = 0.2; // regularization parameter nr_iters = 15; k = 4; // number of latent factors ``` https://github.com/srendle/libfm/blob/4ba0e0d5646da5d00701d853d19fbbe9b236cfd7/src/libfm/libfm.cpp#L84 ---
[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/149 revising in https://github.com/myui/incubator-hivemall/commits/HIVEMALL-201-2 ---
[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/149 @takuti with the modified default hyperparameter of FM, the performance of FM is getting worse. Before > 0.6736798239047873 (mae) 0.858938110314545 (rmse) After > 0.6837803085633278 (mae) 0.876690912076831 (rmse) http://hivemall.incubator.apache.org/userguide/recommend/movielens_fm.html ---
[GitHub] incubator-hivemall pull request #151: Relocated org.codehaus.jackson to hive...
GitHub user myui opened a pull request: https://github.com/apache/incubator-hivemall/pull/151 Relocated org.codehaus.jackson to hivemall.codehause.jackson in hivemall-all.jar ## What changes were proposed in this pull request? Relocated `org.codehaus.jackson` to `hivemall.codehause.jackson` in hivemall-all.jar because Jackson can be missing in some Hadoop/Hive environments ## What type of PR is it? Improvement ## What is the Jira issue? https://issues.apache.org/jira/browse/HIVEMALL-203 ## How was this patch tested? manual tests ## Checklist (Please remove this section if not needed; check `x` for YES, blank for NO) - [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit? - [ ] Did you run system tests on Hive (or Spark)? You can merge this pull request into a Git repository by running: $ git pull https://github.com/myui/incubator-hivemall relocate_jackson Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hivemall/pull/151.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #151 commit a07350f2e8e69a6fcd494df714f1108476b97bc8 Author: Makoto Yui Date: 2018-06-10T10:00:30Z Relocated org.codehaus.jackson to hivemall.codehause.jackson in hivemall-all.jar ---
[GitHub] incubator-hivemall issue #135: [WIP][HIVEMALL-145] Merge Brickhouse function...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/135 I'm going to merge this PR to master. If you find any problem, please comment here. ---
[GitHub] incubator-hivemall issue #135: [WIP][HIVEMALL-145] Merge Brickhouse function...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/135 For K-minimum Values (KMV) and Sketch-related code, I'll create another JIRA ticket. For other UDFs, we accept incoming PRs. https://docs.google.com/spreadsheets/d/1gtFNcTvPR9OZAsbobj2D9d37tOx4nAoSlib9CLdEDQg/edit#gid=0 ---
[GitHub] incubator-hivemall issue #135: [WIP][HIVEMALL-145] Merge Brickhouse function...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/135 @jeromebanks I'm considering merging this PR. Could you review it if possible? ---
[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/149 We need to keep the default hyperparameters of FM as they are for backward compatibility. I'll take care of it when merging. ---
[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/149 @takuti Sure. ---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/149#discussion_r191645232 --- Diff: core/src/test/java/hivemall/fm/FieldAwareFactorizationMachineUDTFTest.java --- @@ -256,6 +256,19 @@ public void testEarlyStopping() throws HiveException, IOException { cumulativeLoss > udtf._validationState.getCumulativeLoss()); } +@Test(expected = IllegalArgumentException.class) +public void testUnsupportedAdaptiveRegularizationOption() throws Exception { + TestUtils.testGenericUDTFSerialization(FieldAwareFactorizationMachineUDTF.class, +new ObjectInspector[] { +ObjectInspectorFactory.getStandardListObjectInspector( + PrimitiveObjectInspectorFactory.javaStringObjectInspector), + PrimitiveObjectInspectorFactory.javaDoubleObjectInspector, +ObjectInspectorUtils.getConstantObjectInspector( + PrimitiveObjectInspectorFactory.javaStringObjectInspector, +"-seed 43 -adaptive_regularization")}, +new Object[][] {{Arrays.asList("0:1:-2", "1:2:-1"), 1.0}}); --- End diff -- Better to compare accuracy against the default regularization. In general, it should be better than the default one. ---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/149#discussion_r191309443 --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineModel.java --- @@ -92,6 +92,14 @@ protected float getW(int i) { protected abstract void setW(@Nonnull Feature x, float nextWi); +protected void setW(int i, float nextWi) { --- End diff -- No need to have `protected void setW(int i, float nextWi)` and `protected void setW(@Nonnull String j, float nextWi)` in FactorizationMachineModel. ---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/149#discussion_r191308798 --- Diff: core/src/main/java/hivemall/fm/FMArrayModel.java --- @@ -80,6 +80,11 @@ public float getW(@Nonnull final Feature x) { @Override protected void setW(@Nonnull Feature x, float nextWi) { int i = x.getFeatureIndex(); +setW(i, nextWi); --- End diff -- better to avoid method call. ---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/149#discussion_r191309836 --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineModel.java --- @@ -92,6 +92,14 @@ protected float getW(int i) { protected abstract void setW(@Nonnull Feature x, float nextWi); +protected void setW(int i, float nextWi) { --- End diff -- `setW(int i, float nextWi)` is no longer used once caching is avoided in early stopping. ---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/149#discussion_r191298514 --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java --- @@ -379,23 +379,28 @@ protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException _model.check(x); } +protected void processValidationSample(@Nonnull final Feature[] x, final double y) +throws HiveException { +if (_adaptiveRegularization) { +trainLambda(x, y); // adaptive regularization --- End diff -- `FFM fully ignores adaptive regularization option` is expected behavior. Not tested AdaptiveRegularization with FFM and/or FTRL. ---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/149 This kind of behavior can happen often, and libffm's early stopping strategy is too aggressive. ``` 7 0.43239 0.46952 8 0.42362 0.46999 9 0.41394 0.45088 ``` ---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/149 ``` iter tr_logloss va_logloss 1 0.49738 0.48776 2 0.47383 0.47995 3 0.46366 0.47480 4 0.45561 0.47231 5 0.44810 0.47034 6 0.44037 0.47003 7 0.43239 0.46952 8 0.42362 0.46999 <- ffm stops once va_logloss has increased, but va_logloss might decrease in the next iteration 9 0.41394 0.47088 <- once ``` In the 8th iteration, mark the state as `ready to stop once va_logloss increases`. If va_logloss decreases in the 9th iteration, then continue iterating (reset the state to not ready to finish). If va_logloss increases in the 9th iteration, then emit the current model as of the 9th iteration. ---
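The stopping rule described above can be sketched as a small state machine. This is an illustration only, not the code in the PR; the class `EarlyStoppingSketch` and its methods are hypothetical names, and an unchanged loss is assumed to count as "not increased".

```java
final class EarlyStoppingSketch {
    private double prevLoss = Double.POSITIVE_INFINITY;
    private boolean readyToStop = false;

    // Feed the validation loss of one iteration; returns true when training should stop.
    boolean shouldStop(final double vaLoss) {
        final boolean increased = vaLoss > prevLoss;
        prevLoss = vaLoss;
        if (!increased) {
            readyToStop = false; // loss improved again, so keep iterating
            return false;
        }
        if (readyToStop) {
            return true; // second consecutive increase: emit the current model
        }
        readyToStop = true; // first increase: only mark as ready to stop
        return false;
    }

    // Returns the 1-based iteration at which training stops, or -1 if it never stops.
    static int stopIteration(final double[] vaLosses) {
        final EarlyStoppingSketch state = new EarlyStoppingSketch();
        for (int i = 0; i < vaLosses.length; i++) {
            if (state.shouldStop(vaLosses[i])) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

Fed with the `va_logloss` column from the table above, this rule stops at iteration 9 instead of 8. If the 9th value were lower than the 8th, as in the earlier `0.45088` example, training would continue.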
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/149#discussion_r190843344 --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java --- @@ -352,9 +352,13 @@ private static void writeBuffer(@Nonnull ByteBuffer srcBuf, @Nonnull NioStateful srcBuf.clear(); } +protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException { +_model.check(x); +} + public void train(@Nonnull final Feature[] x, final double y, final boolean adaptiveRegularization) throws HiveException { -_model.check(x); +checkInputVector(x); try { if (adaptiveRegularization) { --- End diff -- I think there is no need to share `train` if `adaptiveRegularization` is always off for FFM and `early_stopping` is always off for FM. Sharing it makes the logic in `train` complex. ---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/149#discussion_r190842171 --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java --- @@ -563,6 +580,10 @@ protected void runTrainingIteration(int iterations) throws HiveException { inputBuf.flip(); for (int iter = 2; iter <= iterations; iter++) { +if (earlyStopValidation) { --- End diff -- Better to avoid many `if (earlyStopValidation) {` checks. `_validateState` can always be non-null as long as `if(earlyStopValidation && _validateState.isLossIncreased()` is never true when early stopping is disabled. ---
[GitHub] incubator-hivemall issue #150: update conv.awk location
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/150 Merged Thanks. ---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/149 It might be better to reconsider `eta0` when enabling `l2norm` by default and enlarging `max_init_size`. In my experience with FM, the random init size should be small when the average feature dimension is large (gradients will be large). I think `1.0` is too aggressive for the default, though. `0.2` or `0.5`? Better to research other implementations. ---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/149 @takuti So then, it is better to enable l2_norm by default and provide `-disable_l2norm` to disable L2 normalization. My concern is that L2 normalization performed worse for small datasets with an adequate learning rate in `[0.1,1.0]`. FieldAwareFactorizationMachineUDTFTest contains several tests. It's better to confirm that accuracy does not degrade with the new default options, i.e., with L2 normalization enabled. ---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/149 Also, it's better to revise default `-iters` from 1 to 10 (at least 10 iterations with early stopping). ---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/149 BTW, it might be better to implement `early stopping` using validation data. https://github.com/guestwalk/libffm We can use an approach similar to the `_validationRatio` used in `FactorizationMachineUDTF` instead of preparing a separate validation dataset. ---
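A ratio-based hold-out of this kind can be sketched as follows. The class `ValidationSplitSketch` and its members are hypothetical; only the comparison of a random float against a validation ratio mirrors what the `FactorizationMachineUDTF` diff in this thread does with `_va_rand` and `_validationRatio`.

```java
import java.util.Random;

final class ValidationSplitSketch {
    private final Random rand;
    private final float validationRatio;

    ValidationSplitSketch(final long seed, final float validationRatio) {
        this.rand = new Random(seed);
        this.validationRatio = validationRatio;
    }

    // Decide per incoming example whether it is held out for validation.
    // Roughly validationRatio of all examples end up in the validation set.
    boolean isValidation() {
        return rand.nextFloat() < validationRatio;
    }
}
```

Because the generator is seeded, replaying the same input in the same order reproduces the same split, which keeps the validation loss comparable across iterations.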