[GitHub] incubator-hivemall issue #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread coveralls
Github user coveralls commented on the issue:

https://github.com/apache/incubator-hivemall/pull/116
  

[![Coverage Status](https://coveralls.io/builds/13472784/badge)](https://coveralls.io/builds/13472784)

Coverage decreased (-0.6%) to 40.508% when pulling **0b163fade6f2d26ce918211c94a78c9a3b648cbe on nzw0301:skipgram** into **1e42387576fabbb326d451f4a00ac22d57828711 on apache:master**.



---


[GitHub] incubator-hivemall issue #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread coveralls
Github user coveralls commented on the issue:

https://github.com/apache/incubator-hivemall/pull/116
  

[![Coverage Status](https://coveralls.io/builds/13472440/badge)](https://coveralls.io/builds/13472440)

Coverage decreased (-0.6%) to 40.505% when pulling **8696f5ff668adf758d3545bab5885e51ce7d053e on nzw0301:skipgram** into **1e42387576fabbb326d451f4a00ac22d57828711 on apache:master**.



---


[GitHub] incubator-hivemall issue #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread nzw0301
Github user nzw0301 commented on the issue:

https://github.com/apache/incubator-hivemall/pull/116
  
@myui I resolved the conflicts.


---


[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141556886
  
--- Diff: core/src/main/java/hivemall/embedding/SkipGramModel.java ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.embedding;
+
+import hivemall.math.random.PRNG;
+import hivemall.utils.collections.maps.Int2FloatOpenHashTable;
+
+import javax.annotation.Nonnull;
+import java.util.List;
+
+public final class SkipGramModel extends AbstractWord2VecModel {
+protected SkipGramModel(final int dim, final int win, final int neg, final int iter,
--- End diff --

Lots of hyperparameters in the constructor.

Consider using a hyperparameter class, as seen in 
https://github.com/apache/incubator-hivemall/blob/master/core/src/main/java/hivemall/fm/FMHyperParameters.java
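
A minimal sketch of what this suggestion could look like for word2vec. The class names `Word2VecHyperParameters` and `SkipGramModelSketch`, the field defaults, and the `validate()` helper are all hypothetical and are not taken from the PR or from `FMHyperParameters`; the field names mirror the constructor arguments quoted in the diff.

```java
/** Hypothetical holder for word2vec hyperparameters; names and defaults are illustrative only. */
final class Word2VecHyperParameters {
    int dim = 100;              // embedding dimension
    int win = 5;                // context window size
    int neg = 5;                // negative samples per positive pair
    int iter = 5;               // number of training iterations
    float startingLR = 0.025f;  // initial learning rate
    long numTrainWords = 0L;    // total number of training words
    boolean skipgram = true;    // true: skip-gram, false: CBoW

    /** Fails fast on values a model could not train with, then returns this object. */
    Word2VecHyperParameters validate() {
        if (dim <= 0 || win <= 0 || neg < 0 || iter <= 0) {
            throw new IllegalArgumentException("dim, win and iter must be positive; neg must be non-negative");
        }
        if (!(startingLR > 0f)) {
            throw new IllegalArgumentException("startingLR must be positive: " + startingLR);
        }
        return this;
    }
}

/** Stand-in for the model: the constructor takes one object instead of many scalars. */
final class SkipGramModelSketch {
    private final Word2VecHyperParameters params;

    SkipGramModelSketch(Word2VecHyperParameters params) {
        this.params = params.validate();
    }

    int dim() {
        return params.dim;
    }
}
```

With this shape, adding a new option touches the holder class and the option parser, while the model constructors keep a single parameter.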


---


[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141556621
  
--- Diff: core/src/main/java/hivemall/embedding/Word2VecUDTF.java ---
@@ -0,0 +1,364 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.embedding;
+
+import hivemall.UDTFWithOptions;
+import hivemall.utils.collections.IMapIterator;
+import hivemall.utils.collections.maps.Int2FloatOpenHashTable;
+import hivemall.utils.collections.maps.OpenHashTable;
+import hivemall.utils.hadoop.HiveUtils;
+import hivemall.utils.lang.Primitives;
+
+import org.apache.commons.cli.CommandLine;
+import org.apache.commons.cli.Options;
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
+import org.apache.hadoop.io.FloatWritable;
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import java.util.List;
+import java.util.Arrays;
+import java.util.ArrayList;
+
+@Description(
+name = "train_word2vec",
+value = "_FUNC_(array negative_table, array doc [, const string options]) - Returns a prediction model")
+public class Word2VecUDTF extends UDTFWithOptions {
+protected transient AbstractWord2VecModel model;
+@Nonnegative
+private float startingLR;
+@Nonnegative
+private long numTrainWords;
+private OpenHashTable word2index;
+
+@Nonnegative
+private int dim;
+@Nonnegative
+private int win;
+@Nonnegative
+private int neg;
+@Nonnegative
+private int iter;
+private boolean skipgram;
+private boolean isStringInput;
+
+private Int2FloatOpenHashTable S;
+private int[] aliasWordIds;
+
+private ListObjectInspector negativeTableOI;
+private ListObjectInspector negativeTableElementListOI;
+private PrimitiveObjectInspector negativeTableElementOI;
+
+private ListObjectInspector docOI;
+private PrimitiveObjectInspector wordOI;
+
+@Override
+public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
+final int numArgs = argOIs.length;
+
+if (numArgs != 3) {
+throw new UDFArgumentException(getClass().getSimpleName()
++ " takes 3 arguments:  [, constant string options]: "
++ Arrays.toString(argOIs));
+}
+
+processOptions(argOIs);
+
+this.negativeTableOI = HiveUtils.asListOI(argOIs[0]);
+this.negativeTableElementListOI = HiveUtils.asListOI(negativeTableOI.getListElementObjectInspector());
+this.docOI = HiveUtils.asListOI(argOIs[1]);
+
+this.isStringInput = HiveUtils.isStringListOI(argOIs[1]);
+
+if (isStringInput) {
+this.negativeTableElementOI = HiveUtils.asStringOI(negativeTableElementListOI.getListElementObjectInspector());
+this.wordOI = HiveUtils.asStringOI(docOI.getListElementObjectInspector());
+} else {
+this.negativeTableElementOI = 

[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141556391
  
--- Diff: core/src/main/java/hivemall/embedding/AbstractWord2VecModel.java ---
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.embedding;
+
+import hivemall.math.random.PRNG;
+import hivemall.math.random.RandomNumberGeneratorFactory;
+import hivemall.utils.collections.maps.Int2FloatOpenHashTable;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import java.util.List;
+
+public abstract class AbstractWord2VecModel {
+// cached sigmoid function parameters
+protected static final int MAX_SIGMOID = 6;
+protected static final int SIGMOID_TABLE_SIZE = 1000;
+protected float[] sigmoidTable;
+
+
+@Nonnegative
+protected int dim;
+protected int win;
+protected int neg;
+protected int iter;
+
+// learning rate parameters
+@Nonnegative
+protected float lr;
+@Nonnegative
+private float startingLR;
+@Nonnegative
+private long numTrainWords;
+@Nonnegative
+protected long wordCount;
+@Nonnegative
+private long lastWordCount;
+
+protected PRNG rnd;
+
+protected Int2FloatOpenHashTable contextWeights;
+protected Int2FloatOpenHashTable inputWeights;
+protected Int2FloatOpenHashTable S;
+protected int[] aliasWordId;
+
+protected AbstractWord2VecModel(final int dim, final int win, final int neg, final int iter,
+final float startingLR, final long numTrainWords, final Int2FloatOpenHashTable S,
+final int[] aliasWordId) {
+this.win = win;
+this.neg = neg;
+this.iter = iter;
+this.dim = dim;
+this.startingLR = this.lr = startingLR;
+this.numTrainWords = numTrainWords;
+
+// alias sampler for negative sampling
+this.S = S;
+this.aliasWordId = aliasWordId;
+
+this.wordCount = 0L;
+this.lastWordCount = 0L;
+this.rnd = RandomNumberGeneratorFactory.createPRNG(1001);
+
+this.sigmoidTable = initSigmoidTable();
+
+// TODO how to estimate size
+this.inputWeights = new Int2FloatOpenHashTable(10578 * dim);
--- End diff --

A power of two (2^n) or `1024 * 10` would be easier to understand.
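
One way to act on this, sketched under assumptions: replace the magic number with a named constant and round the requested capacity up to a power of two. The class `TableSizing`, the constant `DEFAULT_VOCAB_SIZE`, and both helper methods are hypothetical names for illustration; they are not part of Hivemall, and the default of `10 * 1024` words is a placeholder rather than a measured estimate.

```java
/** Hypothetical sizing helper; not part of Hivemall. */
final class TableSizing {
    /** Assumed default vocabulary size (illustrative placeholder, not a measured value). */
    static final int DEFAULT_VOCAB_SIZE = 10 * 1024;

    private TableSizing() {}

    /** Returns the smallest power of two that is >= n, for 1 <= n <= 2^30. */
    static int nextPowerOfTwo(int n) {
        if (n < 1 || n > (1 << 30)) {
            throw new IllegalArgumentException("n out of range: " + n);
        }
        int high = Integer.highestOneBit(n);
        return (high == n) ? n : high << 1;
    }

    /** Initial capacity for a weight table that stores dim floats per word. */
    static int initialCapacity(int vocabSize, int dim) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return nextPowerOfTwo(Math.multiplyExact(vocabSize, dim));
    }
}
```

A call such as `TableSizing.initialCapacity(TableSizing.DEFAULT_VOCAB_SIZE, dim)` then documents where the number comes from, even while the TODO about estimating the real size remains open.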


---


[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread nzw0301
Github user nzw0301 commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141553131
  
--- Diff: core/src/main/java/hivemall/embedding/Word2VecUDTF.java ---
@@ -0,0 +1,364 @@

[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread nzw0301
Github user nzw0301 commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141551510
  
--- Diff: core/src/main/java/hivemall/embedding/AbstractWord2VecModel.java ---
@@ -0,0 +1,125 @@
+// TODO how to estimate size
+this.inputWeights = new Int2FloatOpenHashTable(10578 * dim);
--- End diff --

There is no particular reason for `10578`.


---


[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141550040
  
--- Diff: core/src/main/java/hivemall/embedding/AbstractWord2VecModel.java ---
@@ -0,0 +1,125 @@
+// TODO how to estimate size
+this.inputWeights = new Int2FloatOpenHashTable(10578 * dim);
--- End diff --

What's `10578`?


---


[GitHub] incubator-hivemall issue #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/116
  
@nzw0301 Please rebase onto master and resolve the conflicts.


---


[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141547369
  
--- Diff: core/src/main/java/hivemall/embedding/Word2VecUDTF.java ---
@@ -0,0 +1,364 @@

[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141543986
  
--- Diff: core/src/main/java/hivemall/embedding/Word2VecUDTF.java ---
@@ -0,0 +1,364 @@

[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141545337
  
--- Diff: docs/gitbook/embedding/word2vec.md ---
@@ -0,0 +1,399 @@
+
+
+Word embedding is a powerful tool for many tasks,
+e.g. finding similar words,
+building feature vectors for supervised machine learning tasks, and solving word analogies
+such as `king - man + woman =~ queen`.
+In word embedding,
+each word is represented by a low-dimensional dense vector.
+**Skip-Gram** and **Continuous Bag-of-words** (CBoW) are the most popular algorithms to obtain good word embeddings (a.k.a. word2vec).
+
+The papers that introduce these methods are as follows:
+
+- T. Mikolov, et al., [Distributed Representations of Words and Phrases and Their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). NIPS, 2013.
+- T. Mikolov, et al., [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781). ICLR, 2013.
+
+Hivemall provides two algorithms: Skip-gram and CBoW, both with negative sampling.
+Hivemall enables you to train word2vec on sequence data such as,
+but not limited to, documents.
+This article explains how to use the feature.
+
+
+
+>  Note
+> This feature is supported from Hivemall v0.5-rc.? or later.
+
+# Prepare document data
+
+Assume that you already have a `docs` table that contains many documents in string format, each with a unique index:
+
+```sql
+select * FROM docs;
+```
+
+| docId | doc |
+|:-----:|:----|
+|   0   | "Alice was beginning to get very tired of sitting by her sister on the bank ..." |
+|  ...  | ... |
+
+First, split each document into words with a tokenizer function such as [`tokenize`](../misc/tokenizer.html).
+
+```sql
+drop table docs_words;
+create table docs_words as
+  select
+    docid,
+    tokenize(doc, true) as words
+  FROM
+    docs
+;
+```
+
+The resulting table contains the tokenized documents.
+
+| docId | words |
+|:-----:|:------|
+|   0   | ["alice", "was", "beginning", "to", "get", "very", "tired", "of", "sitting", "by", "her", "sister", "on", "the", "bank", ...] |
+|  ...  | ... |
+
+Then, count the frequency of each word and remove low-frequency words from the vocabulary.
+Removing low-frequency words is optional preprocessing, but it speeds up training of the word vectors.
+
+```sql
+set hivevar:mincount=5;
+
+drop table freq;
+create table freq as
+select
+  row_number() over () - 1 as wordid,
+  word,
+  freq
+from (
+  select
+    word,
+    COUNT(*) as freq
+  from
+    docs_words
+  LATERAL VIEW explode(words) lTable as word
+  group by
+    word
+) t
+where freq >= ${mincount}
+;
+```
+
+Hivemall's word2vec supports two word types: string and int.
+The string type tends to use a lot of memory during training,
+whereas the int type tends to use less.
+For a small dataset, we recommend the string type,
+because memory usage is negligible and the HiveQL is simpler.
+For a large dataset, we recommend the int type,
+because it saves memory during training.
+
+# Create sub-sampling table
+
+The sub-sampling table stores a sub-sampling probability for each word.
+
+The sub-sampling probability of word $$w_i$$ is computed by the following equation:
+
+$$
+\begin{aligned}
+f(w_i) = \sqrt{\frac{\mathrm{sample}}{freq(w_i)/\sum freq(w)}} + \frac{\mathrm{sample}}{freq(w_i)/\sum freq(w)}
+\end{aligned}
+$$
+
+During word2vec training,
--- End diff --

remove line break after `,`.
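
For reference, the sub-sampling formula quoted in the diff above can be checked numerically with a small sketch. The class and method names below are hypothetical and are not part of Hivemall; the only thing taken from the document is the formula `sqrt(sample / r) + sample / r` with `r = freq(w_i) / sum(freq)`.

```java
// Sketch only: SubSamplingSketch and keepProbability are hypothetical names,
// not Hivemall's API. The formula itself is the one quoted in the docs diff.
final class SubSamplingSketch {

    /** Returns sqrt(sample / r) + sample / r, where r = freq / totalFreq. */
    static double keepProbability(long freq, long totalFreq, double sample) {
        if (freq <= 0 || totalFreq <= 0) {
            throw new IllegalArgumentException("frequencies must be positive");
        }
        final double ratio = (double) freq / totalFreq;
        return Math.sqrt(sample / ratio) + sample / ratio;
    }

    public static void main(String[] args) {
        final double sample = 1e-4;
        // A word that makes up 1% of the corpus gets a value of about 0.11.
        System.out.println(keepProbability(1_000L, 100_000L, sample));
        // A word that occurs once in 100,000 gets a value above 1,
        // i.e. it is never dropped.
        System.out.println(keepProbability(1L, 100_000L, sample));
    }
}
```

Values above 1 simply mean that the word is always kept, so only frequent words are affected by sub-sampling.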


---


[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141544782
  
--- Diff: core/src/main/java/hivemall/embedding/Word2VecUDTF.java ---
@@ -0,0 +1,364 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.embedding;
+
+import hivemall.UDTFWithOptions;
+import hivemall.utils.collections.IMapIterator;
+import hivemall.utils.collections.maps.Int2FloatOpenHashTable;
+import hivemall.utils.collections.maps.OpenHashTable;
+import hivemall.utils.hadoop.HiveUtils;
+import hivemall.utils.lang.Primitives;
+
+import org.apache.commons.cli.CommandLine;
+import org.apache.commons.cli.Options;
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import 
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
+import 
org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+import 
org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
+import org.apache.hadoop.io.FloatWritable;
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import java.util.List;
+import java.util.Arrays;
+import java.util.ArrayList;
+
+@Description(
+name = "train_word2vec",
+value = "_FUNC_(array negative_table, 
array doc [, const string options]) - Returns a prediction model")
+public class Word2VecUDTF extends UDTFWithOptions {
+protected transient AbstractWord2VecModel model;
+@Nonnegative
+private float startingLR;
+@Nonnegative
+private long numTrainWords;
+private OpenHashTable word2index;
+
+@Nonnegative
+private int dim;
+@Nonnegative
+private int win;
+@Nonnegative
+private int neg;
+@Nonnegative
+private int iter;
+private boolean skipgram;
+private boolean isStringInput;
+
+private Int2FloatOpenHashTable S;
+private int[] aliasWordIds;
+
+private ListObjectInspector negativeTableOI;
+private ListObjectInspector negativeTableElementListOI;
+private PrimitiveObjectInspector negativeTableElementOI;
+
+private ListObjectInspector docOI;
+private PrimitiveObjectInspector wordOI;
+
+@Override
+public StructObjectInspector initialize(ObjectInspector[] argOIs) 
throws UDFArgumentException {
+final int numArgs = argOIs.length;
+
+if (numArgs != 3) {
+throw new UDFArgumentException(getClass().getSimpleName()
++ " takes 3 arguments:  [, constant string options]: "
++ Arrays.toString(argOIs));
+}
+
+processOptions(argOIs);
+
+this.negativeTableOI = HiveUtils.asListOI(argOIs[0]);
+this.negativeTableElementListOI = 
HiveUtils.asListOI(negativeTableOI.getListElementObjectInspector());
+this.docOI = HiveUtils.asListOI(argOIs[1]);
+
+this.isStringInput = HiveUtils.isStringListOI(argOIs[1]);
+
+if (isStringInput) {
+this.negativeTableElementOI = 
HiveUtils.asStringOI(negativeTableElementListOI.getListElementObjectInspector());
+this.wordOI = 
HiveUtils.asStringOI(docOI.getListElementObjectInspector());
+} else {
+this.negativeTableElementOI = 

[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141546893
  
--- Diff: core/src/main/java/hivemall/embedding/CBoWModel.java ---
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.embedding;
+
+import hivemall.math.random.PRNG;
+import hivemall.utils.collections.maps.Int2FloatOpenHashTable;
+
+import javax.annotation.Nonnull;
+import java.util.List;
+
+public final class CBoWModel extends AbstractWord2VecModel {
+protected CBoWModel(final int dim, final int win, final int neg, final 
int iter,
+final float startingLR, final long numTrainWords, final 
Int2FloatOpenHashTable S,
+final int[] aliasWordId) {
+super(dim, win, neg, iter, startingLR, numTrainWords, S, 
aliasWordId);
+}
+
+protected void trainOnDoc(@Nonnull final int[] doc) {
+final int vecDim = dim;
+final int numNegative = neg;
+final PRNG _rnd = rnd;
+final Int2FloatOpenHashTable _S = S;
+final int[] _aliasWordId = aliasWordId;
+float label, gradient;
+
+// reuse instance
+int windowSize, k, numContext, targetWord, inWord, positiveWord;
+
+updateLearningRate();
+
+int docLength = doc.length;
--- End diff --

`final int docLength`


---


[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141543209
  
--- Diff: core/src/main/java/hivemall/embedding/AbstractWord2VecModel.java ---
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.embedding;
+
+import hivemall.math.random.PRNG;
+import hivemall.math.random.RandomNumberGeneratorFactory;
+import hivemall.utils.collections.maps.Int2FloatOpenHashTable;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import java.util.List;
+
+public abstract class AbstractWord2VecModel {
+// cached sigmoid function parameters
+protected static final int MAX_SIGMOID = 6;
+protected static final int SIGMOID_TABLE_SIZE = 1000;
+protected float[] sigmoidTable;
+
+
+@Nonnegative
+protected int dim;
+protected int win;
+protected int neg;
+protected int iter;
+
+// learning rate parameters
+@Nonnegative
+protected float lr;
+@Nonnegative
+private float startingLR;
+@Nonnegative
+private long numTrainWords;
+@Nonnegative
+protected long wordCount;
+@Nonnegative
+private long lastWordCount;
+
+protected PRNG rnd;
+
+protected Int2FloatOpenHashTable contextWeights;
+protected Int2FloatOpenHashTable inputWeights;
+protected Int2FloatOpenHashTable S;
+protected int[] aliasWordId;
+
+protected AbstractWord2VecModel(final int dim, final int win, final 
int neg, final int iter,
+final float startingLR, final long numTrainWords, final 
Int2FloatOpenHashTable S,
+final int[] aliasWordId) {
+this.win = win;
+this.neg = neg;
+this.iter = iter;
+this.dim = dim;
+this.startingLR = this.lr = startingLR;
+this.numTrainWords = numTrainWords;
+
+// alias sampler for negative sampling
+this.S = S;
+this.aliasWordId = aliasWordId;
+
+this.wordCount = 0L;
+this.lastWordCount = 0L;
+this.rnd = RandomNumberGeneratorFactory.createPRNG(1001);
+
+this.sigmoidTable = initSigmoidTable();
+
+// TODO how to estimate size
+this.inputWeights = new Int2FloatOpenHashTable(10578 * dim);
+this.inputWeights.defaultReturnValue(0.f);
+this.contextWeights = new Int2FloatOpenHashTable(10578 * dim);
+this.contextWeights.defaultReturnValue(0.f);
+}
+
+private static float[] initSigmoidTable() {
+float[] sigmoidTable = new float[SIGMOID_TABLE_SIZE];
+for (int i = 0; i < SIGMOID_TABLE_SIZE; i++) {
+float x = ((float) i / SIGMOID_TABLE_SIZE * 2 - 1) * (float) 
MAX_SIGMOID;
+sigmoidTable[i] = 1.f / ((float) Math.exp(-x) + 1.f);
+}
+return sigmoidTable;
+}
+
+protected void initWordWeights(final int wordId) {
+for (int i = 0; i < dim; i++) {
+inputWeights.put(wordId * dim + i, ((float) rnd.nextDouble() - 
0.5f) / dim);
+}
+}
+
+protected static float sigmoid(final float v, final float[] 
sigmoidTable) {
--- End diff --

`@Nonnull` for sigmoidTable
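
A minimal sketch of how a cached-sigmoid lookup with a non-null table might look. The constants and `initSigmoidTable` follow the quoted diff; the body of `sigmoid` is not visible in the diff, so the lookup and clamping logic here is an assumption. `javax.annotation.Nonnull` is not part of the JDK, so the sketch uses `Objects.requireNonNull` as a stand-in for the annotation.

```java
import java.util.Objects;

// Sketch only: the lookup logic in sigmoid() is an assumption, since the
// method body is cut off in the quoted diff.
final class SigmoidTableSketch {
    static final int MAX_SIGMOID = 6;
    static final int SIGMOID_TABLE_SIZE = 1000;

    // Same table construction as in the quoted diff: entry i holds
    // sigmoid(x) for x spread evenly over [-MAX_SIGMOID, MAX_SIGMOID).
    static float[] initSigmoidTable() {
        final float[] table = new float[SIGMOID_TABLE_SIZE];
        for (int i = 0; i < SIGMOID_TABLE_SIZE; i++) {
            final float x = ((float) i / SIGMOID_TABLE_SIZE * 2 - 1) * (float) MAX_SIGMOID;
            table[i] = 1.f / ((float) Math.exp(-x) + 1.f);
        }
        return table;
    }

    static float sigmoid(final float v, final float[] sigmoidTable) {
        // Runtime stand-in for the @Nonnull annotation suggested in the review.
        Objects.requireNonNull(sigmoidTable, "sigmoidTable");
        if (v >= MAX_SIGMOID) {
            return 1.f; // saturate on the right
        }
        if (v <= -MAX_SIGMOID) {
            return 0.f; // saturate on the left
        }
        // Map v in (-MAX, MAX) to a table index in [0, SIZE).
        final int index = (int) ((v + MAX_SIGMOID) / (2 * MAX_SIGMOID) * SIGMOID_TABLE_SIZE);
        // Guard against float rounding pushing the index to SIZE.
        return sigmoidTable[Math.min(index, SIGMOID_TABLE_SIZE - 1)];
    }
}
```

With 1000 entries over a range of 12, neighbouring table entries differ by at most about 0.003, which is the precision this kind of lookup trades for speed.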


---


[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141545257
  
--- Diff: docs/gitbook/embedding/word2vec.md ---
@@ -0,0 +1,399 @@
+
+
+Word embedding is a powerful tool for many tasks,
+e.g., finding similar words,
+building feature vectors for supervised machine learning tasks, and solving word analogies
+such as `king - man + woman =~ queen`.
+In word embedding,
+each word is represented by a low-dimensional, dense vector.
+**Skip-Gram** and **Continuous Bag-of-words** (CBoW) are the most popular algorithms for obtaining good word embeddings (a.k.a. word2vec).
+
+The method is introduced in the following papers:
+
+- T. Mikolov, et al., [Distributed Representations of Words and Phrases and Their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). NIPS, 2013.
+- T. Mikolov, et al., [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781). ICLR, 2013.
+
+Hivemall provides two algorithms: Skip-gram and CBoW with negative sampling.
+With them you can train word2vec embeddings on sequence data such as,
+but not limited to, documents.
+This article explains how to use the feature.
+
+
+
+>  Note
+> This feature is supported in Hivemall v0.5-rc.? or later.
+
+# Prepare document data
+
+Assume that you already have a `docs` table that stores each document as a string with a unique index:
+
+```sql
+select * FROM docs;
+```
+
+| docId | doc |
+|:---:|:---|
+|   0   | "Alice was beginning to get very tired of sitting by her sister on the bank ..." |
+|  ...  | ... |
+
+First, split each document into words with a tokenizer function such as [`tokenize`](../misc/tokenizer.html).
+
+```sql
+drop table docs_words;
+create table docs_words as
+  select
+docid,
+tokenize(doc, true) as words
+  FROM
+docs
+;
+```
+
+The resulting table holds the tokenized documents:
+
+| docId | words |
+|:---:|:---|
+|   0   | ["alice", "was", "beginning", "to", "get", "very", "tired", "of", "sitting", "by", "her", "sister", "on", "the", "bank", ...] |
+|  ...  | ... |
+
+Then, count the frequency of each word and remove low-frequency words from the vocabulary.
+Removing low-frequency words is optional, but it speeds up training of the word vectors.
+
+```sql
+set hivevar:mincount=5;
+
+drop table freq;
+create table freq as
+select
+  row_number() over () - 1 as wordid,
+  word,
+  freq
+from (
+  select
+word,
+COUNT(*) as freq
+  from
+docs_words
+  LATERAL VIEW explode(words) lTable as word
+  group by
+word
+) t
+where freq >= ${mincount}
+;
+```
+
+Hivemall's word2vec supports two word types: string and int.
+The string type tends to use a lot of memory during training,
+whereas the int type tends to use less.
+For a small dataset, we recommend the string type,
+because memory usage is negligible and the HiveQL is simpler.
+For a large dataset, we recommend the int type,
+because it saves memory during training.
+
+# Create sub-sampling table
+
+The sub-sampling table stores a sub-sampling probability for each word.
+
+The sub-sampling probability of word $$w_i$$ is computed by the following equation:
+
+$$
+\begin{aligned}
+f(w_i) = \sqrt{\frac{\mathrm{sample}}{freq(w_i)/\sum freq(w)}} + \frac{\mathrm{sample}}{freq(w_i)/\sum freq(w)}
+\end{aligned}
+$$
+
+During word2vec training,
+words that are not sub-sampled are ignored.
+This speeds up training and reduces the imbalance between rare and frequent words by dropping frequent words more often.
+The smaller the `sample` value,
+the fewer words are used during training.
+
+```sql
+set hivevar:sample=1e-4;
+
+drop table subsampling_table;
+create table subsampling_table as
+with stats as (
+  select
+sum(freq) as numTrainWords
+  FROM
+freq
+)
+select
+  l.wordid,
+  l.word,
+  sqrt(${sample}/(l.freq/r.numTrainWords)) + 
${sample}/(l.freq/r.numTrainWords) as p
+from
+  freq l
+cross join
+  stats r
+;
+```
+
+```sql
+select * FROM subsampling_table order by p;
+```
+
+| wordid | word | p |
+|:---:|:---:|:---:|
+| 48645 | the  | 0.04013665|
+| 11245 | of   | 0.052463654|
+| 16368 | and  | 0.0638|
+| 61938 | 00   | 0.068162076|
+| 19977 | in   | 0.071441144|
+| 83599 | 0| 0.07528994|
+| 95017 | a| 0.07559573|
+| 

[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141544983
  
--- Diff: docs/gitbook/embedding/word2vec.md ---

[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141543643
  
--- Diff: core/src/main/java/hivemall/embedding/CBoWModel.java ---
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.embedding;
+
+import hivemall.math.random.PRNG;
+import hivemall.utils.collections.maps.Int2FloatOpenHashTable;
+
+import javax.annotation.Nonnull;
+import java.util.List;
+
+public final class CBoWModel extends AbstractWord2VecModel {
+protected CBoWModel(final int dim, final int win, final int neg, final 
int iter,
+final float startingLR, final long numTrainWords, final 
Int2FloatOpenHashTable S,
+final int[] aliasWordId) {
+super(dim, win, neg, iter, startingLR, numTrainWords, S, 
aliasWordId);
+}
+
+protected void trainOnDoc(@Nonnull final int[] doc) {
+final int vecDim = dim;
+final int numNegative = neg;
+final PRNG _rnd = rnd;
+final Int2FloatOpenHashTable _S = S;
+final int[] _aliasWordId = aliasWordId;
+float label, gradient;
+
+// reuse instance
+int windowSize, k, numContext, targetWord, inWord, positiveWord;
+
+updateLearningRate();
+
+int docLength = doc.length;
+for (int t = 0; t < iter; t++) {
+for (int positiveWordPosition = 0; positiveWordPosition < 
docLength; positiveWordPosition++) {
+windowSize = _rnd.nextInt(win) + 1;
+
+numContext = windowSize * 2 + Math.min(0, 
positiveWordPosition - windowSize)
++ Math.min(0, docLength - positiveWordPosition - 
windowSize - 1);
+
+float[] gradVec = new float[vecDim];
--- End diff --

add `final` for `gradVec` and `averageVec`.
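
As a side note on the quoted loop, the `numContext` expression can be sanity-checked in isolation. The sketch below copies that expression and compares it with a brute-force count of in-range context positions; the class and the `countByLoop` helper are hypothetical and exist only for this check.

```java
// Sketch only: ContextWindowSketch and countByLoop are hypothetical helpers.
// numContext() copies the expression from the quoted CBoWModel diff.
final class ContextWindowSketch {

    /** Number of context words around pos, as computed in the quoted diff. */
    static int numContext(final int docLength, final int pos, final int windowSize) {
        return windowSize * 2 + Math.min(0, pos - windowSize)
                + Math.min(0, docLength - pos - windowSize - 1);
    }

    /** Brute-force reference: count in-range positions within the window, excluding pos. */
    static int countByLoop(final int docLength, final int pos, final int windowSize) {
        int n = 0;
        for (int i = pos - windowSize; i <= pos + windowSize; i++) {
            if (i != pos && i >= 0 && i < docLength) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // At the start of a 10-word document with window 3, only the
        // three words to the right are available as context.
        System.out.println(numContext(10, 0, 3));
        System.out.println(countByLoop(10, 0, 3));
    }
}
```

The two `Math.min(0, ...)` terms subtract the part of the window that falls off the left and right ends of the document, which is why both methods agree for every position.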


---


[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141546846
  
--- Diff: core/src/main/java/hivemall/embedding/CBoWModel.java ---
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.embedding;
+
+import hivemall.math.random.PRNG;
+import hivemall.utils.collections.maps.Int2FloatOpenHashTable;
+
+import javax.annotation.Nonnull;
+import java.util.List;
+
+public final class CBoWModel extends AbstractWord2VecModel {
+protected CBoWModel(final int dim, final int win, final int neg, final 
int iter,
--- End diff --

add a blank line before constructor.


---


[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141545219
  
--- Diff: docs/gitbook/embedding/word2vec.md ---

[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141547506
  
--- Diff: core/src/main/java/hivemall/embedding/Word2VecUDTF.java ---
@@ -0,0 +1,364 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.embedding;
+
+import hivemall.UDTFWithOptions;
+import hivemall.utils.collections.IMapIterator;
+import hivemall.utils.collections.maps.Int2FloatOpenHashTable;
+import hivemall.utils.collections.maps.OpenHashTable;
+import hivemall.utils.hadoop.HiveUtils;
+import hivemall.utils.lang.Primitives;
+
+import org.apache.commons.cli.CommandLine;
+import org.apache.commons.cli.Options;
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import 
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
+import 
org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+import 
org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
+import org.apache.hadoop.io.FloatWritable;
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import java.util.List;
+import java.util.Arrays;
+import java.util.ArrayList;
+
+@Description(
+name = "train_word2vec",
+value = "_FUNC_(array negative_table, 
array doc [, const string options]) - Returns a prediction model")
+public class Word2VecUDTF extends UDTFWithOptions {
+protected transient AbstractWord2VecModel model;
+@Nonnegative
+private float startingLR;
+@Nonnegative
+private long numTrainWords;
+private OpenHashTable word2index;
+
+@Nonnegative
+private int dim;
+@Nonnegative
+private int win;
+@Nonnegative
+private int neg;
+@Nonnegative
+private int iter;
+private boolean skipgram;
+private boolean isStringInput;
+
+private Int2FloatOpenHashTable S;
+private int[] aliasWordIds;
+
+private ListObjectInspector negativeTableOI;
+private ListObjectInspector negativeTableElementListOI;
+private PrimitiveObjectInspector negativeTableElementOI;
+
+private ListObjectInspector docOI;
+private PrimitiveObjectInspector wordOI;
+
+@Override
+public StructObjectInspector initialize(ObjectInspector[] argOIs) 
throws UDFArgumentException {
+final int numArgs = argOIs.length;
+
+if (numArgs != 3) {
+throw new UDFArgumentException(getClass().getSimpleName()
++ " takes 3 arguments:  [, constant string options]: "
++ Arrays.toString(argOIs));
+}
+
+processOptions(argOIs);
+
+this.negativeTableOI = HiveUtils.asListOI(argOIs[0]);
+this.negativeTableElementListOI = HiveUtils.asListOI(negativeTableOI.getListElementObjectInspector());
+this.docOI = HiveUtils.asListOI(argOIs[1]);
+
+this.isStringInput = HiveUtils.isStringListOI(argOIs[1]);
+
+if (isStringInput) {
+this.negativeTableElementOI = HiveUtils.asStringOI(negativeTableElementListOI.getListElementObjectInspector());
+this.wordOI = HiveUtils.asStringOI(docOI.getListElementObjectInspector());
+} else {
+this.negativeTableElementOI = 

[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141547708
  
--- Diff: core/src/main/java/hivemall/embedding/Word2VecUDTF.java ---
@@ -0,0 +1,364 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.embedding;
+
+import hivemall.UDTFWithOptions;
+import hivemall.utils.collections.IMapIterator;
+import hivemall.utils.collections.maps.Int2FloatOpenHashTable;
+import hivemall.utils.collections.maps.OpenHashTable;
+import hivemall.utils.hadoop.HiveUtils;
+import hivemall.utils.lang.Primitives;
+
+import org.apache.commons.cli.CommandLine;
+import org.apache.commons.cli.Options;
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
+import org.apache.hadoop.io.FloatWritable;
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import java.util.List;
+import java.util.Arrays;
+import java.util.ArrayList;
+
+@Description(
+name = "train_word2vec",
+value = "_FUNC_(array negative_table, array doc [, const string options]) - Returns a prediction model")
+public class Word2VecUDTF extends UDTFWithOptions {
+protected transient AbstractWord2VecModel model;
+@Nonnegative
+private float startingLR;
+@Nonnegative
+private long numTrainWords;
+private OpenHashTable word2index;
+
+@Nonnegative
+private int dim;
+@Nonnegative
+private int win;
+@Nonnegative
+private int neg;
+@Nonnegative
+private int iter;
+private boolean skipgram;
+private boolean isStringInput;
+
+private Int2FloatOpenHashTable S;
+private int[] aliasWordIds;
+
+private ListObjectInspector negativeTableOI;
+private ListObjectInspector negativeTableElementListOI;
+private PrimitiveObjectInspector negativeTableElementOI;
+
+private ListObjectInspector docOI;
+private PrimitiveObjectInspector wordOI;
+
+@Override
+public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
+final int numArgs = argOIs.length;
+
+if (numArgs != 3) {
+throw new UDFArgumentException(getClass().getSimpleName()
++ " takes 3 arguments:  [, constant string options]: "
++ Arrays.toString(argOIs));
+}
+
+processOptions(argOIs);
+
+this.negativeTableOI = HiveUtils.asListOI(argOIs[0]);
+this.negativeTableElementListOI = HiveUtils.asListOI(negativeTableOI.getListElementObjectInspector());
+this.docOI = HiveUtils.asListOI(argOIs[1]);
+
+this.isStringInput = HiveUtils.isStringListOI(argOIs[1]);
+
+if (isStringInput) {
+this.negativeTableElementOI = HiveUtils.asStringOI(negativeTableElementListOI.getListElementObjectInspector());
+this.wordOI = HiveUtils.asStringOI(docOI.getListElementObjectInspector());
+} else {
+this.negativeTableElementOI = 

[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141545448
  
--- Diff: core/src/main/java/hivemall/embedding/Word2VecUDTF.java ---
@@ -0,0 +1,364 @@

[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141543514
  
--- Diff: core/src/main/java/hivemall/embedding/CBoWModel.java ---
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.embedding;
+
+import hivemall.math.random.PRNG;
+import hivemall.utils.collections.maps.Int2FloatOpenHashTable;
+
+import javax.annotation.Nonnull;
+import java.util.List;
+
+public final class CBoWModel extends AbstractWord2VecModel {
+protected CBoWModel(final int dim, final int win, final int neg, final int iter,
+final float startingLR, final long numTrainWords, final Int2FloatOpenHashTable S,
+final int[] aliasWordId) {
+super(dim, win, neg, iter, startingLR, numTrainWords, S, aliasWordId);
+}
+
+protected void trainOnDoc(@Nonnull final int[] doc) {
+final int vecDim = dim;
+final int numNegative = neg;
+final PRNG _rnd = rnd;
+final Int2FloatOpenHashTable _S = S;
--- End diff --

Member variable should be `_S` and local variable should be `S`.

`_rnd`, `_aliasWordId` as well.
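
A minimal sketch of the naming convention suggested here, not the PR's actual code. `NamingSketch` and `sumAliasProbabilities` are hypothetical names, and a plain `float[]` stands in for `Int2FloatOpenHashTable` so the snippet compiles on its own. The fields carry the underscore prefix and the method caches them in unprefixed `final` locals:

```java
// Hypothetical stand-in for the model classes in this PR; only the naming
// convention is the point, not the training logic.
public class NamingSketch {
    // Member variables carry the underscore prefix.
    private final float[] _S;
    private final int[] _aliasWordId;

    public NamingSketch(final float[] S, final int[] aliasWordId) {
        this._S = S;
        this._aliasWordId = aliasWordId;
    }

    // Sums S over the alias word ids, reading each field once into a local.
    public float sumAliasProbabilities() {
        // Local copies drop the prefix.
        final float[] S = _S;
        final int[] aliasWordId = _aliasWordId;
        float sum = 0.f;
        for (int id : aliasWordId) {
            sum += S[id];
        }
        return sum;
    }
}
```

With this layout the prefix marks which names are object state, and the hot loop only touches locals.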


---


[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141546656
  
--- Diff: core/src/main/java/hivemall/embedding/AbstractWord2VecModel.java ---
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.embedding;
+
+import hivemall.math.random.PRNG;
+import hivemall.math.random.RandomNumberGeneratorFactory;
+import hivemall.utils.collections.maps.Int2FloatOpenHashTable;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import java.util.List;
+
+public abstract class AbstractWord2VecModel {
+// cached sigmoid function parameters
+protected static final int MAX_SIGMOID = 6;
+protected static final int SIGMOID_TABLE_SIZE = 1000;
+protected float[] sigmoidTable;
+
+
+@Nonnegative
+protected int dim;
+protected int win;
+protected int neg;
+protected int iter;
+
+// learning rate parameters
+@Nonnegative
+protected float lr;
+@Nonnegative
+private float startingLR;
+@Nonnegative
+private long numTrainWords;
+@Nonnegative
+protected long wordCount;
+@Nonnegative
+private long lastWordCount;
+
+protected PRNG rnd;
+
+protected Int2FloatOpenHashTable contextWeights;
+protected Int2FloatOpenHashTable inputWeights;
+protected Int2FloatOpenHashTable S;
+protected int[] aliasWordId;
+
+protected AbstractWord2VecModel(final int dim, final int win, final int neg, final int iter,
+final float startingLR, final long numTrainWords, final Int2FloatOpenHashTable S,
+final int[] aliasWordId) {
+this.win = win;
+this.neg = neg;
+this.iter = iter;
+this.dim = dim;
+this.startingLR = this.lr = startingLR;
+this.numTrainWords = numTrainWords;
+
+// alias sampler for negative sampling
+this.S = S;
+this.aliasWordId = aliasWordId;
+
+this.wordCount = 0L;
+this.lastWordCount = 0L;
+this.rnd = RandomNumberGeneratorFactory.createPRNG(1001);
+
+this.sigmoidTable = initSigmoidTable();
+
+// TODO how to estimate size
+this.inputWeights = new Int2FloatOpenHashTable(10578 * dim);
+this.inputWeights.defaultReturnValue(0.f);
+this.contextWeights = new Int2FloatOpenHashTable(10578 * dim);
+this.contextWeights.defaultReturnValue(0.f);
+}
+
+private static float[] initSigmoidTable() {
--- End diff --

`@Nonnull` for return
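
A minimal sketch of what the annotated method could look like. The loop body is taken from the quoted diff; the wrapper class `SigmoidTableSketch` and the nested `Nonnull` annotation are assumptions, added only so the snippet compiles without the JSR-305 jar (the PR itself imports `javax.annotation.Nonnull`):

```java
public class SigmoidTableSketch {
    // Stand-in for javax.annotation.Nonnull so this sketch is self-contained.
    @interface Nonnull {}

    static final int MAX_SIGMOID = 6;
    static final int SIGMOID_TABLE_SIZE = 1000;

    // The annotation documents that callers never receive null.
    @Nonnull
    public static float[] initSigmoidTable() {
        final float[] sigmoidTable = new float[SIGMOID_TABLE_SIZE];
        for (int i = 0; i < SIGMOID_TABLE_SIZE; i++) {
            // Map index i in [0, SIGMOID_TABLE_SIZE) onto x in [-MAX_SIGMOID, MAX_SIGMOID).
            float x = ((float) i / SIGMOID_TABLE_SIZE * 2 - 1) * (float) MAX_SIGMOID;
            sigmoidTable[i] = 1.f / ((float) Math.exp(-x) + 1.f);
        }
        return sigmoidTable;
    }
}
```

The middle entry of the table corresponds to x = 0, so it holds 0.5.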


---


[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141545135
  
--- Diff: docs/gitbook/embedding/word2vec.md ---
@@ -0,0 +1,399 @@
+
+
+Word embedding is a powerful tool for many tasks,
+e.g., finding similar words,
+building feature vectors for supervised machine learning tasks, and solving word analogies
+such as `king - man + woman =~ queen`.
+In word embedding,
+each word is represented by a low-dimensional, dense vector.
+**Skip-Gram** and **Continuous Bag-of-Words** (CBoW) are the most popular algorithms for obtaining good word embeddings (a.k.a. word2vec).
+
+The papers that introduced these methods are as follows:
+
+- T. Mikolov, et al., [Distributed Representations of Words and Phrases and Their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). NIPS, 2013.
+- T. Mikolov, et al., [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781). ICLR, 2013.
+
+Hivemall provides two algorithms: Skip-Gram and CBoW, both with negative sampling.
+Hivemall enables you to train word2vec on sequence data such as,
+but not limited to, documents.
+This article gives usage instructions for the feature.
+
+
+
+>  Note
+> This feature is supported from Hivemall v0.5-rc.? or later.
+
+# Prepare document data
+
+Assume that you already have a `docs` table that contains many documents as strings, each with a unique index:
+
+```sql
+select * FROM docs;
+```
+
+| docId | doc |
+|:---:|:---|
+|   0   | "Alice was beginning to get very tired of sitting by her sister on the bank ..." |
+|  ...  | ... |
+
+First, split each document into words with a tokenizer function such as [`tokenize`](../misc/tokenizer.html).
+
+```sql
+drop table docs_words;
+create table docs_words as
+  select
+docid,
+tokenize(doc, true) as words
+  FROM
+docs
+;
+```
+
+The resulting table contains the tokenized documents:
+
+| docid | words |
+|:---:|:---|
+|   0   | ["alice", "was", "beginning", "to", "get", "very", "tired", "of", "sitting", "by", "her", "sister", "on", "the", "bank", ...] |
+|  ...  | ... |
+
+Then, count the frequency of each word and remove low-frequency words from the vocabulary.
+Removing low-frequency words is an optional preprocessing step, but it speeds up training of the word vectors.
+
+```sql
+set hivevar:mincount=5;
+
+drop table freq;
+create table freq as
+select
+  row_number() over () - 1 as wordid,
+  word,
+  freq
+from (
+  select
+word,
+COUNT(*) as freq
+  from
+docs_words
+  LATERAL VIEW explode(words) lTable as word
+  group by
+word
+) t
+where freq >= ${mincount}
+;
+```
+
+Hivemall's word2vec supports two word types: string and int.
+The string type tends to use a large amount of memory during training,
+whereas the int type tends to use less.
+If you train on a small dataset, we recommend the string type,
+because memory usage is negligible and the HiveQL is simpler.
+If you train on a large dataset, we recommend the int type,
+because it saves memory during training.
+
+# Create sub-sampling table
+
+The sub-sampling table stores a sub-sampling probability for each word.
+
+The sub-sampling probability of word $$w_i$$ is computed by the following equation:
+
+$$
+\begin{aligned}
+f(w_i) = \sqrt{\frac{\mathrm{sample}}{freq(w_i)/\sum freq(w)}} + \frac{\mathrm{sample}}{freq(w_i)/\sum freq(w)}
+\end{aligned}
+$$
+
+During word2vec training,
+words that are not sampled are ignored.
+This speeds up training and mitigates the imbalance between rare and frequent words by discarding some occurrences of frequent words.
+The smaller the `sample` value,
+the fewer words are used during training.
+
+```sql
+set hivevar:sample=1e-4;
+
+drop table subsampling_table;
+create table subsampling_table as
+with stats as (
+  select
+sum(freq) as numTrainWords
+  FROM
+freq
+)
+select
+  l.wordid,
+  l.word,
+  sqrt(${sample}/(l.freq/r.numTrainWords)) + ${sample}/(l.freq/r.numTrainWords) as p
+from
+  freq l
+cross join
+  stats r
+;
+```
+
+```sql
+select * FROM subsampling_table order by p;
+```
+
+| wordid | word | p |
+|:---:|:---:|:---:|
+| 48645 | the  | 0.04013665|
+| 11245 | of   | 0.052463654|
+| 16368 | and  | 0.0638|
+| 61938 | 00   | 0.068162076|
+| 19977 | in   | 0.071441144|
+| 83599 | 0| 0.07528994|
+| 95017 | a| 0.07559573|
+| 
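
For reference, the sub-sampling formula quoted in this diff can be checked outside Hive with a small sketch. The class and method names below are hypothetical and the word counts in the note after the code are made up for illustration; only the formula itself comes from the quoted document:

```java
public class SubsamplingSketch {
    // f(w) = sqrt(sample / ratio) + sample / ratio,
    // where ratio = freq(w) / (sum of freq over all words).
    public static double subsamplingProbability(final long freq, final long numTrainWords,
            final double sample) {
        final double ratio = (double) freq / numTrainWords;
        final double scaled = sample / ratio;
        return Math.sqrt(scaled) + scaled;
    }
}
```

With `sample = 1e-4`, a word that makes up 1% of a 1,000,000-word corpus (`freq = 10000`) gets a value of roughly 0.11, while a word seen only 5 times gets a value above 1. Assuming the value is compared against a uniform random number in [0, 1), as in the original word2vec implementation, such a rare word would never be dropped.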

[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141543945
  
--- Diff: core/src/main/java/hivemall/embedding/Word2VecUDTF.java ---
@@ -0,0 +1,364 @@

[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141542877
  
--- Diff: core/src/main/java/hivemall/embedding/AbstractWord2VecModel.java ---
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.embedding;
+
+import hivemall.math.random.PRNG;
+import hivemall.math.random.RandomNumberGeneratorFactory;
+import hivemall.utils.collections.maps.Int2FloatOpenHashTable;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import java.util.List;
+
+public abstract class AbstractWord2VecModel {
+// cached sigmoid function parameters
+protected static final int MAX_SIGMOID = 6;
+protected static final int SIGMOID_TABLE_SIZE = 1000;
+protected float[] sigmoidTable;
+
+
--- End diff --

remove unnecessary blank line


---


[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141543245
  
--- Diff: core/src/main/java/hivemall/embedding/AbstractWord2VecModel.java ---
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.embedding;
+
+import hivemall.math.random.PRNG;
+import hivemall.math.random.RandomNumberGeneratorFactory;
+import hivemall.utils.collections.maps.Int2FloatOpenHashTable;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import java.util.List;
+
+public abstract class AbstractWord2VecModel {
+// cached sigmoid function parameters
+protected static final int MAX_SIGMOID = 6;
+protected static final int SIGMOID_TABLE_SIZE = 1000;
+protected float[] sigmoidTable;
+
+
+@Nonnegative
+protected int dim;
+protected int win;
+protected int neg;
+protected int iter;
+
+// learning rate parameters
+@Nonnegative
+protected float lr;
+@Nonnegative
+private float startingLR;
+@Nonnegative
+private long numTrainWords;
+@Nonnegative
+protected long wordCount;
+@Nonnegative
+private long lastWordCount;
+
+protected PRNG rnd;
+
+protected Int2FloatOpenHashTable contextWeights;
+protected Int2FloatOpenHashTable inputWeights;
+protected Int2FloatOpenHashTable S;
+protected int[] aliasWordId;
+
+protected AbstractWord2VecModel(final int dim, final int win, final int neg, final int iter,
+final float startingLR, final long numTrainWords, final Int2FloatOpenHashTable S,
+final int[] aliasWordId) {
+this.win = win;
+this.neg = neg;
+this.iter = iter;
+this.dim = dim;
+this.startingLR = this.lr = startingLR;
+this.numTrainWords = numTrainWords;
+
+// alias sampler for negative sampling
+this.S = S;
+this.aliasWordId = aliasWordId;
+
+this.wordCount = 0L;
+this.lastWordCount = 0L;
+this.rnd = RandomNumberGeneratorFactory.createPRNG(1001);
+
+this.sigmoidTable = initSigmoidTable();
+
+// TODO how to estimate size
+this.inputWeights = new Int2FloatOpenHashTable(10578 * dim);
+this.inputWeights.defaultReturnValue(0.f);
+this.contextWeights = new Int2FloatOpenHashTable(10578 * dim);
+this.contextWeights.defaultReturnValue(0.f);
+}
+
+private static float[] initSigmoidTable() {
+float[] sigmoidTable = new float[SIGMOID_TABLE_SIZE];
+for (int i = 0; i < SIGMOID_TABLE_SIZE; i++) {
+float x = ((float) i / SIGMOID_TABLE_SIZE * 2 - 1) * (float) MAX_SIGMOID;
+sigmoidTable[i] = 1.f / ((float) Math.exp(-x) + 1.f);
+}
+return sigmoidTable;
+}
+
+protected void initWordWeights(final int wordId) {
+for (int i = 0; i < dim; i++) {
+inputWeights.put(wordId * dim + i, ((float) rnd.nextDouble() - 0.5f) / dim);
+}
+}
+
+protected static float sigmoid(final float v, final float[] sigmoidTable) {
+if (v > MAX_SIGMOID) {
+return 1.f;
+} else if (v < -MAX_SIGMOID) {
+return 0.f;
+} else {
+return sigmoidTable[(int) ((v + MAX_SIGMOID) * (SIGMOID_TABLE_SIZE / MAX_SIGMOID / 2))];
+}
+}
+
+protected void updateLearningRate() {
+// TODO: valid lr?
--- End diff --

remove this TODO comment and blank lines.


---


[GitHub] incubator-hivemall pull request #116: [WIP][HIVEMALL-118] word2vec

2017-09-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141542968
  
--- Diff: core/src/main/java/hivemall/embedding/AbstractWord2VecModel.java ---
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.embedding;
+
+import hivemall.math.random.PRNG;
+import hivemall.math.random.RandomNumberGeneratorFactory;
+import hivemall.utils.collections.maps.Int2FloatOpenHashTable;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+import java.util.List;
+
+public abstract class AbstractWord2VecModel {
+// cached sigmoid function parameters
+protected static final int MAX_SIGMOID = 6;
+protected static final int SIGMOID_TABLE_SIZE = 1000;
+protected float[] sigmoidTable;
+
+
+@Nonnegative
+protected int dim;
+protected int win;
--- End diff --

`@Nonnegative` for each variable (win, neg, iter).
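
A minimal sketch of the fields with the annotation applied to each one. `HyperParamsSketch` is a hypothetical class, the nested `Nonnegative` annotation is a stand-in for `javax.annotation.Nonnegative` so the snippet compiles without the JSR-305 jar, and the runtime check in the constructor is an addition for illustration that is not part of the PR:

```java
public class HyperParamsSketch {
    // Stand-in for javax.annotation.Nonnegative so this sketch is self-contained.
    @interface Nonnegative {}

    @Nonnegative
    protected int dim;
    @Nonnegative
    protected int win;
    @Nonnegative
    protected int neg;
    @Nonnegative
    protected int iter;

    public HyperParamsSketch(final int dim, final int win, final int neg, final int iter) {
        // The annotation only documents intent; this check (an assumption, not
        // in the PR) enforces it at construction time.
        if (dim < 0 || win < 0 || neg < 0 || iter < 0) {
            throw new IllegalArgumentException("dim, win, neg and iter must be non-negative");
        }
        this.dim = dim;
        this.win = win;
        this.neg = neg;
        this.iter = iter;
    }
}
```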


---