Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/116#discussion_r141545337

--- Diff: docs/gitbook/embedding/word2vec.md ---
@@ -0,0 +1,399 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements. See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership. The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied. See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+Word embedding is a powerful tool for many tasks,
+e.g., finding similar words,
+building feature vectors for supervised machine learning tasks,
+and word analogy, such as `king - man + woman =~ queen`.
+In word embedding,
+each word is represented as a low-dimensional, dense vector.
+**Skip-Gram** and **Continuous Bag-of-Words** (CBoW) are the most popular algorithms for obtaining good word embeddings (a.k.a. word2vec).
+
+The papers introducing these methods are:
+
+- T. Mikolov, et al., [Distributed Representations of Words and Phrases and Their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). NIPS, 2013.
+- T. Mikolov, et al., [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781). ICLR, 2013.
+
+Hivemall provides two algorithms: Skip-Gram and CBoW, both with negative sampling.
+Hivemall enables you to train word2vec on your sequence data such as,
+but not limited to, documents.
+This article explains how to use this feature.
+
+<!-- toc -->
+
+> #### Note
+> This feature is supported from Hivemall v0.5-rc.? or later.
+
+# Prepare document data
+
+Assume that you already have a `docs` table which stores many documents as strings, each with a unique index:
+
+```sql
+select * from docs;
+```
+
+| docid | doc |
+|:----:|:----|
+| 0 | "Alice was beginning to get very tired of sitting by her sister on the bank ..." |
+| ... | ... |
+
+First, split each document into words with a tokenization function such as [`tokenize`](../misc/tokenizer.html).
+
+```sql
+drop table if exists docs_words;
+create table docs_words as
+  select
+    docid,
+    tokenize(doc, true) as words
+  from
+    docs
+;
+```
+
+This table shows the tokenized documents.
+
+| docid | words |
+|:----:|:----|
+| 0 | ["alice", "was", "beginning", "to", "get", "very", "tired", "of", "sitting", "by", "her", "sister", "on", "the", "bank", ...] |
+| ... | ... |
+
+Then, count the frequency of each word and remove low-frequency words from the vocabulary.
+Removing low-frequency words is an optional preprocessing step, but it makes training faster.
+
+```sql
+set hivevar:mincount=5;
+
+drop table if exists freq;
+create table freq as
+select
+  row_number() over () - 1 as wordid,
+  word,
+  freq
+from (
+  select
+    word,
+    count(*) as freq
+  from
+    docs_words
+  lateral view explode(words) lTable as word
+  group by
+    word
+) t
+where freq >= ${mincount}
+;
+```
+
+Hivemall's word2vec supports two word types: string and int.
+The string type tends to use a huge amount of memory during training,
+while the int type tends to use less.
+If you train on a small dataset, we recommend the string type,
+because its memory usage is negligible and the HiveQL is simpler.
+If you train on a large dataset, we recommend the int type,
+because it saves memory during training.
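As an editorial aside, the vocabulary-building step in the SQL above (count word frequencies, drop words below `mincount`, and assign each surviving word a 0-based `wordid`) can be sketched in plain Python. This is an illustrative equivalent only, not Hivemall code; the function name `build_vocab` is hypothetical.

```python
from collections import Counter

def build_vocab(tokenized_docs, mincount=5):
    """Count word frequencies across all documents, keep words with
    freq >= mincount, and assign each a 0-based integer id (like `wordid`)."""
    freq = Counter(w for words in tokenized_docs for w in words)
    vocab = {}
    wordid = 0
    for word, f in freq.items():
        if f >= mincount:
            vocab[word] = (wordid, f)
            wordid += 1
    return vocab

docs = [["alice", "was", "beginning"], ["alice", "was", "tired"]]
vocab = build_vocab(docs, mincount=2)
# only "alice" and "was" appear at least twice, so only they survive
```

As in the SQL, the ids are dense and start at 0, which is what makes the memory-saving int word type possible.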
+
+# Create sub-sampling table
+
+The sub-sampling table stores a sub-sampling probability per word.
+
+The sub-sampling probability of word $$w_i$$ is computed by the following equation:
+
+$$
+\begin{aligned}
+f(w_i) = \sqrt{\frac{\mathrm{sample}}{freq(w_i)/\sum freq(w)}} + \frac{\mathrm{sample}}{freq(w_i)/\sum freq(w)}
+\end{aligned}
+$$
+
+During word2vec training,
--- End diff --

remove line break after `,`.
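For readers of this thread: the sub-sampling formula quoted in the diff above can be sketched as a small Python function. This is an illustrative sketch only (the function name `subsampling_prob` is hypothetical, not a Hivemall API); it computes the keep-probability of a word from its relative frequency.

```python
import math

def subsampling_prob(freq_wi, total_freq, sample=1e-4):
    """Keep-probability of word w_i per the formula above:
    sqrt(sample / r) + sample / r, where r = freq(w_i) / sum(freq)."""
    r = freq_wi / total_freq  # relative frequency of w_i
    return math.sqrt(sample / r) + sample / r

# A word accounting for 1% of the corpus with sample=1e-4:
# r = 0.01, so p = sqrt(0.01) + 0.01 = 0.11
p = subsampling_prob(freq_wi=1000, total_freq=100000, sample=1e-4)
```

Note that for rare words the value exceeds 1, which effectively means "always keep"; only frequent words are aggressively down-sampled.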
---