Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/116#discussion_r141544983

--- Diff: docs/gitbook/embedding/word2vec.md ---
@@ -0,0 +1,399 @@

<!--
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements. See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership. The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied. See the License for the
  specific language governing permissions and limitations
  under the License.
-->

Word embedding is a powerful tool for many tasks, e.g. finding similar words, building feature vectors for supervised machine learning tasks, and word analogy, such as `king - man + woman =~ queen`. In word embedding, each word is represented as a low-dimensional dense vector. **Skip-Gram** and **Continuous Bag-of-words** (CBoW) are the most popular algorithms for obtaining good word embeddings (a.k.a. word2vec).

The papers that introduce these methods are:

- T. Mikolov, et al., [Distributed Representations of Words and Phrases and Their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). NIPS, 2013.
- T. Mikolov, et al., [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781). ICLR, 2013.

Hivemall provides two algorithms, Skip-gram and CBoW, both trained with negative sampling. Hivemall enables you to train word2vec on your sequence data such as, but not limited to, documents. This article gives usage instructions for the feature.

<!-- toc -->

> #### Note
> This feature is supported from Hivemall v0.5-rc.? or later.

# Prepare document data

Assume that you already have a `docs` table which contains many documents, each stored as a string with a unique index:

```sql
select * from docs;
```

| docId | doc |
|:----: |:----|
| 0 | "Alice was beginning to get very tired of sitting by her sister on the bank ..." |
| ... | ... |

First, each document is split into words by a tokenization function such as [`tokenize`](../misc/tokenizer.html):

```sql
drop table docs_words;
create table docs_words as
  select
    docid,
    tokenize(doc, true) as words
  from
    docs
;
```

The resulting table contains the tokenized documents:

| docId | words |
|:----: |:----|
| 0 | ["alice", "was", "beginning", "to", "get", "very", "tired", "of", "sitting", "by", "her", "sister", "on", "the", "bank", ...] |
| ... | ... |

Then, count the frequency of each word and remove low-frequency words from the vocabulary. Removing low-frequency words is an optional preprocessing step, but it makes training of the word vectors faster.

```sql
set hivevar:mincount=5;

drop table freq;
create table freq as
select
  row_number() over () - 1 as wordid,
  word,
  freq
from (
  select
    word,
    count(*) as freq
  from
    docs_words
    LATERAL VIEW explode(words) lTable as word
  group by
    word
) t
where freq >= ${mincount}
;
```

Hivemall's word2vec supports two word types: string and int. The string type tends to use a lot of memory during training, while the int type tends to use less. If you train on a small dataset, we recommend the string type, because the memory usage is negligible and the HiveQL is simpler. If you train on a large dataset, we recommend the int type, because it saves memory during training.
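
As a quick, optional sanity check before choosing a type, you can confirm the vocabulary size and that `wordid` forms a dense zero-based index (a minimal sketch using only the `freq` table built above):

```sql
-- optional sanity check: vocabulary size and wordid range
-- wordid is assigned by row_number() over () - 1, so it should run from 0 to vocab_size - 1
select
  count(*) as vocab_size,
  min(wordid) as min_wordid,
  max(wordid) as max_wordid
from
  freq
;
```

When the int type is used, these `wordid`s are what the training data will contain, and the `freq` table keeps the mapping back to the original words.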

# Create sub-sampling table

The sub-sampling table stores a sub-sampling probability per word.

The sub-sampling probability $$p(w_i)$$ of a word $$w_i$$ is computed by the following equation:

$$
p(w_i) = \sqrt{\frac{\mathrm{sample}}{\mathrm{freq}(w_i)/\sum_w \mathrm{freq}(w)}} + \frac{\mathrm{sample}}{\mathrm{freq}(w_i)/\sum_w \mathrm{freq}(w)}
$$

During word2vec training, words that are not sub-sampled are ignored. This makes training faster and counteracts the imbalance between rare and frequent words by down-sampling the frequent ones. The smaller the `sample` value, the fewer words are used during training.

```sql
set hivevar:sample=1e-4;

drop table subsampling_table;
create table subsampling_table as
with stats as (
  select
    sum(freq) as numTrainWords
  from
    freq
)
select
  l.wordid,
  l.word,
  sqrt(${sample}/(l.freq/r.numTrainWords)) + ${sample}/(l.freq/r.numTrainWords) as p
from
  freq l
cross join
  stats r
;
```

```sql
select * from subsampling_table order by p;
```

| wordid | word | p |
|:----: | :----: |:----:|
| 48645 | the | 0.04013665 |
| 11245 | of | 0.052463654 |
| 16368 | and | 0.06555538 |
| 61938 | 00 | 0.068162076 |
| 19977 | in | 0.071441144 |
| 83599 | 0 | 0.07528994 |
| 95017 | a | 0.07559573 |
| 1225 | to | 0.07953133 |
| 37062 | 0000 | 0.08779001 |
| 58246 | is | 0.09049763 |
| ... | ... | ... |

The first row shows that only about 4% of the occurrences of `the` in the documents are used during training.

# Delete low frequency words and high frequency words from `docs_words`

To remove useless words from the corpus, low-frequency words and high-frequency words are deleted. In addition, to avoid loading a long document into memory at once, each document is split into sub-documents.

```sql
set hivevar:maxlength=1500;
set hivevar:seed=31;

drop table train_docs;
create table train_docs as
with docs_exploded as (
  select
    docid,
    word,
    pos % ${maxlength} as pos,
    pos div ${maxlength} as splitid,
    rand(${seed}) as rnd
  from
    docs_words LATERAL VIEW posexplode(words) t as pos, word
)
select
  l.docid,
  -- to_ordered_list(l.word, l.pos) as words
  to_ordered_list(r2.wordid, l.pos) as words
from
  docs_exploded l
  LEFT SEMI JOIN freq r on (l.word = r.word)
  JOIN subsampling_table r2 on (l.word = r2.word)
where
  r2.p > l.rnd
group by
  l.docid, l.splitid
;
```

If you want to store string words in the `train_docs` table, replace `to_ordered_list(r2.wordid, l.pos) as words` with `to_ordered_list(l.word, l.pos) as words`.

# Create negative sampling table

Negative sampling is an approximation of the [softmax function](https://en.wikipedia.org/wiki/Softmax_function). Here, `negative_table` is used to store the word sampling probability for negative sampling. `z` is a hyperparameter of the noise distribution for negative sampling. During word2vec training,

--- End diff --

Line break is not needed. A line break after `,` is unreasonable (elsewhere as well).
---