Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/116#discussion_r141545135

--- Diff: docs/gitbook/embedding/word2vec.md ---
@@ -0,0 +1,399 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+Word embedding is a powerful tool for many tasks,
+e.g., finding similar words,
+providing feature vectors for supervised machine learning tasks,
+and word analogy, such as `king - man + woman =~ queen`.
+In word embedding,
+each word is represented by a low-dimensional, dense vector.
+**Skip-Gram** and **Continuous Bag-of-words** (CBoW) are the most popular algorithms for obtaining good word embeddings (a.k.a. word2vec).
+
+The papers that introduced these methods are as follows:
+
+- T. Mikolov, et al., [Distributed Representations of Words and Phrases and Their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). NIPS, 2013.
+- T. Mikolov, et al., [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781). ICLR, 2013.
+
+Hivemall provides two types of algorithms: Skip-Gram and CBoW with negative sampling.
+Hivemall enables you to train word2vec models on your sequence data,
+such as, but not limited to, documents.
+This article explains how to use this feature.
+
+<!-- toc -->
+
+> #### Note
+> This feature is supported from Hivemall v0.5-rc.? or later.
+
+# Prepare document data
+
+Assume that you already have a `docs` table which contains many documents as strings, each with a unique index:
+
+```sql
+select * from docs;
+```
+
+| docId | doc |
+|:----: |:----|
+| 0 | "Alice was beginning to get very tired of sitting by her sister on the bank ..." |
+| ... | ... |
+
+First, each document is split into words by a tokenization function such as [`tokenize`](../misc/tokenizer.html).
+
+```sql
+drop table docs_words;
+create table docs_words as
+  select
+    docid,
+    tokenize(doc, true) as words
+  from
+    docs
+;
+```
+
+This table contains the tokenized documents.
+
+| docId | words |
+|:----: |:----|
+| 0 | ["alice", "was", "beginning", "to", "get", "very", "tired", "of", "sitting", "by", "her", "sister", "on", "the", "bank", ...] | 
+| ... | ... |
+
+Then, count the frequency of each word and remove low-frequency words from the vocabulary.
+Removing low-frequency words is an optional preprocessing step,
+but it is effective for speeding up training.
+
+```sql
+set hivevar:mincount=5;
+
+drop table freq;
+create table freq as
+select
+  row_number() over () - 1 as wordid,
+  word,
+  freq
+from (
+  select
+    word,
+    count(*) as freq
+  from
+    docs_words
+    lateral view explode(words) lTable as word
+  group by
+    word
+) t
+where freq >= ${mincount}
+;
+```
+
+Hivemall's word2vec supports two word types: string and int.
+The string type tends to use a huge amount of memory during training,
+while the int type tends to use less memory.
+If you train on a small dataset, we recommend using the string type,
+because its memory usage is negligible and the HiveQL is simpler.
+If you train on a large dataset, we recommend using the int type,
+because it saves memory during training.
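The vocabulary construction above (frequency counting plus a `mincount` cutoff) can be sketched in plain Python. This is only an illustration of what the `freq` table computes; `build_vocab` and the toy documents are hypothetical, not part of Hivemall:

```python
from collections import Counter

def build_vocab(tokenized_docs, min_count=5):
    """Count word frequencies and drop words rarer than min_count,
    mirroring the `freq` table built by the HiveQL above."""
    freq = Counter(w for doc in tokenized_docs for w in doc)
    vocab = {w: c for w, c in freq.items() if c >= min_count}
    # Assign consecutive int ids, like row_number() over () - 1 in the HiveQL.
    word_to_id = {w: i for i, w in enumerate(sorted(vocab))}
    return vocab, word_to_id

docs = [["alice", "was", "beginning"], ["alice", "was", "tired"]]
vocab, ids = build_vocab(docs, min_count=2)
# Only "alice" and "was" appear at least twice, so only they survive.
```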
+
+# Create sub-sampling table
+
+The sub-sampling table stores a sub-sampling probability per word.
+
+The sub-sampling probability of word $$w_i$$ is computed by the following equation:
+
+$$
+\begin{aligned}
+p(w_i) = \sqrt{\frac{\mathrm{sample}}{freq(w_i)/\sum freq(w)}} + \frac{\mathrm{sample}}{freq(w_i)/\sum freq(w)}
+\end{aligned}
+$$
+
+During word2vec training,
+words that are not sub-sampled are ignored.
+Sub-sampling speeds up training and mitigates the imbalance between rare and frequent words by discarding part of the occurrences of frequent words.
+The smaller the `sample` value,
+the fewer words are used during training.
+
+```sql
+set hivevar:sample=1e-4;
+
+drop table subsampling_table;
+create table subsampling_table as
+with stats as (
+  select
+    sum(freq) as numTrainWords
+  from
+    freq
+)
+select
+  l.wordid,
+  l.word,
+  sqrt(${sample}/(l.freq/r.numTrainWords)) + ${sample}/(l.freq/r.numTrainWords) as p
+from
+  freq l
+cross join
+  stats r
+;
+```
+
+```sql
+select * from subsampling_table order by p;
+```
+
+| wordid | word | p |
+|:----: | :----: |:----:|
+| 48645 | the | 0.04013665 |
+| 11245 | of | 0.052463654 |
+| 16368 | and | 0.06555538 |
+| 61938 | 00 | 0.068162076 |
+| 19977 | in | 0.071441144 |
+| 83599 | 0 | 0.07528994 |
+| 95017 | a | 0.07559573 |
+| 1225 | to | 0.07953133 |
+| 37062 | 0000 | 0.08779001 |
+| 58246 | is | 0.09049763 |
+| ... | ... | ... |
+
+The first row shows that only about 4% of the occurrences of `the` in the documents are used during training.
+
+# Delete low-frequency and high-frequency words from `docs_words`
+
+To remove uninformative words from the corpus,
+low-frequency words and high-frequency words are deleted.
+In addition, to avoid loading a long document into memory at once,
+each document is split into several sub-documents.
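As a sanity check, the sub-sampling formula above can be evaluated in plain Python. This is a minimal sketch; `keep_probability`, `subsample`, and the toy counts are illustrative, not Hivemall functions:

```python
import math

def keep_probability(freq, total, sample=1e-4):
    """p(w) = sqrt(sample / (freq/total)) + sample / (freq/total),
    the same formula as in the subsampling_table query above."""
    ratio = freq / total
    return math.sqrt(sample / ratio) + sample / ratio

def subsample(words, freqs, total, sample=1e-4, rng=None):
    """Keep each word occurrence when a uniform draw falls below p(w)."""
    import random
    rng = rng or random.random
    return [w for w in words
            if rng() < keep_probability(freqs[w], total, sample)]

# A word making up 6% of a 1M-token corpus is kept only ~4% of the time,
# while a word with frequency 50 has p > 1 and is always kept.
p_frequent = keep_probability(60_000, 1_000_000)
p_rare = keep_probability(50, 1_000_000)
```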
+
+```sql
+set hivevar:maxlength=1500;
+set hivevar:seed=31;
+
+drop table train_docs;
+create table train_docs as
+with docs_exploded as (
+  select
+    docid,
+    word,
+    pos % ${maxlength} as pos,
+    pos div ${maxlength} as splitid,
+    rand(${seed}) as rnd
+  from
+    docs_words lateral view posexplode(words) t as pos, word
+)
+select
+  l.docid,
+  -- to_ordered_list(l.word, l.pos) as words
+  to_ordered_list(r2.wordid, l.pos) as words
+from
+  docs_exploded l
+  left semi join freq r on (l.word = r.word)
+  join subsampling_table r2 on (l.word = r2.word)
+where
+  r2.p > l.rnd
+group by
+  l.docid, l.splitid
+;
+```
+
+If you want to store string words in the `train_docs` table,
+replace `to_ordered_list(r2.wordid, l.pos) as words` with `to_ordered_list(l.word, l.pos) as words`.
+
+# Create negative sampling table
+
+Negative sampling is an approximation of the [softmax function](https://en.wikipedia.org/wiki/Softmax_function).
+Here, the `negative_table` stores the word sampling probabilities for negative sampling.
+`z` is a hyperparameter of the noise distribution for negative sampling.
+During word2vec training,
+words sampled from this distribution are used as negative examples.
+The noise distribution is the unigram distribution raised to the 3/4th power:
+
+$$
+\begin{aligned}
+p(w_i) = \frac{freq(w_i)^{\mathrm{z}}}{\sum freq(w)^{\mathrm{z}}}
+\end{aligned}
+$$
+
+To avoid the huge memory footprint of the original implementation's negative sampling while still sampling fast from this distribution,
+Hivemall uses the [Alias method](https://en.wikipedia.org/wiki/Alias_method).
+
+This method was proposed in the papers below:
+
+- A. J. Walker, New Fast Method for Generating Discrete Random Numbers with Arbitrary Frequency Distributions. Electronics Letters 10, no. 8, pp. 127-128, 1974.
+- A. J. Walker, An Efficient Method for Generating Discrete Random Variables with General Distributions. ACM Transactions on Mathematical Software 3, no. 3, pp. 253-256, 1977.
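The alias method described above can be sketched in Python. This is only an illustration of the technique, under the assumption of unnormalized weights `freq ** z`; the helper names are hypothetical and do not reflect Hivemall's internal implementation:

```python
import random

def build_alias_table(weights):
    """Walker's alias method: O(n) construction, O(1) sampling.
    weights maps word -> unnormalized probability (e.g. freq ** z)."""
    n = len(weights)
    total = sum(weights.values())
    scaled = {w: p * n / total for w, p in weights.items()}
    small = [w for w, p in scaled.items() if p < 1.0]
    large = [w for w, p in scaled.items() if p >= 1.0]
    prob, alias = {}, {}
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l       # keep s with prob[s], else jump to l
        scaled[l] -= 1.0 - scaled[s]           # l donates the remainder of s's slot
        (small if scaled[l] < 1.0 else large).append(l)
    for w in small + large:                    # leftover slots always keep their word
        prob[w], alias[w] = 1.0, w
    return prob, alias

def sample_negative(prob, alias, rng=random):
    """Pick a slot uniformly, then keep its word or jump to its alias."""
    w = rng.choice(list(prob))
    return w if rng.random() < prob[w] else alias[w]
```

With weights `{"a": 3, "b": 1}`, slot `a` always yields `a` and slot `b` yields `b` half the time and `a` otherwise, reproducing the 3:1 sampling ratio in O(1) per draw.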
+
+```sql
+set hivevar:z=3/4;
+
+drop table negative_table;
+create table negative_table as
+select
+  collect_list(array(word, p, other)) as negative_table
+from (
+  select
+    alias_table(to_map(word, negative)) as (word, p, other)
+  from (
+    select
+      word,
+      -- wordid as word,
+      pow(freq, ${z}) as negative
+    from
+      freq
+  ) t
+) t1
+;
+```
+
+The `alias_table` function returns records like the following:
+
+| word | p | other |
+|:----: | :----: |:----:|
+| leopold | 0.6556492 | 0000 |
+| slep | 0.09060383 | leopold |
+| valentinian | 0.76077825 | belarusian |
+| slew | 0.90569097 | colin |
+| lucien | 0.86329675 | overland |
+| equitable | 0.7270946 | farms |
+| insurers | 0.2367955 | israel |
+| lucier | 0.14855136 | supplements |
+| lieve | 0.12075222 | separatist |
+| skyhawks | 0.14079945 | steamed |
+| ... | ... | ... |
+
+To sample a negative word from this `negative_table`:
+
+1. Sample an integer index `i` from $$[0 \ldots \mathrm{num\_alias\_table\_records}]$$.
+2. Sample a float value `r` from $$[0.0 \ldots 1.0]$$.
+3. If `r` < the `p` of the `i`-th record, return the `word` of the `i`-th record; otherwise, return the `other` of the `i`-th record.
+
+Here, to use it in the word2vec training function,
+the records returned by `alias_table` are stored in a single list in the `negative_table`.
+
+# Train word2vec
+
+Hivemall provides the `train_word2vec` function to train word vectors with the word2vec algorithms.
+The default model is `"skipgram"`.
+
+> #### Note
+> You must set the `n` argument to the number of words in the training documents: `select sum(size(words)) from train_docs;`.
+
+## Train Skip-Gram
+
+In the skip-gram model,
+word vectors are trained to predict the nearby words.
+For example, given a sentence like `"alice", "was", "beginning", "to"`,
+the `"was"` vector is trained to predict `"alice"`, `"beginning"` and `"to"`.
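The "predict the nearby words" idea can be made concrete by enumerating the (center, context) training pairs, as in this minimal Python sketch (`skipgram_pairs` is illustrative, not a Hivemall function; the window size corresponds to the `-win` option):

```python
def skipgram_pairs(words, window=5):
    """Enumerate (center, context) training pairs: each word predicts
    its neighbours within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

sentence = ["alice", "was", "beginning", "to"]
# With window=1, "was" predicts only its immediate neighbours
# "alice" and "beginning"; with a larger window it also predicts "to".
pairs = skipgram_pairs(sentence, window=1)
```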
+
+```sql
+select sum(size(words)) from train_docs;
+set hivevar:n=418953; -- the value returned by the previous query
+
+drop table skipgram;
+create table skipgram as
+select
+  train_word2vec(
+    r.negative_table,
+    l.words,
+    "-n ${n} -win 5 -neg 15 -iter 5 -dim 100 -model skipgram"
+  )
+from
+  train_docs l
+  cross join negative_table r
+;
+```
+
+When words are treated as int instead of string,
+you may need to transform the int wordid back into the string word with a `join` statement.
+
+```sql
+drop table skipgram;
+
+create table skipgram as
+select
+  r.word, t.i, t.wi
+from (
+  select
+    train_word2vec(
+      r.negative_table,
+      l.wordsint,
+      "-n 418953 -win 5 -neg 15 -iter 5"

--- End diff --

`-n 418953` should be `-n ${n}`
---