Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/116#discussion_r141544983

--- Diff: docs/gitbook/embedding/word2vec.md ---
@@ -0,0 +1,399 @@

<!--
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements. See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership. The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied. See the License for the
  specific language governing permissions and limitations
  under the License.
-->

Word embedding is a powerful tool for many tasks, e.g. finding similar words, building feature vectors for supervised machine learning tasks, and word analogy, such as `king - man + woman =~ queen`. In word embedding, each word is represented as a low-dimensional dense vector. **Skip-Gram** and **Continuous Bag-of-words** (CBoW) are the most popular algorithms for obtaining good word embeddings (a.k.a. word2vec).

The papers that introduce these methods are:

- T. Mikolov, et al., [Distributed Representations of Words and Phrases and Their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). NIPS, 2013.
- T. Mikolov, et al., [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781). ICLR, 2013.

Hivemall provides two algorithms, Skip-gram and CBoW, both trained with negative sampling. Hivemall enables you to train word2vec on your sequence data such as, but not limited to, documents. This article gives usage instructions for the feature.

<!-- toc -->

> #### Note
> This feature is supported from Hivemall v0.5-rc.? or later.

# Prepare document data

Assume that you already have a `docs` table which contains many documents, each stored as a string with a unique index:

```sql
select * from docs;
```

| docId | doc |
|:----: |:----|
| 0 | "Alice was beginning to get very tired of sitting by her sister on the bank ..." |
| ... | ... |

First, each document is split into words by a tokenization function such as [`tokenize`](../misc/tokenizer.html):

```sql
drop table docs_words;
create table docs_words as
  select
    docid,
    tokenize(doc, true) as words
  from
    docs
;
```

The resulting table contains the tokenized documents:

| docId | words |
|:----: |:----|
| 0 | ["alice", "was", "beginning", "to", "get", "very", "tired", "of", "sitting", "by", "her", "sister", "on", "the", "bank", ...] |
| ... | ... |

Then, count the frequency of each word and remove low-frequency words from the vocabulary. Removing low-frequency words is an optional preprocessing step, but it makes training of the word vectors faster.

```sql
set hivevar:mincount=5;

drop table freq;
create table freq as
select
  row_number() over () - 1 as wordid,
  word,
  freq
from (
  select
    word,
    count(*) as freq
  from
    docs_words
    LATERAL VIEW explode(words) lTable as word
  group by
    word
) t
where freq >= ${mincount}
;
```

Hivemall's word2vec supports two word types: string and int. The string type tends to use a lot of memory during training, while the int type tends to use less. If you train on a small dataset, we recommend the string type, because the memory usage is negligible and the HiveQL is simpler. If you train on a large dataset, we recommend the int type, because it saves memory during training.
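
As a quick, optional sanity check before choosing a type, you can confirm the vocabulary size and that `wordid` forms a dense zero-based index (a minimal sketch using only the `freq` table built above):

```sql
-- optional sanity check: vocabulary size and wordid range
-- wordid is assigned by row_number() over () - 1, so it should run from 0 to vocab_size - 1
select
  count(*) as vocab_size,
  min(wordid) as min_wordid,
  max(wordid) as max_wordid
from
  freq
;
```

When the int type is used, these `wordid`s are what the training data will contain, and the `freq` table keeps the mapping back to the original words.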

# Create sub-sampling table

The sub-sampling table stores a sub-sampling probability per word.

The sub-sampling probability $$p(w_i)$$ of a word $$w_i$$ is computed by the following equation:

$$
p(w_i) = \sqrt{\frac{\mathrm{sample}}{\mathrm{freq}(w_i)/\sum_w \mathrm{freq}(w)}} + \frac{\mathrm{sample}}{\mathrm{freq}(w_i)/\sum_w \mathrm{freq}(w)}
$$

During word2vec training, words that are not sub-sampled are ignored. This makes training faster and counteracts the imbalance between rare and frequent words by down-sampling the frequent ones. The smaller the `sample` value, the fewer words are used during training.

```sql
set hivevar:sample=1e-4;

drop table subsampling_table;
create table subsampling_table as
with stats as (
  select
    sum(freq) as numTrainWords
  from
    freq
)
select
  l.wordid,
  l.word,
  sqrt(${sample}/(l.freq/r.numTrainWords)) + ${sample}/(l.freq/r.numTrainWords) as p
from
  freq l
cross join
  stats r
;
```

```sql
select * from subsampling_table order by p;
```

| wordid | word | p |
|:----: | :----: |:----:|
| 48645 | the | 0.04013665 |
| 11245 | of | 0.052463654 |
| 16368 | and | 0.06555538 |
| 61938 | 00 | 0.068162076 |
| 19977 | in | 0.071441144 |
| 83599 | 0 | 0.07528994 |
| 95017 | a | 0.07559573 |
| 1225 | to | 0.07953133 |
| 37062 | 0000 | 0.08779001 |
| 58246 | is | 0.09049763 |
| ... | ... | ... |

The first row shows that only about 4% of the occurrences of `the` in the documents are used during training.

# Delete low frequency words and high frequency words from `docs_words`

To remove useless words from the corpus, low-frequency words and high-frequency words are deleted. In addition, to avoid loading a long document into memory at once, each document is split into sub-documents.

```sql
set hivevar:maxlength=1500;
set hivevar:seed=31;

drop table train_docs;
create table train_docs as
with docs_exploded as (
  select
    docid,
    word,
    pos % ${maxlength} as pos,
    pos div ${maxlength} as splitid,
    rand(${seed}) as rnd
  from
    docs_words LATERAL VIEW posexplode(words) t as pos, word
)
select
  l.docid,
  -- to_ordered_list(l.word, l.pos) as words
  to_ordered_list(r2.wordid, l.pos) as words
from
  docs_exploded l
  LEFT SEMI JOIN freq r on (l.word = r.word)
  JOIN subsampling_table r2 on (l.word = r2.word)
where
  r2.p > l.rnd
group by
  l.docid, l.splitid
;
```

If you want to store string words in the `train_docs` table, replace `to_ordered_list(r2.wordid, l.pos) as words` with `to_ordered_list(l.word, l.pos) as words`.

# Create negative sampling table

Negative sampling is an approximation of the [softmax function](https://en.wikipedia.org/wiki/Softmax_function). Here, `negative_table` is used to store the word sampling probability for negative sampling. `z` is a hyperparameter of the noise distribution for negative sampling. During word2vec training,

--- End diff --

Line break is not needed. A line break after `,` is unreasonable (elsewhere as well).
---