Github user myui commented on a diff in the pull request: https://github.com/apache/incubator-hivemall/pull/116#discussion_r141545337

--- Diff: docs/gitbook/embedding/word2vec.md ---
@@ -0,0 +1,399 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements. See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership. The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied. See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+Word embedding is a powerful tool for many tasks,
+e.g., finding similar words,
+building feature vectors for supervised machine learning tasks,
+and word analogy, such as `king - man + woman =~ queen`.
+In word embedding,
+each word is represented as a low-dimensional, dense vector.
+**Skip-Gram** and **Continuous Bag-of-Words** (CBoW) are the most popular algorithms for obtaining good word embeddings (a.k.a. word2vec).
+
+The papers introducing these methods are:
+
+- T. Mikolov, et al., [Distributed Representations of Words and Phrases and Their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). NIPS, 2013.
+- T. Mikolov, et al., [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781). ICLR, 2013.
+
+Hivemall provides two algorithms: Skip-Gram and CBoW, both with negative sampling.
+Hivemall enables you to train word2vec on your sequence data such as,
+but not limited to, documents.
+This article explains how to use this feature.
+
+<!-- toc -->
+
+> #### Note
+> This feature is supported from Hivemall v0.5-rc.? or later.
+
+# Prepare document data
+
+Assume that you already have a `docs` table which stores many documents as strings, each with a unique index:
+
+```sql
+select * from docs;
+```
+
+| docid | doc |
+|:----:|:----|
+| 0 | "Alice was beginning to get very tired of sitting by her sister on the bank ..." |
+| ... | ... |
+
+First, split each document into words with a tokenization function such as [`tokenize`](../misc/tokenizer.html).
+
+```sql
+drop table if exists docs_words;
+create table docs_words as
+  select
+    docid,
+    tokenize(doc, true) as words
+  from
+    docs
+;
+```
+
+This table shows the tokenized documents.
+
+| docid | words |
+|:----:|:----|
+| 0 | ["alice", "was", "beginning", "to", "get", "very", "tired", "of", "sitting", "by", "her", "sister", "on", "the", "bank", ...] |
+| ... | ... |
+
+Then, count the frequency of each word and remove low-frequency words from the vocabulary.
+Removing low-frequency words is an optional preprocessing step, but it makes training faster.
+
+```sql
+set hivevar:mincount=5;
+
+drop table if exists freq;
+create table freq as
+select
+  row_number() over () - 1 as wordid,
+  word,
+  freq
+from (
+  select
+    word,
+    count(*) as freq
+  from
+    docs_words
+  lateral view explode(words) lTable as word
+  group by
+    word
+) t
+where freq >= ${mincount}
+;
+```
+
+Hivemall's word2vec supports two word types: string and int.
+The string type tends to use a huge amount of memory during training,
+while the int type tends to use less.
+If you train on a small dataset, we recommend the string type,
+because its memory usage is negligible and the HiveQL is simpler.
+If you train on a large dataset, we recommend the int type,
+because it saves memory during training.
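As an editorial aside, the vocabulary-building step in the SQL above (count word frequencies, drop words below `mincount`, and assign each surviving word a 0-based `wordid`) can be sketched in plain Python. This is an illustrative equivalent only, not Hivemall code; the function name `build_vocab` is hypothetical.

```python
from collections import Counter

def build_vocab(tokenized_docs, mincount=5):
    """Count word frequencies across all documents, keep words with
    freq >= mincount, and assign each a 0-based integer id (like `wordid`)."""
    freq = Counter(w for words in tokenized_docs for w in words)
    vocab = {}
    wordid = 0
    for word, f in freq.items():
        if f >= mincount:
            vocab[word] = (wordid, f)
            wordid += 1
    return vocab

docs = [["alice", "was", "beginning"], ["alice", "was", "tired"]]
vocab = build_vocab(docs, mincount=2)
# only "alice" and "was" appear at least twice, so only they survive
```

As in the SQL, the ids are dense and start at 0, which is what makes the memory-saving int word type possible.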
+
+# Create sub-sampling table
+
+The sub-sampling table stores a sub-sampling probability per word.
+
+The sub-sampling probability of word $$w_i$$ is computed by the following equation:
+
+$$
+\begin{aligned}
+f(w_i) = \sqrt{\frac{\mathrm{sample}}{freq(w_i)/\sum freq(w)}} + \frac{\mathrm{sample}}{freq(w_i)/\sum freq(w)}
+\end{aligned}
+$$
+
+During word2vec training,
--- End diff --

remove line break after `,`.
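For readers of this thread: the sub-sampling formula quoted in the diff above can be sketched as a small Python function. This is an illustrative sketch only (the function name `subsampling_prob` is hypothetical, not a Hivemall API); it computes the keep-probability of a word from its relative frequency.

```python
import math

def subsampling_prob(freq_wi, total_freq, sample=1e-4):
    """Keep-probability of word w_i per the formula above:
    sqrt(sample / r) + sample / r, where r = freq(w_i) / sum(freq)."""
    r = freq_wi / total_freq  # relative frequency of w_i
    return math.sqrt(sample / r) + sample / r

# A word accounting for 1% of the corpus with sample=1e-4:
# r = 0.01, so p = sqrt(0.01) + 0.01 = 0.11
p = subsampling_prob(freq_wi=1000, total_freq=100000, sample=1e-4)
```

Note that for rare words the value exceeds 1, which effectively means "always keep"; only frequent words are aggressively down-sampled.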
---