szha closed pull request #9142: add lstm sentimental analysis tutorial
URL: https://github.com/apache/incubator-mxnet/pull/9142
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/example/lstm-sentimal_analysis/lstm_cell.png b/example/lstm-sentimal_analysis/lstm_cell.png
new file mode 100644
index 00000000000..442b50b6f0f
Binary files /dev/null and b/example/lstm-sentimal_analysis/lstm_cell.png differ
diff --git a/example/lstm-sentimal_analysis/lstm_diagram.png b/example/lstm-sentimal_analysis/lstm_diagram.png
new file mode 100644
index 00000000000..b2cebc14e0b
Binary files /dev/null and b/example/lstm-sentimal_analysis/lstm_diagram.png differ
diff --git a/example/lstm-sentimal_analysis/tutorial_lstm.ipynb b/example/lstm-sentimal_analysis/tutorial_lstm.ipynb
new file mode 100644
index 00000000000..571288096b9
--- /dev/null
+++ b/example/lstm-sentimal_analysis/tutorial_lstm.ipynb
@@ -0,0 +1,491 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# LSTM: Sentimental Analysis using IMDB movie review data set\n",
+    "\n",
+    "- In this tutorial, we will create a model that determines whether movie 
reviews are positive or negative.\n",
+    "- This tutorial is intended for those who have a good understanding of 
python and a brief understanding of the LSTM model.\n",
+    "\n",
+    "\n",
+    "## Prerequisites\n",
+    "\n",
+    "You need to install \n",
+    "    - Python 3.6. https://www.python.org/downloads/\n";,
+    "    - mxnet(gpu ver.) 
https://mxnet.incubator.apache.org/get_started/install.html\n";,
+    "\n",
+    "You also need to understand the LSTM model.  \n",
+    "In this tutorial, a simple knowledge already presented on several 
websites is sufficient. However, if you want a deep understanding, please refer 
to the following article.\n",
+    "    - https://arxiv.org/pdf/1503.00185v5.pdf\n";,
+    "    \n",
+    "Finally, you need to know about python, mxnet, gloun package, and so 
on.\n",
+    "\n",
+    "\n",
+    "## The Data\n",
+    "\n",
+    "We will use IMDB. You can download it from the following link.  \n",
+    "It is divided into 25000 train data and test data each labeled.  \n",
+    "See the data set's README for more information.\n",
+    "    - http://ai.stanford.edu/~amaas/data/sentiment/\n";,
+    "    \n",
+    "We will use Stanford's Global Vector for Word Representation (GloVe) 
embedding for word embedding.  \n",
+    "Download glove.42B.300d.zip from the following link:\n",
+    "    - https://nlp.stanford.edu/projects/glove/\n";,
+    "    \n",
+    "    \n",
+    "## Prepare the Data\n",
+    "\n",
+    "##### 1. Read the data and label it. (1 means positive and 0 means 
negative.)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "# helper function to read files\n",
+    "def read_files(folder_name):\n",
+    "    sentiments =[]\n",
+    "    filenames = os.listdir(os.curdir+\"/\"+folder_name)\n",
+    "\n",
+    "    for file in filenames:\n",
+    "        with open(folder_name+\"/\"+file, \"r\", encoding=\"utf8\") as 
f:\n",
+    "            data = f.read().replace(\"\\n\", \"\")\n",
+    "            sentiments.append(data)\n",
+    "\n",
+    "    return sentiments\n",
+    "\n",
+    "data_path = '../data/aclImdb'\n",
+    "\n",
+    "# prepare train data\n",
+    "train_pos_foldername = data_path + \"/train/pos\"\n",
+    "train_pos_sentiments = read_files(train_pos_foldername)\n",
+    "\n",
+    "train_neg_foldername = data_path + \"/train/neg\"\n",
+    "train_neg_sentiments = read_files(train_neg_foldername)\n",
+    "\n",
+    "train_pos_labels = [1 for _ in train_pos_sentiments]\n",
+    "train_neg_labels = [0 for _ in train_neg_sentiments]\n",
+    "train_all_labels = train_pos_labels + train_neg_labels\n",
+    "\n",
+    "train_all_sentiments = train_pos_sentiments + train_neg_sentiments\n",
+    "\n",
+    "# prepare test data\n",
+    "test_pos_foldername = data_path + \"/test/pos\"\n",
+    "test_pos_sentiments = read_files(test_pos_foldername)\n",
+    "\n",
+    "test_neg_foldername = data_path + \"/test/neg\"\n",
+    "test_neg_sentiments = read_files(test_neg_foldername)\n",
+    "\n",
+    "test_pos_labels = [1 for _ in test_pos_sentiments]\n",
+    "test_neg_labels = [0 for _ in test_neg_sentiments]\n",
+    "test_all_labels = test_pos_labels + test_neg_labels\n",
+    "\n",
+    "test_all_sentiments = train_pos_sentiments + train_neg_sentiments"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### 2. We will make a dictionary from the words in the train data. 
Before that, let's define some helper functions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Delete various special characters from the review sentence.\n",
+    "def clear_str(sentiment):\n",
+    "    import re\n",
+    "    string = sentiment.lower().replace(\"<br />\", \" \")\n",
+    "    remove_spe_chars = re.compile(\"[^A-Za-z0-9 ]+\")\n",
+    "\n",
+    "    return re.sub(remove_spe_chars, \"\", string.lower())\n",
+    "\n",
+    "# Count the frequency of occurrence of a word.\n",
+    "def count_word(sentiments):\n",
+    "    from collections import Counter\n",
+    "    \n",
+    "    word_counter = Counter()\n",
+    "    for sentiment in sentiments:\n",
+    "        for word in (clear_str(sentiment)).split():\n",
+    "            if word not in word_counter.keys():\n",
+    "                word_counter[word] = 1\n",
+    "            else:\n",
+    "                word_counter[word] += 1\n",
+    "\n",
+    "    return word_counter\n",
+    "\n",
+    "# make dictionary\n",
+    "def create_word_index(word_counter):\n",
+    "    idx = 1\n",
+    "    word_dict = {}\n",
+    "\n",
+    "    for word in word_counter.most_common():\n",
+    "        word_dict[word[0]] = idx\n",
+    "        idx += 1\n",
+    "\n",
+    "    return word_dict \n",
+    "\n",
+    "# save file. \n",
+    "# The dictionary is very large. \n",
+    "# Therefore, it is better to do this only the first time you create it, 
and then save and load it as a file.\n",
+    "import _pickle as pkl\n",
+    "def save_file(file_obj, file_name):\n",
+    "    f = open(file_name, 'wb')\n",
+    "    pkl.dump(file_obj, f, -1)\n",
+    "    f.close()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### 3. Now let's make a dictionary!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def make_dictionary(dictionary_file_name, sentiments):\n",
+    "    # if file exist, return file\n",
+    "    if os.path.exists(dictionary_file_name):\n",
+    "        f = open(dictionary_file_name, 'rb')\n",
+    "        word_dict = pkl.load(f)\n",
+    "        f.close()\n",
+    "        return word_dict\n",
+    "\n",
+    "    else:\n",
+    "        # count word\n",
+    "        word_counter = count_word(sentiments)\n",
+    "        word_dict = create_word_index(word_counter)\n",
+    "        save_file(word_dict, dictionary_file_name)\n",
+    "        return word_dict\n",
+    "    \n",
+    "\n",
+    "dictionary_file_name = 'imdb.dict.pkl'\n",
+    "word_dict = make_dictionary( dictionary_file_name, train_all_sentiments )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### 4. Encode the sentence using the dictionary created and fit it into 
numpy format."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def encode_sentences(input_file, word_dict):\n",
+    "    output_string = []\n",
+    "    for line in input_file:\n",
+    "        output_line = []\n",
+    "        for word in clear_str(line).split():\n",
+    "            if word in word_dict:\n",
+    "                output_line.append(word_dict[word])\n",
+    "        output_string.append(output_line)\n",
+    "\n",
+    "    return output_string\n",
+    "\n",
+    "train_pos_encoded = encode_sentences(train_pos_sentiments, word_dict)\n",
+    "train_neg_encoded = encode_sentences(train_neg_sentiments, word_dict)\n",
+    "train_all_encoded = train_pos_encoded + train_neg_encoded\n",
+    "\n",
+    "test_pos_encoded = encode_sentences(test_pos_sentiments, word_dict)\n",
+    "test_neg_encoded = encode_sentences(test_neg_sentiments, word_dict)\n",
+    "test_all_encoded = test_pos_encoded + test_neg_encoded\n",
+    "\n",
+    "\n",
+    "import numpy as np\n",
+    "voca_size = 10000 # total number of voca we will track\n",
+    "train_data = [np.array([i if i < (voca_size-1) else (voca_size-1) for i 
in s]) for s in train_all_encoded]\n",
+    "test_data = [np.array([i if i < (voca_size-1) else (voca_size-1) for i in 
s]) for s in test_all_encoded]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### 5. We now want to show what each word means through word embedding, 
which is another big task. In this case, we would like to use the already 
created model, Stanford's Global Vector for Word Representation (Glove) 
embedding. You have already downloaded it from The Data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      
"/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py:46:
 DeprecationWarning: OpenSSL.rand is deprecated - you should use os.urandom 
instead\n",
+      "  import OpenSSL.SSL\n"
+     ]
+    }
+   ],
+   "source": [
+    "from mxnet import nd\n",
+    "def load_glove_index(path):\n",
+    "    import io\n",
+    "    f = io.open(path, encoding=\"utf8\")\n",
+    "    embeddings_index = {}\n",
+    "\n",
+    "    for line in f:\n",
+    "        values = line.split()\n",
+    "        word = values[0]\n",
+    "        coefs = np.asarray(values[1:], dtype = 'float32')\n",
+    "        embeddings_index[word] = coefs\n",
+    "    f.close()\n",
+    "    return embeddings_index\n",
+    "\n",
+    "def create_embed(filename, glove_path, num_embed):\n",
+    "    if os.path.exists(filename):\n",
+    "        f = open(filename, 'rb')\n",
+    "        embedding_matrix = pkl.load(f)\n",
+    "        f.close()\n",
+    "        return embedding_matrix\n",
+    "\n",
+    "    embedding_index = load_glove_index(glove_path)\n",
+    "    \n",
+    "    embedding_matrix = np.zeros((voca_size, num_embed))\n",
+    "    for word, i in word_dict.items():\n",
+    "        if i >= voca_size:\n",
+    "            continue\n",
+    "        embedding_vector = embedding_index.get(word)\n",
+    "        if embedding_vector is not None:\n",
+    "            embedding_matrix[i] = embedding_vector\n",
+    "    embedding_matrix = nd.array(embedding_matrix)\n",
+    "    save_file(embedding_matrix, filename)\n",
+    "    return embedding_matrix\n",
+    "\n",
+    "num_embed = 300\n",
+    "embed_matrix_filename = 'embed.metrix.pkl'\n",
+    "glove_path = '../data/glove.42B.300d.txt'\n",
+    "embedding_matrix = create_embed(embed_matrix_filename, glove_path, 
num_embed)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### 6. Shuffle the train data. As you already know, train data is in 
positive-negative order. However, if learning proceeds in this order, positive 
will be learned first, and correct learning will not be achieved."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.model_selection import train_test_split\n",
+    "x_train, x_test, y_train, y_test = train_test_split(train_data, 
train_all_labels, test_size=0, random_state=42)\n",
+    "\n",
+    "x_test = test_data\n",
+    "y_test = test_all_labels"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### 7. Match the length of the sentence and change it to nd array form 
of mxnet.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "seq_len = 250 # set the max word length of each movie review\n",
+    "\n",
+    "# if sentence is greater than max_len, truncates\n",
+    "# if less, pad with value\n",
+    "def pad_sequences(sentences, max_len=500, value = 0):\n",
+    "    padded_sentences = []\n",
+    "    for sentence in sentences:\n",
+    "        new_sentence = []\n",
+    "        if (len(sentence) > max_len):\n",
+    "            new_sentence = sentence[:max_len]\n",
+    "            padded_sentences.append(new_sentence)\n",
+    "        else:\n",
+    "            new_sentence = np.append(sentence, 
[value]*(max_len-len(sentence)))\n",
+    "            padded_sentences.append(new_sentence)\n",
+    "\n",
+    "    return padded_sentences\n",
+    "\n",
+    "import mxnet as mx\n",
+    "context = mx.gpu()\n",
+    "X_train = nd.array(pad_sequences(x_train, max_len=seq_len, value=0), 
context)\n",
+    "X_test = nd.array(pad_sequences(x_test, max_len=seq_len, value=0), 
context)\n",
+    "Y_train = nd.array(y_train, context)\n",
+    "Y_test = nd.array(y_test, context)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Create the Model\n",
+    "\n",
+    "As you already know, lstm unit has the following form, which saves the 
previous state:\n",
+    "\n",
+    "![Alt text](lstm_cell.png)\n",
+    "\n",
+    "We will build this lstm unit and build a model that determines the 
positive and negative by setting the sigmoid fucntion on top of it to obtain 
the output between 0 and 1.\n",
+    "\n",
+    "![Alt text](lstm_diagram.png)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "num_classes = 1\n",
+    "num_hidden = 16\n",
+    "learning_rate = .01\n",
+    "epochs = 10\n",
+    "batch_size = 100\n",
+    "\n",
+    "from mxnet.gluon import nn, rnn\n",
+    "\n",
+    "model = nn.Sequential()\n",
+    "with model.name_scope():\n",
+    "    model.embed = nn.Embedding(voca_size, num_embed)\n",
+    "    model.add(rnn.LSTM(num_hidden, layout = 'NTC', dropout=0.5, 
bidirectional=False))\n",
+    "    model.add(nn.Dense(num_classes))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Train & Evaluate the Model\n",
+    "\n",
+    "1. We will proceed initialization with the xavior initializer. It is 
known to have the best performance among existing initializers.\n",
+    "2. We will use adadelta as the optimizer. This is also known to be better 
than classic sgd."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Epoch 0. Train_acc 0.83348, Test_acc 0.82272\n",
+      "Epoch 1. Train_acc 0.87936, Test_acc 0.85492\n",
+      "Epoch 2. Train_acc 0.90864, Test_acc 0.86396\n",
+      "Epoch 3. Train_acc 0.93144, Test_acc 0.8658\n",
+      "Epoch 4. Train_acc 0.94948, Test_acc 0.86368\n",
+      "Epoch 5. Train_acc 0.95868, Test_acc 0.85964\n",
+      "Epoch 6. Train_acc 0.9522, Test_acc 0.84844\n",
+      "Epoch 7. Train_acc 0.96992, Test_acc 0.85184\n",
+      "Epoch 8. Train_acc 0.9814, Test_acc 0.8524\n",
+      "Epoch 9. Train_acc 0.98608, Test_acc 0.84772\n"
+     ]
+    }
+   ],
+   "source": [
+    "from mxnet import gluon, autograd\n",
+    "\n",
+    "model.collect_params().initialize(mx.init.Xavier(), ctx=context)\n",
+    "\n",
+    "model.embed.weight.set_data(embedding_matrix.as_in_context(context))\n",
+    "\n",
+    "trainer = gluon.Trainer(model.collect_params(), 'adadelta')\n",
+    "\n",
+    "sigmoid_cross_entropy = gluon.loss.SigmoidBCELoss()\n",
+    "\n",
+    "# helper function: evaluate accuracy\n",
+    "def eval_accuracy(x, y, batch_size):\n",
+    "    accuracy = mx.metric.Accuracy()\n",
+    "\n",
+    "    for i in range(x.shape[0] // batch_size):\n",
+    "        data = x[i*batch_size:(i*batch_size + batch_size), ]\n",
+    "        target = y[i*batch_size:(i*batch_size + batch_size), ]\n",
+    "\n",
+    "        output = model(data)\n",
+    "        predictions = nd.array([( 1 if out >= 0.5 else 0 ) for out in 
output ] , context)\n",
+    "        accuracy.update(preds=predictions, labels=target)\n",
+    "\n",
+    "    return accuracy.get()[1]\n",
+    "\n",
+    "# train\n",
+    "for epoch in range(epochs):\n",
+    "\n",
+    "    for b in range(X_train.shape[0] // batch_size):\n",
+    "        data = X_train[b*batch_size:(b*batch_size + batch_size),]\n",
+    "        target = Y_train[b*batch_size:(b*batch_size + batch_size),]\n",
+    "\n",
+    "        data = data.as_in_context(context)\n",
+    "        target = target.as_in_context(context)\n",
+    "        \n",
+    "        with autograd.record():\n",
+    "            output = model(data)\n",
+    "            L = sigmoid_cross_entropy(output, target)\n",
+    "        L.backward()\n",
+    "        trainer.step(data.shape[0])\n",
+    "\n",
+    "    # filename = \"lstm_net.params_epoch\" + str(epoch)\n",
+    "    # model.save_params(filename)\n",
+    "\n",
+    "    test_accuracy = eval_accuracy(X_test, Y_test, batch_size)\n",
+    "    train_accuracy = eval_accuracy(X_train, Y_train, batch_size)\n",
+    "    print(\"Epoch %s. Train_acc %s, Test_acc %s\" %\n",
+    "          (epoch, train_accuracy, test_accuracy))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Summary\n",
+    "\n",
+    "So far, sentimental analysis has been done through the model using lstm.  
\n",
+    "As you can see from the code above, the accuracy is about 0.86, but it is 
very difficult to raise it. You can get better accuracy by adjusting the batch 
size, optimizer, and number of hidden layers, but it will not change much.   
\n",
+    "This is because of the limitations of the LSTM model itself, there are 
other models with better accuracy of sentimental analysis."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python [conda env:mxnet_p36]",
+   "language": "python",
+   "name": "conda-env-mxnet_p36-py"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
