[GitHub] astonzhang commented on a change in pull request #8763: Add mxnet.text APIs

GitBox Mon, 01 Jan 2018 17:46:11 -0800

astonzhang commented on a change in pull request #8763: Add mxnet.text APIs
URL: https://github.com/apache/incubator-mxnet/pull/8763#discussion_r159168682


 ##########
 File path: python/mxnet/text/embedding.py
 ##########
 @@ -0,0 +1,722 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# coding: utf-8
+# pylint: disable=consider-iterating-dictionary
+
+"""Read text files and load embeddings."""
+from __future__ import absolute_import
+from __future__ import print_function
+
+from collections import Counter
+import io
+import logging
+import os
+import tarfile
+import warnings
+import zipfile
+
+from ..gluon.utils import check_sha1
+from ..gluon.utils import download
+from .. import ndarray as nd
+
+
+class TextIndexer(object):
+    """Indexing for text tokens.
+
+
+    Build indices for the unknown token, reserved tokens, and input counter
+    keys. Indexed tokens can be used by instances of
+    :func:`~mxnet.text.embeddings.TextEmbed`, such as instances of
+    :func:`~mxnet.text.glossary.Glossary`.
+
+
+    Parameters
+    ----------
+    counter : collections.Counter or None, default None
+        Counts text token frequencies in the text data. Its keys will be indexed
+        according to frequency thresholds such as `most_freq_count` and
+        `min_freq`.
+    most_freq_count : None or int, default None
+        The maximum possible number of the most frequent tokens in the keys of
+        `counter` that can be indexed. Note that this argument does not count
+        any token from `reserved_tokens`. If this argument is None or larger
+        than its largest possible value restricted by `counter` and
+        `reserved_tokens`, this argument becomes positive infinity.
+    min_freq : int, default 1
+        The minimum frequency required for a token in the keys of `counter` to
+        be indexed.
+    unknown_token : str, default '<unk>'
+        The string representation for any unknown token. In other words, any
+        unknown token will be indexed as the same string representation. This
+        string representation cannot be any token to be indexed from the keys of
+        `counter` or from `reserved_tokens`.
+    reserved_tokens : list of strs or None, default None
+        A list of reserved tokens that will always be indexed. It cannot contain
+        `unknown_token`, or duplicate reserved tokens.
+
+
+    Properties
+    ----------
+    token_to_idx : dict mapping str to int
+        A dict mapping each token to its index integer.
+    idx_to_token : list of strs
+        A list of indexed tokens where the list indices and the token indices
+        are aligned.
+    unknown_token : str
+        The string representation for any unknown token. In other words, any
+        unknown token will be indexed as the same string representation.
+    reserved_tokens : list of strs or None
+        A list of reserved tokens that will always be indexed.
+    unknown_idx : int
+        The index for `unknown_token`.
+    """
+    def __init__(self, counter=None, most_freq_count=None, min_freq=1,
+                 unknown_token='<unk>', reserved_tokens=None):
+        # Sanity checks.
+        assert min_freq > 0, '`min_freq` must be set to a positive value.'
+
+        if reserved_tokens is not None:
+            for reserved_token in reserved_tokens:
+                assert reserved_token != unknown_token, \
+                    '`reserved_tokens` cannot contain `unknown_token`.'
+            assert len(set(reserved_tokens)) == len(reserved_tokens), \
+                '`reserved_tokens` cannot contain duplicate reserved tokens.'
+
+        self._index_unknown_and_reserved_tokens(unknown_token, reserved_tokens)
+
+        if counter is not None:
+            self._index_counter_keys(counter, unknown_token, reserved_tokens,
+                                     most_freq_count, min_freq)
 
 Review comment:
   Besides, `unknown_token not in counter` is much slower than
   `unknown_token not in set(counter.keys())`. Thus, handling it inside
   `_index_counter_keys` saves time and memory.
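
   The relative cost of the two membership tests can be measured directly.
   Below is a minimal sketch, assuming a synthetic counter (the token names
   and sizes are made up for illustration and are not data from this PR).
   Note that `collections.Counter` subclasses `dict`, so `in` on the counter
   is itself a hash lookup, while `set(counter.keys())` copies every key
   before testing. Timings vary by machine and Python version, so no expected
   numbers are shown.

```python
from collections import Counter
import timeit

# Synthetic counter for illustration only; not data from the PR.
counter = Counter({'token%d' % i: i + 1 for i in range(100000)})
unknown_token = '<unk>'


def check_in_counter():
    # Membership test directly on the Counter (a dict subclass).
    return unknown_token not in counter


def check_in_key_set():
    # Membership test after copying all keys into a new set.
    return unknown_token not in set(counter.keys())


# Both checks must agree on the answer; only their cost may differ.
assert check_in_counter() == check_in_key_set()

t_counter = timeit.timeit(check_in_counter, number=100)
t_key_set = timeit.timeit(check_in_key_set, number=100)
print('in counter:             %.6f s' % t_counter)
print('in set(counter.keys()): %.6f s' % t_key_set)
```

   Running this on the target Python version shows which form is cheaper
   before deciding where the check should live.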

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
