Hello community,

here is the log from the commit of package python-annoy for openSUSE:Factory 
checked in at 2019-03-06 15:52:22
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-annoy (Old)
 and      /work/SRC/openSUSE:Factory/.python-annoy.new.28833 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Package is "python-annoy"

Wed Mar  6 15:52:22 2019 rev:3 rq:682131 version:1.15.1

Changes:
--------
--- /work/SRC/openSUSE:Factory/python-annoy/python-annoy.changes        2018-07-13 10:21:01.694432265 +0200
+++ /work/SRC/openSUSE:Factory/.python-annoy.new.28833/python-annoy.changes     2019-03-06 15:52:33.608421873 +0100
@@ -1,0 +2,11 @@
+Wed Mar  6 12:09:25 UTC 2019 - Tomáš Chvátal <tchva...@suse.com>
+
+- Update to 1.15.1:
+  * Various minor fixes
+  * Fixes to the Euclidean distance function (avoid catastrophic cancellation)
+  * Don't MAP_POPULATE by default
+  * dot products are now supported
+- Update patch reproducible.patch:
+  * expand to not screw with cflags either
+
+-------------------------------------------------------------------
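
The changelog above notes Euclidean-distance fixes to avoid catastrophic cancellation, but the upstream patch itself is not quoted here. The failure mode is easy to reproduce in plain Python; this is an illustrative sketch (the function names are made up for this example, not annoy's internals):

```python
import math

def euclidean_expanded(u, v):
    # Expanded form sqrt(|u|^2 - 2*u.v + |v|^2): prone to catastrophic
    # cancellation when u and v are large in magnitude but nearly equal.
    uu = sum(x * x for x in u)
    vv = sum(y * y for y in v)
    uv = sum(x * y for x, y in zip(u, v))
    return math.sqrt(max(uu - 2.0 * uv + vv, 0.0))

def euclidean_direct(u, v):
    # Direct form sqrt(sum((u_i - v_i)^2)): subtracts first, so the
    # large shared magnitudes cancel exactly before anything is squared.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

u = [1e8, 1e8, 1e8]
v = [1e8 + 1, 1e8, 1e8]          # true distance is exactly 1.0
print(euclidean_direct(u, v))    # 1.0
print(euclidean_expanded(u, v))  # far smaller than the true 1.0
```

The expanded form is attractive for a nearest-neighbor library because |u|^2 and u.v can be precomputed per item, which is presumably why a careful fix matters here; the direct form trades that away for numerical safety.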

Old:
----
  annoy-1.12.0.tar.gz

New:
----
  annoy-1.15.1.tar.gz

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ python-annoy.spec ++++++
--- /var/tmp/diff_new_pack.TGEvdt/_old  2019-03-06 15:52:36.092421374 +0100
+++ /var/tmp/diff_new_pack.TGEvdt/_new  2019-03-06 15:52:36.124421367 +0100
@@ -1,7 +1,7 @@
 #
 # spec file for package python-annoy
 #
-# Copyright (c) 2018 SUSE LINUX GmbH, Nuernberg, Germany.
+# Copyright (c) 2019 SUSE LINUX GmbH, Nuernberg, Germany.
 #
 # All modifications and additions to the file contributed by third parties
 # remain the property of their copyright owners, unless otherwise agreed
@@ -12,18 +12,18 @@
 # license that conforms to the Open Source Definition (Version 1.9)
 # published by the Open Source Initiative.
 
-# Please submit bugfixes or comments via http://bugs.opensuse.org/
+# Please submit bugfixes or comments via https://bugs.opensuse.org/
+#
 
 
 %{?!python_module:%define python_module() python-%{**} python3-%{**}}
-%bcond_with     test
 Name:           python-annoy
-Version:        1.12.0
+Version:        1.15.1
 Release:        0
-License:        Apache-2.0
 Summary:        Approximate Nearest Neighbors
-Url:            https://github.com/spotify/annoy
+License:        Apache-2.0
 Group:          Development/Languages/Python
+URL:            https://github.com/spotify/annoy
 Source:         
https://files.pythonhosted.org/packages/source/a/annoy/annoy-%{version}.tar.gz
 # PATCH-FIX-OPENSUSE boo#1100677
 Patch0:         reproducible.patch
@@ -33,7 +33,6 @@
 BuildRequires:  c++_compiler
 BuildRequires:  fdupes
 BuildRequires:  python-rpm-macros
-
 %python_subpackages
 
 %description
@@ -48,20 +47,14 @@
 %patch0 -p1
 
 %build
-export CFLAGS="%{optflags}"
+export CFLAGS="%{optflags} -fno-strict-aliasing"
 %python_build
 
 %install
 %python_install
 %python_expand %fdupes %{buildroot}%{$python_sitearch}
 
-%if %{with test}
-%check
-%python_exec setup.py test
-%endif
-
 %files %{python_files}
-%defattr(-,root,root,-)
 %doc README.rst
 %license LICENSE
 %{python_sitearch}/*

++++++ annoy-1.12.0.tar.gz -> annoy-1.15.1.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' 
'--exclude=.svnignore' old/annoy-1.12.0/PKG-INFO new/annoy-1.15.1/PKG-INFO
--- old/annoy-1.12.0/PKG-INFO   2018-05-07 05:10:10.000000000 +0200
+++ new/annoy-1.15.1/PKG-INFO   2019-02-22 17:11:43.000000000 +0100
@@ -1,6 +1,6 @@
 Metadata-Version: 1.1
 Name: annoy
-Version: 1.12.0
+Version: 1.15.1
 Summary: Approximate Nearest Neighbors in C++/Python optimized for memory 
usage and loading/saving to disk.
 Home-page: https://github.com/spotify/annoy
 Author: Erik Bernhardsson
@@ -39,7 +39,7 @@
         Install
         -------
         
-        To install, simply do ``sudo pip install annoy`` to pull down the 
latest version from `PyPI <https://pypi.python.org/pypi/annoy>`_.
+        To install, simply do ``pip install --user annoy`` to pull down the 
latest version from `PyPI <https://pypi.python.org/pypi/annoy>`_.
         
         For the C++ version, just clone the repo and ``#include "annoylib.h"``.
         
@@ -48,7 +48,7 @@
         
         There are some other libraries to do nearest neighbor search. Annoy is 
almost as fast as the fastest libraries, (see below), but there is actually 
another feature that really sets Annoy apart: it has the ability to **use 
static files as indexes**. In particular, this means you can **share index 
across processes**. Annoy also decouples creating indexes from loading them, so 
you can pass around indexes as files and map them into memory quickly. Another 
nice thing of Annoy is that it tries to minimize memory footprint so the 
indexes are quite small.
         
-        Why is this useful? If you want to find nearest neighbors and you have 
many CPU's, you only need the RAM to fit the index once. You can also pass 
around and distribute static files to use in production environment, in Hadoop 
jobs, etc. Any process will be able to load (mmap) the index into memory and 
will be able to do lookups immediately.
+        Why is this useful? If you want to find nearest neighbors and you have 
many CPU's, you only need to build the index once. You can also pass around and 
distribute static files to use in production environment, in Hadoop jobs, etc. 
Any process will be able to load (mmap) the index into memory and will be able 
to do lookups immediately.
         
         We use it at `Spotify <http://www.spotify.com/>`__ for music 
recommendations. After running matrix factorization algorithms, every user/item 
can be represented as a vector in f-dimensional space. This library helps us 
search for similar users/items. We have many millions of tracks in a 
high-dimensional space, so memory usage is a prime concern.
         
@@ -57,13 +57,14 @@
         Summary of features
         -------------------
         
-        * `Euclidean distance 
<https://en.wikipedia.org/wiki/Euclidean_distance>`__, `Manhattan distance 
<https://en.wikipedia.org/wiki/Taxicab_geometry>`__, `cosine distance 
<https://en.wikipedia.org/wiki/Cosine_similarity>`__, or `Hamming distance 
<https://en.wikipedia.org/wiki/Hamming_distance>`__
+        * `Euclidean distance 
<https://en.wikipedia.org/wiki/Euclidean_distance>`__, `Manhattan distance 
<https://en.wikipedia.org/wiki/Taxicab_geometry>`__, `cosine distance 
<https://en.wikipedia.org/wiki/Cosine_similarity>`__, `Hamming distance 
<https://en.wikipedia.org/wiki/Hamming_distance>`__, or `Dot (Inner) Product 
distance <https://en.wikipedia.org/wiki/Dot_product>`__
         * Cosine distance is equivalent to Euclidean distance of normalized 
vectors = sqrt(2-2*cos(u, v))
         * Works better if you don't have too many dimensions (like <100) but 
seems to perform surprisingly well even up to 1,000 dimensions
         * Small memory usage
         * Lets you share memory between multiple processes
         * Index creation is separate from lookup (in particular you can not 
add more items once the tree has been created)
-        * Native Python support, tested with 2.6, 2.7, 3.3, 3.4, 3.5
+        * Native Python support, tested with 2.7, 3.6, and 3.7.
+        * Build index on disk to enable indexing big datasets that won't fit 
into memory (contributed by `Rene Hollander 
<https://github.com/ReneHollander>`__)
         
         Python code example
         -------------------
@@ -93,17 +94,19 @@
         Full Python API
         ---------------
         
-        * ``AnnoyIndex(f, metric='angular')`` returns a new index that's 
read-write and stores vector of ``f`` dimensions. Metric can be ``"angular"``, 
``"euclidean"``, ``"manhattan"``, or ``"hamming"``.
+        * ``AnnoyIndex(f, metric='angular')`` returns a new index that's 
read-write and stores vector of ``f`` dimensions. Metric can be ``"angular"``, 
``"euclidean"``, ``"manhattan"``, ``"hamming"``, or ``"dot"``.
         * ``a.add_item(i, v)`` adds item ``i`` (any nonnegative integer) with 
vector ``v``. Note that it will allocate memory for ``max(i)+1`` items.
         * ``a.build(n_trees)`` builds a forest of ``n_trees`` trees. More 
trees gives higher precision when querying. After calling ``build``, no more 
items can be added.
-        * ``a.save(fn)`` saves the index to disk.
-        * ``a.load(fn)`` loads (mmaps) an index from disk.
+        * ``a.save(fn, prefault=False)`` saves the index to disk and loads it 
(see next function). After saving, no more items can be added.
+        * ``a.load(fn, prefault=False)`` loads (mmaps) an index from disk. If 
`prefault` is set to `True`, it will pre-read the entire file into memory 
(using mmap with `MAP_POPULATE`). Default is `False`.
         * ``a.unload()`` unloads.
         * ``a.get_nns_by_item(i, n, search_k=-1, include_distances=False)`` 
returns the ``n`` closest items. During the query it will inspect up to 
``search_k`` nodes which defaults to ``n_trees * n`` if not provided. 
``search_k`` gives you a run-time tradeoff between better accuracy and speed. 
If you set ``include_distances`` to ``True``, it will return a 2 element tuple 
with two lists in it: the second one containing all corresponding distances.
         * ``a.get_nns_by_vector(v, n, search_k=-1, include_distances=False)`` 
same but query by vector ``v``.
         * ``a.get_item_vector(i)`` returns the vector for item ``i`` that was 
previously added.
         * ``a.get_distance(i, j)`` returns the distance between items ``i`` 
and ``j``. NOTE: this used to return the *squared* distance, but has been 
changed as of Aug 2016.
         * ``a.get_n_items()`` returns the number of items in the index.
+        * ``a.get_n_trees()`` returns the number of trees in the index.
+        * ``a.on_disk_build(fn)`` prepares annoy to build the index in the 
specified file instead of RAM (execute before adding items, no need to save 
after build)
         
         Notes:
         
@@ -116,12 +119,15 @@
         Tradeoffs
         ---------
         
-        There are just two parameters you can use to tune Annoy: the number of 
trees ``n_trees`` and the number of nodes to inspect during searching 
``search_k``.
+        There are just two main parameters needed to tune Annoy: the number of 
trees ``n_trees`` and the number of nodes to inspect during searching 
``search_k``.
         
         * ``n_trees`` is provided during build time and affects the build time 
and the index size. A larger value will give more accurate results, but larger 
indexes.
         * ``search_k`` is provided in runtime and affects the search 
performance. A larger value will give more accurate results, but will take 
longer time to return.
         
-        If ``search_k`` is not provided, it will default to ``n * n_trees`` 
where ``n`` is the number of approximate nearest neighbors. Otherwise, 
``search_k`` and ``n_trees`` are roughly independent, i.e. a the value of 
``n_trees`` will not affect search time if ``search_k`` is held constant and 
vice versa. Basically it's recommended to set ``n_trees`` as large as possible 
given the amount of memory you can afford, and it's recommended to set 
``search_k`` as large as possible given the time constraints you have for the 
queries.
+        If ``search_k`` is not provided, it will default to ``n * n_trees * 
D`` where ``n`` is the number of approximate nearest neighbors and ``D`` is a 
constant depending on the metric. Otherwise, ``search_k`` and ``n_trees`` are 
roughly independent, i.e. a the value of ``n_trees`` will not affect search 
time if ``search_k`` is held constant and vice versa. Basically it's 
recommended to set ``n_trees`` as large as possible given the amount of memory 
you can afford, and it's recommended to set ``search_k`` as large as possible 
given the time constraints you have for the queries.
+        
+        You can also accept slower search times in favour of reduced loading 
times, memory usage, and disk IO. On supported platforms the index is 
prefaulted during ``load`` and ``save``, causing the file to be pre-emptively 
read from disk into memory. If you set ``prefault`` to ``False``, pages of the 
mmapped index are instead read from disk and cached in memory on-demand, as 
necessary for a search to complete. This can significantly increase early 
search times but may be better suited for systems with low memory compared to 
index size, when few queries are executed against a loaded index, and/or when 
large areas of the index are unlikely to be relevant to search queries.
+        
         
         How does it work
         ----------------
@@ -132,6 +138,10 @@
         
         Hamming distance (contributed by `Martin Aumüller 
<https://github.com/maumueller>`__) packs the data into 64-bit integers under 
the hood and uses built-in bit count primitives so it could be quite fast. All 
splits are axis-aligned.
         
+        Dot Product distance (contributed by `Peter Sobot 
<https://github.com/psobot>`__) reduces the provided vectors from dot (or 
"inner-product") space to a more query-friendly cosine space using `a method by 
Bachrach et al., at Microsoft Research, published in 2014 
<https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/XboxInnerProduct.pdf>`__.
+        
+        
+        
         More info
         ---------
         
@@ -146,7 +156,7 @@
         * Radim Řehůřek's blog posts comparing Annoy to a couple of other 
similar Python libraries: `Intro 
<http://radimrehurek.com/2013/11/performance-shootout-of-nearest-neighbours-intro/>`__,
 `Contestants 
<http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/>`__,
 `Querying 
<http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/>`__
         * `ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`__ is a 
benchmark for several approximate nearest neighbor libraries. Annoy seems to be 
fairly competitive, especially at higher precisions:
         
-        .. figure:: 
https://raw.github.com/erikbern/ann-benchmarks/master/results/glove.png
+        .. figure:: 
https://github.com/erikbern/ann-benchmarks/raw/master/results/glove-100-angular.png
            :alt: ANN benchmarks
            :align: center
            :target: https://github.com/erikbern/ann-benchmarks
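
The feature list quoted in the diff above states that cosine distance is equivalent to the Euclidean distance of normalized vectors, sqrt(2-2*cos(u, v)). That identity can be checked with a quick stdlib-only sketch, independent of annoy:

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cos_sim(u, v):
    # Standard cosine similarity: u.v / (|u| * |v|).
    return sum(x * y for x, y in zip(u, v)) / (
        math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

random.seed(0)
u = [random.gauss(0, 1) for _ in range(10)]
v = [random.gauss(0, 1) for _ in range(10)]

# Euclidean distance between the normalized vectors...
nu, nv = normalize(u), normalize(v)
d = math.sqrt(sum((x - y) ** 2 for x, y in zip(nu, nv)))

# ...matches sqrt(2 - 2*cos(u, v)) up to floating-point rounding.
print(abs(d - math.sqrt(2.0 - 2.0 * cos_sim(u, v))) < 1e-12)  # True
```

This follows from expanding |nu - nv|^2 = |nu|^2 + |nv|^2 - 2*nu.nv = 2 - 2*cos(u, v) for unit vectors.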
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' 
'--exclude=.svnignore' old/annoy-1.12.0/README.rst new/annoy-1.15.1/README.rst
--- old/annoy-1.12.0/README.rst 2018-02-07 02:41:53.000000000 +0100
+++ new/annoy-1.15.1/README.rst 2019-02-21 17:39:33.000000000 +0100
@@ -21,7 +21,7 @@
 Install
 -------
 
-To install, simply do ``sudo pip install annoy`` to pull down the latest 
version from `PyPI <https://pypi.python.org/pypi/annoy>`_.
+To install, simply do ``pip install --user annoy`` to pull down the latest 
version from `PyPI <https://pypi.python.org/pypi/annoy>`_.
 
 For the C++ version, just clone the repo and ``#include "annoylib.h"``.
 
@@ -30,7 +30,7 @@
 
 There are some other libraries to do nearest neighbor search. Annoy is almost 
as fast as the fastest libraries, (see below), but there is actually another 
feature that really sets Annoy apart: it has the ability to **use static files 
as indexes**. In particular, this means you can **share index across 
processes**. Annoy also decouples creating indexes from loading them, so you 
can pass around indexes as files and map them into memory quickly. Another nice 
thing of Annoy is that it tries to minimize memory footprint so the indexes are 
quite small.
 
-Why is this useful? If you want to find nearest neighbors and you have many 
CPU's, you only need the RAM to fit the index once. You can also pass around 
and distribute static files to use in production environment, in Hadoop jobs, 
etc. Any process will be able to load (mmap) the index into memory and will be 
able to do lookups immediately.
+Why is this useful? If you want to find nearest neighbors and you have many 
CPU's, you only need to build the index once. You can also pass around and 
distribute static files to use in production environment, in Hadoop jobs, etc. 
Any process will be able to load (mmap) the index into memory and will be able 
to do lookups immediately.
 
 We use it at `Spotify <http://www.spotify.com/>`__ for music recommendations. 
After running matrix factorization algorithms, every user/item can be 
represented as a vector in f-dimensional space. This library helps us search 
for similar users/items. We have many millions of tracks in a high-dimensional 
space, so memory usage is a prime concern.
 
@@ -39,13 +39,14 @@
 Summary of features
 -------------------
 
-* `Euclidean distance <https://en.wikipedia.org/wiki/Euclidean_distance>`__, 
`Manhattan distance <https://en.wikipedia.org/wiki/Taxicab_geometry>`__, 
`cosine distance <https://en.wikipedia.org/wiki/Cosine_similarity>`__, or 
`Hamming distance <https://en.wikipedia.org/wiki/Hamming_distance>`__
+* `Euclidean distance <https://en.wikipedia.org/wiki/Euclidean_distance>`__, 
`Manhattan distance <https://en.wikipedia.org/wiki/Taxicab_geometry>`__, 
`cosine distance <https://en.wikipedia.org/wiki/Cosine_similarity>`__, `Hamming 
distance <https://en.wikipedia.org/wiki/Hamming_distance>`__, or `Dot (Inner) 
Product distance <https://en.wikipedia.org/wiki/Dot_product>`__
 * Cosine distance is equivalent to Euclidean distance of normalized vectors = 
sqrt(2-2*cos(u, v))
 * Works better if you don't have too many dimensions (like <100) but seems to 
perform surprisingly well even up to 1,000 dimensions
 * Small memory usage
 * Lets you share memory between multiple processes
 * Index creation is separate from lookup (in particular you can not add more 
items once the tree has been created)
-* Native Python support, tested with 2.6, 2.7, 3.3, 3.4, 3.5
+* Native Python support, tested with 2.7, 3.6, and 3.7.
+* Build index on disk to enable indexing big datasets that won't fit into 
memory (contributed by `Rene Hollander <https://github.com/ReneHollander>`__)
 
 Python code example
 -------------------
@@ -75,17 +76,19 @@
 Full Python API
 ---------------
 
-* ``AnnoyIndex(f, metric='angular')`` returns a new index that's read-write 
and stores vector of ``f`` dimensions. Metric can be ``"angular"``, 
``"euclidean"``, ``"manhattan"``, or ``"hamming"``.
+* ``AnnoyIndex(f, metric='angular')`` returns a new index that's read-write 
and stores vector of ``f`` dimensions. Metric can be ``"angular"``, 
``"euclidean"``, ``"manhattan"``, ``"hamming"``, or ``"dot"``.
 * ``a.add_item(i, v)`` adds item ``i`` (any nonnegative integer) with vector 
``v``. Note that it will allocate memory for ``max(i)+1`` items.
 * ``a.build(n_trees)`` builds a forest of ``n_trees`` trees. More trees gives 
higher precision when querying. After calling ``build``, no more items can be 
added.
-* ``a.save(fn)`` saves the index to disk.
-* ``a.load(fn)`` loads (mmaps) an index from disk.
+* ``a.save(fn, prefault=False)`` saves the index to disk and loads it (see 
next function). After saving, no more items can be added.
+* ``a.load(fn, prefault=False)`` loads (mmaps) an index from disk. If 
`prefault` is set to `True`, it will pre-read the entire file into memory 
(using mmap with `MAP_POPULATE`). Default is `False`.
 * ``a.unload()`` unloads.
 * ``a.get_nns_by_item(i, n, search_k=-1, include_distances=False)`` returns 
the ``n`` closest items. During the query it will inspect up to ``search_k`` 
nodes which defaults to ``n_trees * n`` if not provided. ``search_k`` gives you 
a run-time tradeoff between better accuracy and speed. If you set 
``include_distances`` to ``True``, it will return a 2 element tuple with two 
lists in it: the second one containing all corresponding distances.
 * ``a.get_nns_by_vector(v, n, search_k=-1, include_distances=False)`` same but 
query by vector ``v``.
 * ``a.get_item_vector(i)`` returns the vector for item ``i`` that was 
previously added.
 * ``a.get_distance(i, j)`` returns the distance between items ``i`` and ``j``. 
NOTE: this used to return the *squared* distance, but has been changed as of 
Aug 2016.
 * ``a.get_n_items()`` returns the number of items in the index.
+* ``a.get_n_trees()`` returns the number of trees in the index.
+* ``a.on_disk_build(fn)`` prepares annoy to build the index in the specified 
file instead of RAM (execute before adding items, no need to save after build)
 
 Notes:
 
@@ -98,12 +101,15 @@
 Tradeoffs
 ---------
 
-There are just two parameters you can use to tune Annoy: the number of trees 
``n_trees`` and the number of nodes to inspect during searching ``search_k``.
+There are just two main parameters needed to tune Annoy: the number of trees 
``n_trees`` and the number of nodes to inspect during searching ``search_k``.
 
 * ``n_trees`` is provided during build time and affects the build time and the 
index size. A larger value will give more accurate results, but larger indexes.
 * ``search_k`` is provided in runtime and affects the search performance. A 
larger value will give more accurate results, but will take longer time to 
return.
 
-If ``search_k`` is not provided, it will default to ``n * n_trees`` where 
``n`` is the number of approximate nearest neighbors. Otherwise, ``search_k`` 
and ``n_trees`` are roughly independent, i.e. a the value of ``n_trees`` will 
not affect search time if ``search_k`` is held constant and vice versa. 
Basically it's recommended to set ``n_trees`` as large as possible given the 
amount of memory you can afford, and it's recommended to set ``search_k`` as 
large as possible given the time constraints you have for the queries.
+If ``search_k`` is not provided, it will default to ``n * n_trees * D`` where 
``n`` is the number of approximate nearest neighbors and ``D`` is a constant 
depending on the metric. Otherwise, ``search_k`` and ``n_trees`` are roughly 
independent, i.e. a the value of ``n_trees`` will not affect search time if 
``search_k`` is held constant and vice versa. Basically it's recommended to set 
``n_trees`` as large as possible given the amount of memory you can afford, and 
it's recommended to set ``search_k`` as large as possible given the time 
constraints you have for the queries.
+
+You can also accept slower search times in favour of reduced loading times, 
memory usage, and disk IO. On supported platforms the index is prefaulted 
during ``load`` and ``save``, causing the file to be pre-emptively read from 
disk into memory. If you set ``prefault`` to ``False``, pages of the mmapped 
index are instead read from disk and cached in memory on-demand, as necessary 
for a search to complete. This can significantly increase early search times 
but may be better suited for systems with low memory compared to index size, 
when few queries are executed against a loaded index, and/or when large areas 
of the index are unlikely to be relevant to search queries.
+
 
 How does it work
 ----------------
@@ -114,6 +120,10 @@
 
 Hamming distance (contributed by `Martin Aumüller 
<https://github.com/maumueller>`__) packs the data into 64-bit integers under 
the hood and uses built-in bit count primitives so it could be quite fast. All 
splits are axis-aligned.
 
+Dot Product distance (contributed by `Peter Sobot 
<https://github.com/psobot>`__) reduces the provided vectors from dot (or 
"inner-product") space to a more query-friendly cosine space using `a method by 
Bachrach et al., at Microsoft Research, published in 2014 
<https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/XboxInnerProduct.pdf>`__.
+
+
+
 More info
 ---------
 
@@ -128,7 +138,7 @@
 * Radim Řehůřek's blog posts comparing Annoy to a couple of other similar 
Python libraries: `Intro 
<http://radimrehurek.com/2013/11/performance-shootout-of-nearest-neighbours-intro/>`__,
 `Contestants 
<http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/>`__,
 `Querying 
<http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/>`__
 * `ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`__ is a 
benchmark for several approximate nearest neighbor libraries. Annoy seems to be 
fairly competitive, especially at higher precisions:
 
-.. figure:: 
https://raw.github.com/erikbern/ann-benchmarks/master/results/glove.png
+.. figure:: 
https://github.com/erikbern/ann-benchmarks/raw/master/results/glove-100-angular.png
    :alt: ANN benchmarks
    :align: center
    :target: https://github.com/erikbern/ann-benchmarks
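
The Dot Product paragraph above cites the Bachrach et al. reduction without spelling it out. The core trick, sketched from scratch here (this is not annoy's actual code), is to append one extra coordinate sqrt(M^2 - |x|^2) to every indexed item, where M is the largest item norm: every augmented item then has norm exactly M, the query gets a 0 appended so inner products are preserved, and ranking by cosine over the augmented vectors equals ranking by raw dot product:

```python
import math
import random

def augment_items(items):
    # M = largest item norm; each item x gets the extra coordinate
    # sqrt(M^2 - |x|^2), so every augmented item has norm exactly M.
    m = max(math.sqrt(sum(x * x for x in v)) for v in items)
    return [v + [math.sqrt(max(m * m - sum(x * x for x in v), 0.0))]
            for v in items]

def augment_query(q):
    # The query gets 0 appended, so dot(aug_x, aug_q) == dot(x, q).
    return q + [0.0]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(1)
items = [[random.gauss(0, 1) for _ in range(5)] for _ in range(50)]
q = [random.gauss(0, 1) for _ in range(5)]

aug = augment_items(items)
aq = augment_query(q)

# Ranking by raw inner product equals ranking by cosine similarity on
# the augmented vectors: all augmented items share the same norm M, and
# the query norm is a shared constant, so both cancel in the ordering.
by_dot = sorted(range(50), key=lambda i: -dot(items[i], q))
by_cos = sorted(range(50),
                key=lambda i: -dot(aug[i], aq) / math.sqrt(dot(aug[i], aug[i])))
print(by_dot == by_cos)  # True
```

This is why a cosine-style (angular) tree over the augmented space can serve maximum-inner-product queries.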
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' 
'--exclude=.svnignore' old/annoy-1.12.0/annoy.egg-info/PKG-INFO 
new/annoy-1.15.1/annoy.egg-info/PKG-INFO
--- old/annoy-1.12.0/annoy.egg-info/PKG-INFO    2018-05-07 05:10:09.000000000 +0200
+++ new/annoy-1.15.1/annoy.egg-info/PKG-INFO    2019-02-22 17:11:43.000000000 +0100
@@ -1,6 +1,6 @@
 Metadata-Version: 1.1
 Name: annoy
-Version: 1.12.0
+Version: 1.15.1
 Summary: Approximate Nearest Neighbors in C++/Python optimized for memory 
usage and loading/saving to disk.
 Home-page: https://github.com/spotify/annoy
 Author: Erik Bernhardsson
@@ -39,7 +39,7 @@
         Install
         -------
         
-        To install, simply do ``sudo pip install annoy`` to pull down the 
latest version from `PyPI <https://pypi.python.org/pypi/annoy>`_.
+        To install, simply do ``pip install --user annoy`` to pull down the 
latest version from `PyPI <https://pypi.python.org/pypi/annoy>`_.
         
         For the C++ version, just clone the repo and ``#include "annoylib.h"``.
         
@@ -48,7 +48,7 @@
         
         There are some other libraries to do nearest neighbor search. Annoy is 
almost as fast as the fastest libraries, (see below), but there is actually 
another feature that really sets Annoy apart: it has the ability to **use 
static files as indexes**. In particular, this means you can **share index 
across processes**. Annoy also decouples creating indexes from loading them, so 
you can pass around indexes as files and map them into memory quickly. Another 
nice thing of Annoy is that it tries to minimize memory footprint so the 
indexes are quite small.
         
-        Why is this useful? If you want to find nearest neighbors and you have 
many CPU's, you only need the RAM to fit the index once. You can also pass 
around and distribute static files to use in production environment, in Hadoop 
jobs, etc. Any process will be able to load (mmap) the index into memory and 
will be able to do lookups immediately.
+        Why is this useful? If you want to find nearest neighbors and you have 
many CPU's, you only need to build the index once. You can also pass around and 
distribute static files to use in production environment, in Hadoop jobs, etc. 
Any process will be able to load (mmap) the index into memory and will be able 
to do lookups immediately.
         
         We use it at `Spotify <http://www.spotify.com/>`__ for music 
recommendations. After running matrix factorization algorithms, every user/item 
can be represented as a vector in f-dimensional space. This library helps us 
search for similar users/items. We have many millions of tracks in a 
high-dimensional space, so memory usage is a prime concern.
         
@@ -57,13 +57,14 @@
         Summary of features
         -------------------
         
-        * `Euclidean distance 
<https://en.wikipedia.org/wiki/Euclidean_distance>`__, `Manhattan distance 
<https://en.wikipedia.org/wiki/Taxicab_geometry>`__, `cosine distance 
<https://en.wikipedia.org/wiki/Cosine_similarity>`__, or `Hamming distance 
<https://en.wikipedia.org/wiki/Hamming_distance>`__
+        * `Euclidean distance 
<https://en.wikipedia.org/wiki/Euclidean_distance>`__, `Manhattan distance 
<https://en.wikipedia.org/wiki/Taxicab_geometry>`__, `cosine distance 
<https://en.wikipedia.org/wiki/Cosine_similarity>`__, `Hamming distance 
<https://en.wikipedia.org/wiki/Hamming_distance>`__, or `Dot (Inner) Product 
distance <https://en.wikipedia.org/wiki/Dot_product>`__
         * Cosine distance is equivalent to Euclidean distance of normalized 
vectors = sqrt(2-2*cos(u, v))
         * Works better if you don't have too many dimensions (like <100) but 
seems to perform surprisingly well even up to 1,000 dimensions
         * Small memory usage
         * Lets you share memory between multiple processes
         * Index creation is separate from lookup (in particular you can not 
add more items once the tree has been created)
-        * Native Python support, tested with 2.6, 2.7, 3.3, 3.4, 3.5
+        * Native Python support, tested with 2.7, 3.6, and 3.7.
+        * Build index on disk to enable indexing big datasets that won't fit 
into memory (contributed by `Rene Hollander 
<https://github.com/ReneHollander>`__)
         
         Python code example
         -------------------
@@ -93,17 +94,19 @@
         Full Python API
         ---------------
         
-        * ``AnnoyIndex(f, metric='angular')`` returns a new index that's 
read-write and stores vector of ``f`` dimensions. Metric can be ``"angular"``, 
``"euclidean"``, ``"manhattan"``, or ``"hamming"``.
+        * ``AnnoyIndex(f, metric='angular')`` returns a new index that's 
read-write and stores vector of ``f`` dimensions. Metric can be ``"angular"``, 
``"euclidean"``, ``"manhattan"``, ``"hamming"``, or ``"dot"``.
         * ``a.add_item(i, v)`` adds item ``i`` (any nonnegative integer) with 
vector ``v``. Note that it will allocate memory for ``max(i)+1`` items.
         * ``a.build(n_trees)`` builds a forest of ``n_trees`` trees. More 
trees gives higher precision when querying. After calling ``build``, no more 
items can be added.
-        * ``a.save(fn)`` saves the index to disk.
-        * ``a.load(fn)`` loads (mmaps) an index from disk.
+        * ``a.save(fn, prefault=False)`` saves the index to disk and loads it 
(see next function). After saving, no more items can be added.
+        * ``a.load(fn, prefault=False)`` loads (mmaps) an index from disk. If 
`prefault` is set to `True`, it will pre-read the entire file into memory 
(using mmap with `MAP_POPULATE`). Default is `False`.
         * ``a.unload()`` unloads.
         * ``a.get_nns_by_item(i, n, search_k=-1, include_distances=False)`` 
returns the ``n`` closest items. During the query it will inspect up to 
``search_k`` nodes which defaults to ``n_trees * n`` if not provided. 
``search_k`` gives you a run-time tradeoff between better accuracy and speed. 
If you set ``include_distances`` to ``True``, it will return a 2 element tuple 
with two lists in it: the second one containing all corresponding distances.
         * ``a.get_nns_by_vector(v, n, search_k=-1, include_distances=False)`` 
same but query by vector ``v``.
         * ``a.get_item_vector(i)`` returns the vector for item ``i`` that was 
previously added.
         * ``a.get_distance(i, j)`` returns the distance between items ``i`` 
and ``j``. NOTE: this used to return the *squared* distance, but has been 
changed as of Aug 2016.
         * ``a.get_n_items()`` returns the number of items in the index.
+        * ``a.get_n_trees()`` returns the number of trees in the index.
+        * ``a.on_disk_build(fn)`` prepares annoy to build the index in the 
specified file instead of RAM (execute before adding items, no need to save 
after build)
         
         Notes:
         
@@ -116,12 +119,15 @@
         Tradeoffs
         ---------
         
-        There are just two parameters you can use to tune Annoy: the number of 
trees ``n_trees`` and the number of nodes to inspect during searching 
``search_k``.
+        There are just two main parameters needed to tune Annoy: the number of 
trees ``n_trees`` and the number of nodes to inspect during searching 
``search_k``.
         
         * ``n_trees`` is provided during build time and affects the build time 
and the index size. A larger value will give more accurate results, but larger 
indexes.
         * ``search_k`` is provided in runtime and affects the search 
performance. A larger value will give more accurate results, but will take 
longer time to return.
         
-        If ``search_k`` is not provided, it will default to ``n * n_trees`` 
where ``n`` is the number of approximate nearest neighbors. Otherwise, 
``search_k`` and ``n_trees`` are roughly independent, i.e. a the value of 
``n_trees`` will not affect search time if ``search_k`` is held constant and 
vice versa. Basically it's recommended to set ``n_trees`` as large as possible 
given the amount of memory you can afford, and it's recommended to set 
``search_k`` as large as possible given the time constraints you have for the 
queries.
+        If ``search_k`` is not provided, it will default to ``n * n_trees * 
D`` where ``n`` is the number of approximate nearest neighbors and ``D`` is a 
constant depending on the metric. Otherwise, ``search_k`` and ``n_trees`` are 
roughly independent, i.e. the value of ``n_trees`` will not affect search 
time if ``search_k`` is held constant and vice versa. Basically it's 
recommended to set ``n_trees`` as large as possible given the amount of memory 
you can afford, and it's recommended to set ``search_k`` as large as possible 
given the time constraints you have for the queries.
+        
+        You can also accept slower search times in favour of reduced loading 
times, memory usage, and disk IO. On supported platforms the index is 
prefaulted during ``load`` and ``save``, causing the file to be pre-emptively 
read from disk into memory. If you set ``prefault`` to ``False``, pages of the 
mmapped index are instead read from disk and cached in memory on-demand, as 
necessary for a search to complete. This can significantly increase early 
search times but may be better suited for systems with low memory compared to 
index size, when few queries are executed against a loaded index, and/or when 
large areas of the index are unlikely to be relevant to search queries.
+        
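
Under the hood, the ``prefault`` behaviour described above boils down to the ``MAP_POPULATE`` mmap flag. A minimal stdlib sketch, assuming a POSIX system (Python only exposes ``MAP_POPULATE`` on Linux, and only in recent versions, hence the ``hasattr`` guard):

```python
import mmap
import os
import tempfile

# Write a small dummy "index" file to map.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * 4096)
os.close(fd)

fd = os.open(path, os.O_RDONLY)
flags = mmap.MAP_SHARED
if hasattr(mmap, "MAP_POPULATE"):   # analogue of load(fn, prefault=True)
    flags |= mmap.MAP_POPULATE      # pre-read every page into RAM up front
m = mmap.mmap(fd, 4096, flags=flags, prot=mmap.PROT_READ)
first_bytes = m[:16]                # without MAP_POPULATE this may page-fault
m.close()
os.close(fd)
os.unlink(path)
```

Without the flag, each page is faulted in from disk the first time a search touches it, which is exactly the on-demand behaviour the paragraph above describes.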
         
         How does it work
         ----------------
@@ -132,6 +138,10 @@
         
         Hamming distance (contributed by `Martin Aumüller 
<https://github.com/maumueller>`__) packs the data into 64-bit integers under 
the hood and uses built-in bit count primitives so it could be quite fast. All 
splits are axis-aligned.
         
+        Dot Product distance (contributed by `Peter Sobot 
<https://github.com/psobot>`__) reduces the provided vectors from dot (or 
"inner-product") space to a more query-friendly cosine space using `a method by 
Bachrach et al., at Microsoft Research, published in 2014 
<https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/XboxInnerProduct.pdf>`__.
+        
+        
+        
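
The reduction can be sketched in a few lines of plain Python (an illustrative re-derivation, not annoy's code): every vector gains one extra coordinate ``sqrt(max_norm^2 - ||x||^2)``, after which all vectors share the same norm, so dot-product ranking against a query padded with a trailing zero coincides with cosine ranking.

```python
import math
import random

def reduce_to_cosine(vectors):
    # Bachrach et al. (2014): append sqrt(max_norm^2 - ||x||^2) to each vector.
    norms = [math.sqrt(sum(a * a for a in v)) for v in vectors]
    max_norm = max(norms)
    augmented = [v + [math.sqrt(max(0.0, max_norm ** 2 - n ** 2))]
                 for v, n in zip(vectors, norms)]
    return augmented, max_norm

random.seed(1)
vecs = [[random.gauss(0, 1) for _ in range(4)] for _ in range(10)]
augmented, max_norm = reduce_to_cosine(vecs)
# Every augmented vector now lies on a sphere of radius max_norm, and a
# query q padded with a trailing 0 keeps its dot products unchanged.
```

This is the same transformation the ``DotProduct::preprocess`` step in the C++ diff below performs, storing the extra coordinate in ``dot_factor``.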
         More info
         ---------
         
@@ -146,7 +156,7 @@
         * Radim Řehůřek's blog posts comparing Annoy to a couple of other 
similar Python libraries: `Intro 
<http://radimrehurek.com/2013/11/performance-shootout-of-nearest-neighbours-intro/>`__,
 `Contestants 
<http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/>`__,
 `Querying 
<http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/>`__
         * `ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`__ is a 
benchmark for several approximate nearest neighbor libraries. Annoy seems to be 
fairly competitive, especially at higher precisions:
         
-        .. figure:: 
https://raw.github.com/erikbern/ann-benchmarks/master/results/glove.png
+        .. figure:: 
https://github.com/erikbern/ann-benchmarks/raw/master/results/glove-100-angular.png
            :alt: ANN benchmarks
            :align: center
            :target: https://github.com/erikbern/ann-benchmarks
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' 
'--exclude=.svnignore' old/annoy-1.12.0/setup.py new/annoy-1.15.1/setup.py
--- old/annoy-1.12.0/setup.py   2018-05-07 05:09:53.000000000 +0200
+++ new/annoy-1.15.1/setup.py   2019-02-22 17:11:27.000000000 +0100
@@ -18,6 +18,7 @@
 from setuptools import setup, Extension
 import codecs
 import os
+import platform
 import sys
 
 readme_note = """\
@@ -34,33 +35,37 @@
 with codecs.open('README.rst', encoding='utf-8') as fobj:
     long_description = readme_note + fobj.read()
 
+# Various platform-dependent extras
+extra_compile_args = []
+extra_link_args = []
+
 if os.environ.get('TRAVIS') == 'true':
     # Resolving some annoying issue
-    travis_extra_compile_args = ['-mno-avx']
-else:
-    travis_extra_compile_args = []
+    extra_compile_args += ['-mno-avx']
 
 # Not all CPUs have march as a tuning parameter
-import platform
 cputune = ['-march=native',]
-if platform.machine() == "ppc64le":
-    cputune = ['-mcpu=native',]
+if platform.machine() == 'ppc64le':
+    extra_compile_args += ['-mcpu=native',]
 
 if os.name != 'nt':
-    compile_args = ['-O3', '-ffast-math', '-fno-associative-math']
-else:
-    compile_args = []
-    cputune = []
+    extra_compile_args += ['-O3', '-ffast-math', '-fno-associative-math']
+
+# #349: something with OS X Mojave causes libstd not to be found
+if platform.system() == 'Darwin':
+    extra_compile_args += ['-std=c++11', '-mmacosx-version-min=10.9']
+    extra_link_args += ['-stdlib=libc++', '-mmacosx-version-min=10.9']
 
 setup(name='annoy',
-      version='1.12.0',
+      version='1.15.1',
       description='Approximate Nearest Neighbors in C++/Python optimized for 
memory usage and loading/saving to disk.',
       packages=['annoy'],
       ext_modules=[
         Extension(
             'annoy.annoylib', ['src/annoymodule.cc'],
             depends=['src/annoylib.h', 'src/kissrandom.h', 'src/mman.h'],
-            extra_compile_args=compile_args + cputune + 
travis_extra_compile_args
+            extra_compile_args=extra_compile_args,
+            extra_link_args=extra_link_args,
         )
       ],
       long_description=long_description,
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' 
'--exclude=.svnignore' old/annoy-1.12.0/src/annoylib.h 
new/annoy-1.15.1/src/annoylib.h
--- old/annoy-1.12.0/src/annoylib.h     2018-05-07 04:38:57.000000000 +0200
+++ new/annoy-1.15.1/src/annoylib.h     2019-02-21 17:39:33.000000000 +0100
@@ -17,7 +17,6 @@
 #define ANNOYLIB_H
 
 #include <stdio.h>
-#include <string>
 #include <sys/stat.h>
 #ifndef _MSC_VER
 #include <unistd.h>
@@ -30,16 +29,19 @@
 #if defined(_MSC_VER) && _MSC_VER == 1500
 typedef unsigned char     uint8_t;
 typedef signed __int32    int32_t;
+typedef unsigned __int64  uint64_t;
 #else
 #include <stdint.h>
 #endif
 
-#ifdef _MSC_VER
-#define NOMINMAX
-#include "mman.h"
-#include <windows.h>
+#if defined(_MSC_VER) || defined(__MINGW32__)
+ #ifndef NOMINMAX
+  #define NOMINMAX
+ #endif
+ #include "mman.h"
+ #include <windows.h>
 #else
-#include <sys/mman.h>
+ #include <sys/mman.h>
 #endif
 
 #include <string.h>
@@ -65,6 +67,9 @@
 
 #ifndef _MSC_VER
 #define popcount __builtin_popcountll
+#elif _MSC_VER == 1500
+#define isnan(x) _isnan(x)
+#define popcount cole_popcount
 #else
 #define popcount __popcnt64
 #endif
@@ -94,13 +99,31 @@
 
 
 using std::vector;
-using std::string;
 using std::pair;
 using std::numeric_limits;
 using std::make_pair;
 
+inline void* remap_memory(void* _ptr, int _fd, size_t old_size, size_t 
new_size) {
+#ifdef __linux__
+  _ptr = mremap(_ptr, old_size, new_size, MREMAP_MAYMOVE);
+#else
+  munmap(_ptr, old_size);
+#ifdef MAP_POPULATE
+  _ptr = mmap(_ptr, new_size, PROT_READ | PROT_WRITE, MAP_SHARED | 
MAP_POPULATE, _fd, 0);
+#else
+  _ptr = mmap(_ptr, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, _fd, 0);
+#endif
+#endif
+  return _ptr;
+}
+
 namespace {
 
+template<typename S, typename Node>
+inline Node* get_node_ptr(const void* _nodes, const size_t _s, const S i) {
+  return (Node*)((uint8_t *)_nodes + (_s * i));
+}
+
 template<typename T>
 inline T dot(const T* x, const T* y, int f) {
   T s = 0;
@@ -120,6 +143,19 @@
   return d;
 }
 
+template<typename T>
+inline T euclidean_distance(const T* x, const T* y, int f) {
+  // Don't use dot-product: avoid catastrophic cancellation in #314.
+  T d = 0.0;
+  for (int i = 0; i < f; ++i) {
+    const T tmp=*x - *y;
+    d += tmp * tmp;
+    ++x;
+    ++y;
+  }
+  return d;
+}
+
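
The motivation for this rewrite (issue #314) is easy to reproduce in any floating-point language; a small illustrative example in Python:

```python
def dist_expanded(x, y):
    # Old formulation: ||x||^2 + ||y||^2 - 2<x, y>.
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    return dot(x, x) + dot(y, y) - 2 * dot(x, y)

def dist_direct(x, y):
    # New formulation from the hunk above: sum of squared differences.
    return sum((p - q) ** 2 for p, q in zip(x, y))

x = [1e8, 1e8]
y = [1e8, 1e8 + 1.0]
# True squared distance is exactly 1.0. The direct form preserves it; the
# expanded form subtracts two huge, nearly equal numbers and the answer is
# lost to catastrophic cancellation.
```

The direct form costs the same number of arithmetic operations but never forms the large intermediate values that cancel.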
 #ifdef USE_AVX
 // Horizontal single sum of 256bit vector.
 inline float hsum256_ps_avx(__m256 v) {
@@ -177,6 +213,30 @@
   return result;
 }
 
+template<>
+inline float euclidean_distance<float>(const float* x, const float* y, int f) {
+  float result=0;
+  if (f > 7) {
+    __m256 d = _mm256_setzero_ps();
+    for (; f > 7; f -= 8) {
+      const __m256 diff = _mm256_sub_ps(_mm256_loadu_ps(x), 
_mm256_loadu_ps(y));
+      d = _mm256_add_ps(d, _mm256_mul_ps(diff, diff)); // no support for fmadd 
in AVX...
+      x += 8;
+      y += 8;
+    }
+    // Sum all floats in dot register.
+    result = hsum256_ps_avx(d);
+  }
+  // Don't forget the remaining values.
+  for (; f > 0; f--) {
+    float tmp = *x - *y;
+    result += tmp * tmp;
+    x++;
+    y++;
+  }
+  return result;
+}
+
 #endif
 
  
@@ -185,15 +245,6 @@
   return sqrt(dot(v, v, f));
 }
 
-template<typename T>
-inline void normalize(T* v, int f) {
-  T norm = get_norm(v, f);
-  if (norm > 0) {
-    for (int z = 0; z < f; z++)
-      v[z] /= norm;
-  }
-}
-
 template<typename T, typename Random, typename Distance, typename Node>
 inline void two_means(const vector<Node*>& nodes, int f, Random& random, bool 
cosine, Node* p, Node* q) {
   /*
@@ -208,9 +259,11 @@
   size_t i = random.index(count);
   size_t j = random.index(count-1);
   j += (j >= i); // ensure that i != j
-  memcpy(p->v, nodes[i]->v, f * sizeof(T));
-  memcpy(q->v, nodes[j]->v, f * sizeof(T));
-  if (cosine) { normalize(p->v, f); normalize(q->v, f); }
+
+  Distance::template copy_node<T, Node>(p, nodes[i], f);
+  Distance::template copy_node<T, Node>(q, nodes[j], f);
+
+  if (cosine) { Distance::template normalize<T, Node>(p, f); 
Distance::template normalize<T, Node>(q, f); }
   Distance::init_node(p, f);
   Distance::init_node(q, f);
 
@@ -225,21 +278,47 @@
     }
     if (di < dj) {
       for (int z = 0; z < f; z++)
-       p->v[z] = (p->v[z] * ic + nodes[k]->v[z] / norm) / (ic + 1);
+        p->v[z] = (p->v[z] * ic + nodes[k]->v[z] / norm) / (ic + 1);
       Distance::init_node(p, f);
       ic++;
     } else if (dj < di) {
       for (int z = 0; z < f; z++)
-       q->v[z] = (q->v[z] * jc + nodes[k]->v[z] / norm) / (jc + 1);
+        q->v[z] = (q->v[z] * jc + nodes[k]->v[z] / norm) / (jc + 1);
       Distance::init_node(q, f);
       jc++;
     }
   }
 }
-
 } // namespace
 
-struct Angular {
+struct Base {
+  template<typename T, typename S, typename Node>
+  static inline void preprocess(void* nodes, size_t _s, const S node_count, 
const int f) {
+    // Override this in specific metric structs below if you need to do any 
pre-processing
+    // on the entire set of nodes passed into this index.
+  }
+
+  template<typename Node>
+  static inline void zero_value(Node* dest) {
+    // Initialize any fields that require sane defaults within this node.
+  }
+
+  template<typename T, typename Node>
+  static inline void copy_node(Node* dest, const Node* source, const int f) {
+    memcpy(dest->v, source->v, f * sizeof(T));
+  }
+
+  template<typename T, typename Node>
+  static inline void normalize(Node* node, int f) {
+    T norm = get_norm(node->v, f);
+    if (norm > 0) {
+      for (int z = 0; z < f; z++)
+        node->v[z] /= norm;
+    }
+  }
+};
+
+struct Angular : Base {
   template<typename S, typename T>
   struct ANNOY_NODE_ATTRIBUTE Node {
     /*
@@ -294,7 +373,7 @@
     two_means<T, Random, Angular, Node<S, T> >(nodes, f, random, true, p, q);
     for (int z = 0; z < f; z++)
       n->v[z] = p->v[z] - q->v[z];
-    normalize(n->v, f);
+    Base::normalize<T, Node<S, T> >(n, f);
     free(p);
     free(q);
   }
@@ -324,7 +403,122 @@
   }
 };
 
-struct Hamming {
+
+struct DotProduct : Angular {
+  template<typename S, typename T>
+  struct ANNOY_NODE_ATTRIBUTE Node {
+    /*
+     * This is an extension of the Angular node with an extra attribute for 
the scaled norm.
+     */
+    S n_descendants;
+    S children[2]; // Will possibly store more than 2
+    T dot_factor;
+    T v[1]; // We let this one overflow intentionally. Need to allocate at 
least 1 to make GCC happy
+  };
+
+  static const char* name() {
+    return "dot";
+  }
+  template<typename S, typename T>
+  static inline T distance(const Node<S, T>* x, const Node<S, T>* y, int f) {
+    return -dot(x->v, y->v, f);
+  }
+
+  template<typename Node>
+  static inline void zero_value(Node* dest) {
+    dest->dot_factor = 0;
+  }
+
+  template<typename S, typename T>
+  static inline void init_node(Node<S, T>* n, int f) {
+  }
+
+  template<typename T, typename Node>
+  static inline void copy_node(Node* dest, const Node* source, const int f) {
+    memcpy(dest->v, source->v, f * sizeof(T));
+    dest->dot_factor = source->dot_factor;
+  }
+
+  template<typename S, typename T, typename Random>
+  static inline void create_split(const vector<Node<S, T>*>& nodes, int f, 
size_t s, Random& random, Node<S, T>* n) {
+    Node<S, T>* p = (Node<S, T>*)malloc(s); // TODO: avoid
+    Node<S, T>* q = (Node<S, T>*)malloc(s); // TODO: avoid
+    DotProduct::zero_value(p); 
+    DotProduct::zero_value(q);
+    two_means<T, Random, DotProduct, Node<S, T> >(nodes, f, random, true, p, 
q);
+    for (int z = 0; z < f; z++)
+      n->v[z] = p->v[z] - q->v[z];
+    n->dot_factor = p->dot_factor - q->dot_factor;
+    DotProduct::normalize<T, Node<S, T> >(n, f);
+    free(p);
+    free(q);
+  }
+
+  template<typename T, typename Node>
+  static inline void normalize(Node* node, int f) {
+    T norm = sqrt(dot(node->v, node->v, f) + pow(node->dot_factor, 2));
+    if (norm > 0) {
+      for (int z = 0; z < f; z++)
+        node->v[z] /= norm;
+      node->dot_factor /= norm;
+    }
+  }
+
+  template<typename S, typename T>
+  static inline T margin(const Node<S, T>* n, const T* y, int f) {
+    return dot(n->v, y, f) + (n->dot_factor * n->dot_factor);
+  }
+
+  template<typename S, typename T, typename Random>
+  static inline bool side(const Node<S, T>* n, const T* y, int f, Random& 
random) {
+    T dot = margin(n, y, f);
+    if (dot != 0)
+      return (dot > 0);
+    else
+      return random.flip();
+  }
+
+  template<typename T>
+  static inline T normalized_distance(T distance) {
+    return -distance;
+  }
+
+  template<typename T, typename S, typename Node>
+  static inline void preprocess(void* nodes, size_t _s, const S node_count, 
const int f) {
+    // This uses a method from Microsoft Research for transforming inner 
product spaces to cosine/angular-compatible spaces.
+    // (Bachrach et al., 2014, see 
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/XboxInnerProduct.pdf)
+
+    // Step one: compute the norm of each vector and store that in its extra 
dimension (f-1)
+    for (S i = 0; i < node_count; i++) {
+      Node* node = get_node_ptr<S, Node>(nodes, _s, i);
+      T norm = sqrt(dot(node->v, node->v, f));
+      if (isnan(norm)) norm = 0;
+      node->dot_factor = norm;
+    }
+
+    // Step two: find the maximum norm
+    T max_norm = 0;
+    for (S i = 0; i < node_count; i++) {
+      Node* node = get_node_ptr<S, Node>(nodes, _s, i);
+      if (node->dot_factor > max_norm) {
+        max_norm = node->dot_factor;
+      }
+    }
+
+    // Step three: set each vector's extra dimension to sqrt(max_norm^2 - 
norm^2)
+    for (S i = 0; i < node_count; i++) {
+      Node* node = get_node_ptr<S, Node>(nodes, _s, i);
+      T node_norm = node->dot_factor;
+
+      T dot_factor = sqrt(pow(max_norm, static_cast<T>(2.0)) - pow(node_norm, 
static_cast<T>(2.0)));
+      if (isnan(dot_factor)) dot_factor = 0;
+
+      node->dot_factor = dot_factor;
+    }
+  }
+};
+
+struct Hamming : Base {
   template<typename S, typename T>
   struct ANNOY_NODE_ATTRIBUTE Node {
     S n_descendants;
@@ -343,6 +537,17 @@
   static inline T pq_initial_value() {
     return numeric_limits<T>::max();
   }
+  template<typename T>
+  static inline int cole_popcount(T v) {
+    // Note: Only used with MSVC 9, which lacks intrinsics and fails to
+    // calculate std::bitset::count for v > 32bit. Uses the generalized
+    // approach by Eric Cole.
+    // See https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64
+    v = v - ((v >> 1) & (T)~(T)0/3);
+    v = (v & (T)~(T)0/15*3) + ((v >> 2) & (T)~(T)0/15*3);
+    v = (v + (v >> 4)) & (T)~(T)0/255*15;
+    return (T)(v * ((T)~(T)0/255)) >> (sizeof(T) - 1) * 8;
+  }
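
The SWAR bit count added here can be mirrored in Python for a 64-bit word (illustrative only; Python integers are unbounded, so the masking emulates ``uint64`` wrap-around):

```python
def cole_popcount64(v):
    # Generalized SWAR popcount (Eric Cole / Stanford bithacks), T = uint64.
    M = (1 << 64) - 1                                      # ~(T)0 for 64-bit T
    v &= M
    v = v - ((v >> 1) & (M // 3))                          # 2-bit partial sums
    v = (v & (M // 15 * 3)) + ((v >> 2) & (M // 15 * 3))   # 4-bit partial sums
    v = (v + (v >> 4)) & (M // 255 * 15)                   # 8-bit partial sums
    return ((v * (M // 255)) & M) >> 56                    # fold into top byte
```

For any 64-bit input this agrees with ``bin(v).count("1")``, which is why it serves as a portable fallback where neither ``__builtin_popcountll`` nor ``__popcnt64`` is available.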
   template<typename S, typename T>
   static inline T distance(const Node<S, T>* x, const Node<S, T>* y, int f) {
     size_t dist = 0;
@@ -408,15 +613,13 @@
   }
 };
 
-struct Minkowski {
+
+struct Minkowski : Base {
   template<typename S, typename T>
   struct ANNOY_NODE_ATTRIBUTE Node {
     S n_descendants;
     T a; // need an extra constant term to determine the offset of the plane
-    union {
-      S children[2];
-      T norm;
-    };
+    S children[2];
     T v[1];
   };
   template<typename S, typename T>
@@ -444,13 +647,10 @@
 };
 
 
-struct Euclidean : Minkowski{
+struct Euclidean : Minkowski {
   template<typename S, typename T>
   static inline T distance(const Node<S, T>* x, const Node<S, T>* y, int f) {
-    T pp = x->norm ? x->norm : dot(x->v, x->v, f); // For backwards 
compatibility reasons, we need to fall back and compute the norm here
-    T qq = y->norm ? y->norm : dot(y->v, y->v, f);
-    T pq = dot(x->v, y->v, f);
-    return pp + qq - 2*pq;
+    return euclidean_distance(x->v, y->v, f);    
   }
   template<typename S, typename T, typename Random>
   static inline void create_split(const vector<Node<S, T>*>& nodes, int f, 
size_t s, Random& random, Node<S, T>* n) {
@@ -460,7 +660,7 @@
 
     for (int z = 0; z < f; z++)
       n->v[z] = p->v[z] - q->v[z];
-    normalize(n->v, f);
+    Base::normalize<T, Node<S, T> >(n, f);
     n->a = 0.0;
     for (int z = 0; z < f; z++)
       n->a += -n->v[z] * (p->v[z] + q->v[z]) / 2;
@@ -473,14 +673,14 @@
   }
   template<typename S, typename T>
   static inline void init_node(Node<S, T>* n, int f) {
-    n->norm = dot(n->v, n->v, f);
   }
   static const char* name() {
     return "euclidean";
   }
+
 };
 
-struct Manhattan : Minkowski{
+struct Manhattan : Minkowski {
   template<typename S, typename T>
   static inline T distance(const Node<S, T>* x, const Node<S, T>* y, int f) {
     return manhattan_distance(x->v, y->v, f);
@@ -493,7 +693,7 @@
 
     for (int z = 0; z < f; z++)
       n->v[z] = p->v[z] - q->v[z];
-    normalize(n->v, f);
+    Base::normalize<T, Node<S, T> >(n, f);
     n->a = 0.0;
     for (int z = 0; z < f; z++)
       n->a += -n->v[z] * (p->v[z] + q->v[z]) / 2;
@@ -519,16 +719,18 @@
   virtual void add_item(S item, const T* w) = 0;
   virtual void build(int q) = 0;
   virtual void unbuild() = 0;
-  virtual bool save(const char* filename) = 0;
+  virtual bool save(const char* filename, bool prefault=false) = 0;
   virtual void unload() = 0;
-  virtual bool load(const char* filename) = 0;
-  virtual T get_distance(S i, S j) = 0;
-  virtual void get_nns_by_item(S item, size_t n, size_t search_k, vector<S>* 
result, vector<T>* distances) = 0;
-  virtual void get_nns_by_vector(const T* w, size_t n, size_t search_k, 
vector<S>* result, vector<T>* distances) = 0;
-  virtual S get_n_items() = 0;
+  virtual bool load(const char* filename, bool prefault=false) = 0;
+  virtual T get_distance(S i, S j) const = 0;
+  virtual void get_nns_by_item(S item, size_t n, size_t search_k, vector<S>* 
result, vector<T>* distances) const = 0;
+  virtual void get_nns_by_vector(const T* w, size_t n, size_t search_k, 
vector<S>* result, vector<T>* distances) const = 0;
+  virtual S get_n_items() const = 0;
+  virtual S get_n_trees() const = 0;
   virtual void verbose(bool v) = 0;
-  virtual void get_item(S item, T* v) = 0;
+  virtual void get_item(S item, T* v) const = 0;
   virtual void set_seed(int q) = 0;
+  virtual bool on_disk_build(const char* filename) = 0;
 };
 
 template<typename S, typename T, typename Distance, typename Random>
@@ -557,12 +759,13 @@
   bool _loaded;
   bool _verbose;
   int _fd;
+  bool _on_disk;
 public:
 
   AnnoyIndex(int f) : _f(f), _random() {
-    _s = offsetof(Node, v) + f * sizeof(T); // Size of each node
+    _s = offsetof(Node, v) + _f * sizeof(T); // Size of each node
     _verbose = false;
-    _K = (_s - offsetof(Node, children)) / sizeof(S); // Max number of 
descendants to fit into node
+    _K = (S) (((size_t) (_s - offsetof(Node, children))) / sizeof(S)); // Max 
number of descendants to fit into node
     reinitialize(); // Reset everything
   }
   ~AnnoyIndex() {
@@ -582,24 +785,47 @@
     _allocate_size(item + 1);
     Node* n = _get(item);
 
+    D::zero_value(n);
+
     n->children[0] = 0;
     n->children[1] = 0;
     n->n_descendants = 1;
 
     for (int z = 0; z < _f; z++)
       n->v[z] = w[z];
+
     D::init_node(n, _f);
 
     if (item >= _n_items)
       _n_items = item + 1;
   }
-
+    
+  bool on_disk_build(const char* file) {
+    _on_disk = true;
+    _fd = open(file, O_RDWR | O_CREAT | O_TRUNC, (int) 0600);
+    if (_fd == -1) {
+      _fd = 0;
+      return false;
+    }
+    _nodes_size = 1;
+    ftruncate(_fd, _s * _nodes_size);
+#ifdef MAP_POPULATE
+    _nodes = (Node*) mmap(0, _s * _nodes_size, PROT_READ | PROT_WRITE, 
MAP_SHARED | MAP_POPULATE, _fd, 0);
+#else
+    _nodes = (Node*) mmap(0, _s * _nodes_size, PROT_READ | PROT_WRITE, 
MAP_SHARED, _fd, 0);
+#endif
+    return true;
+  }
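
The on-disk build pattern above — back the node array with a file, grow it with ``ftruncate()``, then remap — can be sketched with the stdlib, assuming Linux (Python's ``mmap.resize`` plays the role of ``remap_memory()``, i.e. ``mremap`` or ``munmap``+``mmap``; node layout and size here are made up for illustration):

```python
import mmap
import os
import struct
import tempfile

NODE = 16                                     # stand-in for _s (node size)
fd, path = tempfile.mkstemp()
os.ftruncate(fd, NODE)                        # _nodes_size starts at 1
m = mmap.mmap(fd, NODE, flags=mmap.MAP_SHARED)
for i in range(1, 8):                         # "adding items" forces growth
    m.resize((i + 1) * NODE)                  # ftruncate + remap in one step
    m[i * NODE:(i + 1) * NODE] = struct.pack("<2q", i, i * i)
m.flush()
node3 = struct.unpack("<2q", m[3 * NODE:4 * NODE])
file_size = os.fstat(fd).st_size
m.close()
os.close(fd)
os.unlink(path)
```

Because the mapping is ``MAP_SHARED``, every write lands in the file itself, which is why an index built this way needs no separate ``save()`` step.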
+    
   void build(int q) {
     if (_loaded) {
       // TODO: throw exception
       showUpdate("You can't build a loaded index\n");
       return;
     }
+
+    D::template preprocess<T, S, Node>(_nodes, _s, _n_items, _f);
+
     _n_nodes = _n_items;
     while (1) {
       if (q == -1 && _n_nodes >= _n_items * 2)
@@ -610,12 +836,13 @@
 
       vector<S> indices;
       for (S i = 0; i < _n_items; i++) {
-       if (_get(i)->n_descendants >= 1) // Issue #223
+             if (_get(i)->n_descendants >= 1) // Issue #223
           indices.push_back(i);
       }
 
       _roots.push_back(_make_tree(indices, true));
     }
+
     // Also, copy the roots into the last segment of the array
     // This way we can load them faster without reading the whole file
     _allocate_size(_n_nodes + (S)_roots.size());
@@ -624,6 +851,12 @@
     _n_nodes += _roots.size();
 
     if (_verbose) showUpdate("has %d nodes\n", _n_nodes);
+    
+    if (_on_disk) {
+      _nodes = remap_memory(_nodes, _fd, _s * _nodes_size, _s * _n_nodes);
+      ftruncate(_fd, _s * _n_nodes);
+      _nodes_size = _n_nodes;
+    }
   }
   
   void unbuild() {
@@ -636,16 +869,23 @@
     _n_nodes = _n_items;
   }
 
-  bool save(const char* filename) {
-    FILE *f = fopen(filename, "wb");
-    if (f == NULL)
-      return false;
+  bool save(const char* filename, bool prefault=false) {
+    if (_on_disk) {
+      return true;
+    } else {
+      // Delete file if it already exists (See issue #335)
+      unlink(filename);
+
+      FILE *f = fopen(filename, "wb");
+      if (f == NULL)
+        return false;
 
-    fwrite(_nodes, _s, _n_nodes, f);
-    fclose(f);
+      fwrite(_nodes, _s, _n_nodes, f);
+      fclose(f);
 
-    unload();
-    return load(filename);
+      unload();
+      return load(filename, prefault);
+    }
   }
 
   void reinitialize() {
@@ -655,24 +895,29 @@
     _n_items = 0;
     _n_nodes = 0;
     _nodes_size = 0;
+    _on_disk = false;
     _roots.clear();
   }
 
   void unload() {
-    if (_fd) {
-      // we have mmapped data
+    if (_on_disk && _fd) {
       close(_fd);
-      off_t size = _n_nodes * _s;
-      munmap(_nodes, size);
-    } else if (_nodes) {
-      // We have heap allocated data
-      free(_nodes);
+      munmap(_nodes, _s * _nodes_size);
+    } else {
+      if (_fd) {
+        // we have mmapped data
+        close(_fd);
+        munmap(_nodes, _n_nodes * _s);
+      } else if (_nodes) {
+        // We have heap allocated data
+        free(_nodes);
+      }
     }
     reinitialize();
     if (_verbose) showUpdate("unloaded\n");
   }
 
-  bool load(const char* filename) {
+  bool load(const char* filename, bool prefault=false) {
     _fd = open(filename, O_RDONLY, (int)0400);
     if (_fd == -1) {
       _fd = 0;
@@ -680,8 +925,9 @@
     }
     off_t size = lseek(_fd, 0, SEEK_END);
 #ifdef MAP_POPULATE
+    const int populate = prefault ? MAP_POPULATE : 0;
     _nodes = (Node*)mmap(
-        0, size, PROT_READ, MAP_SHARED | MAP_POPULATE, _fd, 0);
+        0, size, PROT_READ, MAP_SHARED | populate, _fd, 0);
 #else
     _nodes = (Node*)mmap(
         0, size, PROT_READ, MAP_SHARED, _fd, 0);
@@ -710,28 +956,34 @@
     return true;
   }
 
-  T get_distance(S i, S j) {
+  T get_distance(S i, S j) const {
     return D::normalized_distance(D::distance(_get(i), _get(j), _f));
   }
 
-  void get_nns_by_item(S item, size_t n, size_t search_k, vector<S>* result, 
vector<T>* distances) {
+  void get_nns_by_item(S item, size_t n, size_t search_k, vector<S>* result, 
vector<T>* distances) const {
     const Node* m = _get(item);
     _get_all_nns(m->v, n, search_k, result, distances);
   }
 
-  void get_nns_by_vector(const T* w, size_t n, size_t search_k, vector<S>* 
result, vector<T>* distances) {
+  void get_nns_by_vector(const T* w, size_t n, size_t search_k, vector<S>* 
result, vector<T>* distances) const {
     _get_all_nns(w, n, search_k, result, distances);
   }
-  S get_n_items() {
+
+  S get_n_items() const {
     return _n_items;
   }
+
+  S get_n_trees() const {
+    return _roots.size();
+  }
+
   void verbose(bool v) {
     _verbose = v;
   }
 
-  void get_item(S item, T* v) {
+  void get_item(S item, T* v) const {
     Node* m = _get(item);
-    memcpy(v, m->v, _f * sizeof(T));
+    memcpy(v, m->v, (_f) * sizeof(T));
   }
 
   void set_seed(int seed) {
@@ -742,17 +994,24 @@
   void _allocate_size(S n) {
     if (n > _nodes_size) {
       const double reallocation_factor = 1.3;
-      S new_nodes_size = std::max(n,
-                                 (S)((_nodes_size + 1) * reallocation_factor));
-      if (_verbose) showUpdate("Reallocating to %d nodes\n", new_nodes_size);
-      _nodes = realloc(_nodes, _s * new_nodes_size);
-      memset((char *)_nodes + (_nodes_size * _s)/sizeof(char), 0, 
(new_nodes_size - _nodes_size) * _s);
+      S new_nodes_size = std::max(n, (S) ((_nodes_size + 1) * 
reallocation_factor));
+      void *old = _nodes;
+      
+      if (_on_disk) {
+        ftruncate(_fd, _s * new_nodes_size);
+        _nodes = remap_memory(_nodes, _fd, _s * _nodes_size, _s * 
new_nodes_size);
+      } else {
+        _nodes = realloc(_nodes, _s * new_nodes_size);
+        memset((char *) _nodes + (_nodes_size * _s) / sizeof(char), 0, 
(new_nodes_size - _nodes_size) * _s);
+      }
+      
       _nodes_size = new_nodes_size;
+      if (_verbose) showUpdate("Reallocating to %d nodes: old_address=%p, 
new_address=%p\n", new_nodes_size, old, _nodes);
     }
   }
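
The ``reallocation_factor`` of 1.3 gives the usual amortised-O(1) geometric growth; a quick stdlib sketch of the schedule ``_allocate_size`` produces (helper name is ours):

```python
def next_size(n, nodes_size, factor=1.3):
    # Mirrors: new_nodes_size = max(n, (S)((_nodes_size + 1) * factor))
    return max(n, int((nodes_size + 1) * factor))

size, reallocs = 0, 0
while size < 1_000_000:          # keep appending one node at a time
    size = next_size(size + 1, size)
    reallocs += 1
# Geometric growth reaches a million nodes in a few dozen reallocations,
# instead of one realloc (and one full copy) per added node.
```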
 
-  inline Node* _get(S i) {
-    return (Node*)((uint8_t *)_nodes + (_s * i));
+  inline Node* _get(const S i) const {
+    return get_node_ptr<S, Node>(_nodes, _s, i);
   }
 
   S _make_tree(const vector<S >& indices, bool is_root) {
@@ -764,7 +1023,7 @@
     if (indices.size() == 1 && !is_root)
       return indices[0];
 
-    if (indices.size() <= (size_t)_K && (!is_root || _n_items <= (size_t)_K || 
indices.size() == 1)) {
+    if (indices.size() <= (size_t)_K && (!is_root || (size_t)_n_items <= 
(size_t)_K || indices.size() == 1)) {
       _allocate_size(_n_nodes + 1);
       S item = _n_nodes++;
       Node* m = _get(item);
@@ -773,7 +1032,9 @@
       // Using std::copy instead of a loop seems to resolve issues #3 and #13,
       // probably because gcc 4.8 goes overboard with optimizations.
       // Using memcpy instead of std::copy for MSVC compatibility. #235
-      memcpy(m->children, &indices[0], indices.size() * sizeof(S));
+      // Only copy when necessary to avoid crash in MSVC 9. #293
+      if (!indices.empty())
+        memcpy(m->children, &indices[0], indices.size() * sizeof(S));
       return item;
     }
 
@@ -795,11 +1056,16 @@
       if (n) {
         bool side = D::side(m, n->v, _f, _random);
         children_indices[side].push_back(j);
+      } else {
+        showUpdate("No node for index %d?\n", j);
       }
     }
 
     // If we didn't find a hyperplane, just randomize sides as a last option
     while (children_indices[0].size() == 0 || children_indices[1].size() == 0) 
{
+      if (_verbose)
+        showUpdate("\tNo hyperplane found (left has %ld children, right has 
%ld children)\n",
+          children_indices[0].size(), children_indices[1].size());
       if (_verbose && indices.size() > 100000)
         showUpdate("Failed splitting %lu items\n", indices.size());
 
@@ -820,9 +1086,10 @@
     int flip = (children_indices[0].size() > children_indices[1].size());
 
     m->n_descendants = is_root ? _n_items : (S)indices.size();
-    for (int side = 0; side < 2; side++)
+    for (int side = 0; side < 2; side++) {
       // run _make_tree for the smallest child first (for cache locality)
       m->children[side^flip] = _make_tree(children_indices[side^flip], false);
+    }
 
     _allocate_size(_n_nodes + 1);
     S item = _n_nodes++;
@@ -832,15 +1099,17 @@
     return item;
   }
 
-  void _get_all_nns(const T* v, size_t n, size_t search_k, vector<S>* result, 
vector<T>* distances) {
+  void _get_all_nns(const T* v, size_t n, size_t search_k, vector<S>* result, 
vector<T>* distances) const {
     Node* v_node = (Node *)malloc(_s); // TODO: avoid
-    memcpy(v_node->v, v, sizeof(T)*_f);
+    D::template zero_value<Node>(v_node);
+    memcpy(v_node->v, v, sizeof(T) * _f);
     D::init_node(v_node, _f);
 
     std::priority_queue<pair<T, S> > q;
 
-    if (search_k == (size_t)-1)
-      search_k = n * _roots.size(); // slightly arbitrary default value
+    if (search_k == (size_t)-1) {
+      search_k = n * _roots.size();
+    }
 
     for (size_t i = 0; i < _roots.size(); i++) {
       q.push(make_pair(Distance::template pq_initial_value<T>(), _roots[i]));
@@ -860,8 +1129,8 @@
         nns.insert(nns.end(), dst, &dst[nd->n_descendants]);
       } else {
         T margin = D::margin(nd, v, _f);
-        q.push(make_pair(D::pq_distance(d, margin, 1), nd->children[1]));
-        q.push(make_pair(D::pq_distance(d, margin, 0), nd->children[0]));
+        q.push(make_pair(D::pq_distance(d, margin, 1), 
static_cast<S>(nd->children[1])));
+        q.push(make_pair(D::pq_distance(d, margin, 0), 
static_cast<S>(nd->children[0])));
       }
     }
 
@@ -876,7 +1145,7 @@
         continue;
       last = j;
       if (_get(j)->n_descendants == 1)  // This is only to guard a really 
obscure case, #284
-       nns_dist.push_back(make_pair(D::distance(v_node, _get(j), _f), j));
+        nns_dist.push_back(make_pair(D::distance(v_node, _get(j), _f), j));
     }
 
     size_t m = nns_dist.size();
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' 
'--exclude=.svnignore' old/annoy-1.12.0/src/annoymodule.cc 
new/annoy-1.15.1/src/annoymodule.cc
--- old/annoy-1.12.0/src/annoymodule.cc 2018-02-07 02:41:53.000000000 +0100
+++ new/annoy-1.15.1/src/annoymodule.cc 2019-02-21 17:39:33.000000000 +0100
@@ -46,7 +46,7 @@
 private:
   int32_t _f_external, _f_internal;
   AnnoyIndex<int32_t, uint64_t, Hamming, Kiss64Random> _index;
-  void _pack(const float* src, uint64_t* dst) {
+  void _pack(const float* src, uint64_t* dst) const {
     for (int32_t i = 0; i < _f_internal; i++) {
       dst[i] = 0;
       for (int32_t j = 0; j < 64 && i*64+j < _f_external; j++) {
@@ -54,7 +54,7 @@
       }
     }
   };
-  void _unpack(const uint64_t* src, float* dst) {
+  void _unpack(const uint64_t* src, float* dst) const {
     for (int32_t i = 0; i < _f_external; i++) {
       dst[i] = (src[i / 64] >> (i % 64)) & 1;
     }
@@ -68,11 +68,11 @@
   };
   void build(int q) { _index.build(q); };
   void unbuild() { _index.unbuild(); };
-  bool save(const char* filename) { return _index.save(filename); };
+  bool save(const char* filename, bool prefault) { return _index.save(filename, prefault); };
   void unload() { _index.unload(); };
-  bool load(const char* filename) { return _index.load(filename); };
-  float get_distance(int32_t i, int32_t j) { return _index.get_distance(i, j); };
-  void get_nns_by_item(int32_t item, size_t n, size_t search_k, vector<int32_t>* result, vector<float>* distances) {
+  bool load(const char* filename, bool prefault) { return _index.load(filename, prefault); };
+  float get_distance(int32_t i, int32_t j) const { return _index.get_distance(i, j); };
+  void get_nns_by_item(int32_t item, size_t n, size_t search_k, vector<int32_t>* result, vector<float>* distances) const {
     if (distances) {
       vector<uint64_t> distances_internal;
       _index.get_nns_by_item(item, n, search_k, result, &distances_internal);
@@ -81,7 +81,7 @@
       _index.get_nns_by_item(item, n, search_k, result, NULL);
     }
   };
-  void get_nns_by_vector(const float* w, size_t n, size_t search_k, vector<int32_t>* result, vector<float>* distances) {
+  void get_nns_by_vector(const float* w, size_t n, size_t search_k, vector<int32_t>* result, vector<float>* distances) const {
     vector<uint64_t> w_internal(_f_internal, 0);
     _pack(w, &w_internal[0]);
     if (distances) {
@@ -92,14 +92,16 @@
       _index.get_nns_by_vector(&w_internal[0], n, search_k, result, NULL);
     }
   };
-  int32_t get_n_items() { return _index.get_n_items(); };
+  int32_t get_n_items() const { return _index.get_n_items(); };
+  int32_t get_n_trees() const { return _index.get_n_trees(); };
   void verbose(bool v) { _index.verbose(v); };
-  void get_item(int32_t item, float* v) {
+  void get_item(int32_t item, float* v) const {
     vector<uint64_t> v_internal(_f_internal, 0);
     _index.get_item(item, &v_internal[0]);
     _unpack(&v_internal[0], v);
   };
   void set_seed(int q) { _index.set_seed(q); };
+  bool on_disk_build(const char* filename) { return _index.on_disk_build(filename); };
 };
 
 // annoy python object
@@ -129,6 +131,8 @@
     self->ptr = new AnnoyIndex<int32_t, float, Manhattan, Kiss64Random>(self->f);
   } else if (!strcmp(metric, "hamming")) {
     self->ptr = new HammingWrapper(self->f);
+  } else if (!strcmp(metric, "dot")) {
+    self->ptr = new AnnoyIndex<int32_t, float, DotProduct, Kiss64Random>(self->f);
   } else {
     PyErr_SetString(PyExc_ValueError, "No such metric");
     return NULL;
@@ -145,7 +149,7 @@
   int f;
   static char const * kwlist[] = {"f", "metric", NULL};
   if (!PyArg_ParseTupleAndKeywords(args, kwargs, "i|s", (char**)kwlist, &f, &metric))
-    return NULL;
+    return (int) NULL;
   return 0;
 }
 
@@ -168,13 +172,14 @@
 py_an_load(py_annoy *self, PyObject *args, PyObject *kwargs) {
   char* filename;
   bool res = false;
+  bool prefault = false;
   if (!self->ptr) 
     return NULL;
-  static char const * kwlist[] = {"fn", NULL};
-  if (!PyArg_ParseTupleAndKeywords(args, kwargs, "s", (char**)kwlist, &filename))
+  static char const * kwlist[] = {"fn", "prefault", NULL};
+  if (!PyArg_ParseTupleAndKeywords(args, kwargs, "s|b", (char**)kwlist, &filename, &prefault))
     return NULL;
 
-  res = self->ptr->load(filename);
+  res = self->ptr->load(filename, prefault);
 
   if (!res) {
     PyErr_SetFromErrno(PyExc_IOError);
@@ -188,13 +193,14 @@
 py_an_save(py_annoy *self, PyObject *args, PyObject *kwargs) {
   char *filename;
   bool res = false;
+  bool prefault = false;
   if (!self->ptr) 
     return NULL;
-  static char const * kwlist[] = {"fn", NULL};
-  if (!PyArg_ParseTupleAndKeywords(args, kwargs, "s", (char**)kwlist, &filename))
+  static char const * kwlist[] = {"fn", "prefault", NULL};
+  if (!PyArg_ParseTupleAndKeywords(args, kwargs, "s|b", (char**)kwlist, &filename, &prefault))
     return NULL;
 
-  res = self->ptr->save(filename);
+  res = self->ptr->save(filename, prefault);
 
   if (!res) {
     PyErr_SetFromErrno(PyExc_IOError);
@@ -263,8 +269,16 @@
 
 bool
 convert_list_to_vector(PyObject* v, int f, vector<float>* w) {
+  if (PyObject_Size(v) == -1) {
+    char buf[256];
+    snprintf(buf, 256, "Expected an iterable, got an object of type \"%s\"", v->ob_type->tp_name);
+    PyErr_SetString(PyExc_ValueError, buf);
+    return false;
+  }
   if (PyObject_Size(v) != f) {
-    PyErr_SetString(PyExc_IndexError, "Vector has wrong length");
+    char buf[128];
+    snprintf(buf, 128, "Vector has wrong length (expected %d, got %ld)", f, PyObject_Size(v));
+    PyErr_SetString(PyExc_IndexError, buf);
     return false;
   }
   for (int z = 0; z < f; z++) {
@@ -350,6 +364,24 @@
   Py_RETURN_NONE;
 }
 
+static PyObject *
+py_an_on_disk_build(py_annoy *self, PyObject *args, PyObject *kwargs) {
+  char *filename;
+  bool res = false;
+  if (!self->ptr)
+    return NULL;
+  static char const * kwlist[] = {"fn", NULL};
+  if (!PyArg_ParseTupleAndKeywords(args, kwargs, "s", (char**)kwlist, &filename))
+    return NULL;
+
+  res = self->ptr->on_disk_build(filename);
+
+  if (!res) {
+    PyErr_SetFromErrno(PyExc_IOError);
+    return NULL;
+  }
+  Py_RETURN_TRUE;
+}
 
 static PyObject *
 py_an_build(py_annoy *self, PyObject *args, PyObject *kwargs) {
@@ -418,6 +450,14 @@
   return PyInt_FromLong(n);
 }
 
+static PyObject *
+py_an_get_n_trees(py_annoy *self) {
+  if (!self->ptr) 
+    return NULL;
+
+  int32_t n = self->ptr->get_n_trees();
+  return PyInt_FromLong(n);
+}
 
 static PyObject *
 py_an_verbose(py_annoy *self, PyObject *args) {
@@ -454,11 +494,13 @@
   {"get_nns_by_vector",(PyCFunction)py_an_get_nns_by_vector, METH_VARARGS | METH_KEYWORDS, "Returns the `n` closest items to vector `vector`.\n\n:param search_k: the query will inspect up to `search_k` nodes.\n`search_k` gives you a run-time tradeoff between better accuracy and speed.\n`search_k` defaults to `n_trees * n` if not provided.\n\n:param include_distances: If `True`, this function will return a\n2 element tuple of lists. The first list contains the `n` closest items.\nThe second list contains the corresponding distances."},
   {"get_item_vector",(PyCFunction)py_an_get_item_vector, METH_VARARGS, "Returns the vector for item `i` that was previously added."},
   {"add_item",(PyCFunction)py_an_add_item, METH_VARARGS | METH_KEYWORDS, "Adds item `i` (any nonnegative integer) with vector `v`.\n\nNote that it will allocate memory for `max(i)+1` items."},
+  {"on_disk_build",(PyCFunction)py_an_on_disk_build, METH_VARARGS | METH_KEYWORDS, "Build will be performed with storage on disk instead of RAM."},
   {"build",(PyCFunction)py_an_build, METH_VARARGS | METH_KEYWORDS, "Builds a forest of `n_trees` trees.\n\nMore trees give higher precision when querying. After calling `build`,\nno more items can be added."},
   {"unbuild",(PyCFunction)py_an_unbuild, METH_NOARGS, "Unbuilds the tree in order to allows adding new items.\n\nbuild() has to be called again afterwards in order to\nrun queries."},
   {"unload",(PyCFunction)py_an_unload, METH_NOARGS, "Unloads an index from disk."},
   {"get_distance",(PyCFunction)py_an_get_distance, METH_VARARGS, "Returns the distance between items `i` and `j`."},
   {"get_n_items",(PyCFunction)py_an_get_n_items, METH_NOARGS, "Returns the number of items in the index."},
+  {"get_n_trees",(PyCFunction)py_an_get_n_trees, METH_NOARGS, "Returns the number of trees in the index."},
   {"verbose",(PyCFunction)py_an_verbose, METH_VARARGS, ""},
   {"set_seed",(PyCFunction)py_an_set_seed, METH_VARARGS, "Sets the seed of Annoy's random number generator."},
   {NULL, NULL, 0, NULL}                 /* Sentinel */
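The `HammingWrapper` changes above keep round-tripping float vectors through packed 64-bit words via `_pack`/`_unpack`. A pure-Python sketch of that packing (assuming, since the inner loop body is elided in the hunk, that any nonzero coordinate counts as a set bit):

```python
def pack(src, f_external):
    # f_external 0/1 coordinates -> ceil(f/64) 64-bit words; bit j of
    # word i holds coordinate i*64 + j (mirrors _pack's loop bounds)
    f_internal = (f_external + 63) // 64
    dst = [0] * f_internal
    for i in range(f_internal):
        for j in range(64):
            if i * 64 + j < f_external and src[i * 64 + j] != 0:
                dst[i] |= 1 << j
    return dst

def unpack(src, f_external):
    # inverse of pack: dst[i] = (src[i / 64] >> (i % 64)) & 1
    return [float((src[i // 64] >> (i % 64)) & 1) for i in range(f_external)]

bits = [1.0, 0.0, 0.0, 1.0] * 20           # 80 dimensions -> 2 words
assert unpack(pack(bits, 80), 80) == bits  # lossless round trip
```

Packing 64 coordinates per word is what lets the Hamming index reuse the generic `AnnoyIndex<int32_t, uint64_t, ...>` machinery while exposing a float interface to Python.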
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/annoy-1.12.0/src/kissrandom.h new/annoy-1.15.1/src/kissrandom.h
--- old/annoy-1.12.0/src/kissrandom.h   2018-02-07 02:41:53.000000000 +0100
+++ new/annoy-1.15.1/src/kissrandom.h   2018-12-07 15:47:32.000000000 +0100
@@ -3,7 +3,7 @@
 
 #if defined(_MSC_VER) && _MSC_VER == 1500
 typedef unsigned __int32    uint32_t;
-typedef unsigned __int32    uint64_t;
+typedef unsigned __int64    uint64_t;
 #else
 #include <stdint.h>
 #endif
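The one-character fix above matters: on MSVC 2008 `uint64_t` had been typedef'd to a 32-bit type, so every value in the 64-bit `Kiss64Random` state was silently reduced mod 2**32. A quick illustration of what that truncation loses:

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

state = 0x0123456789ABCDEF        # a 64-bit generator state word

# With the broken typedef every store went through a 32-bit type,
# i.e. everything was reduced mod 2**32:
assert state & MASK32 == 0x89ABCDEF
assert state & MASK32 != state    # the high 32 bits are simply gone

# Shift-based mixing, as in KISS-style generators, diverges too:
assert (state << 13) & MASK64 != (state << 13) & MASK32
```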
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/annoy-1.12.0/src/mman.h new/annoy-1.15.1/src/mman.h
--- old/annoy-1.12.0/src/mman.h 2018-02-07 02:41:53.000000000 +0100
+++ new/annoy-1.15.1/src/mman.h 2019-02-21 17:39:33.000000000 +0100
@@ -85,7 +85,7 @@
     return desiredAccess;
 }
 
-void* mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off)
+inline void* mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off)
 {
     HANDLE fm, h;
     
@@ -156,7 +156,7 @@
     return map;
 }
 
-int munmap(void *addr, size_t len)
+inline int munmap(void *addr, size_t len)
 {
     if (UnmapViewOfFile(addr))
         return 0;
@@ -166,7 +166,7 @@
     return -1;
 }
 
-int mprotect(void *addr, size_t len, int prot)
+inline int mprotect(void *addr, size_t len, int prot)
 {
     DWORD newProtect = __map_mmap_prot_page(prot);
     DWORD oldProtect = 0;
@@ -179,7 +179,7 @@
     return -1;
 }
 
-int msync(void *addr, size_t len, int flags)
+inline int msync(void *addr, size_t len, int flags)
 {
     if (FlushViewOfFile(addr, len))
         return 0;
@@ -189,7 +189,7 @@
     return -1;
 }
 
-int mlock(const void *addr, size_t len)
+inline int mlock(const void *addr, size_t len)
 {
     if (VirtualLock((LPVOID)addr, len))
         return 0;
@@ -199,7 +199,7 @@
     return -1;
 }
 
-int munlock(const void *addr, size_t len)
+inline int munlock(const void *addr, size_t len)
 {
     if (VirtualUnlock((LPVOID)addr, len))
         return 0;
@@ -209,4 +209,28 @@
     return -1;
 }
 
+int ftruncate(int fd, unsigned int size) {
+    if (fd < 0) {
+        errno = EBADF;
+        return -1;
+    }
+
+    HANDLE h = (HANDLE)_get_osfhandle(fd);
+    unsigned int cur = SetFilePointer(h, 0, NULL, FILE_CURRENT);
+    if (cur == ~0 || SetFilePointer(h, size, NULL, FILE_BEGIN) == ~0 || !SetEndOfFile(h)) {
+        int error = GetLastError();
+        switch (GetLastError()) {
+            case ERROR_INVALID_HANDLE:
+                errno = EBADF;
+                break;
+            default:
+                errno = EIO;
+                break;
+        }
+        return -1;
+    }
+
+    return 0;
+}
+
 #endif 
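The new `ftruncate` shim above emulates the POSIX call on Windows (it backs the new on-disk build support). A sketch of the POSIX semantics it targets, shown with Python's `os.ftruncate` rather than the shim itself: shrinking discards trailing bytes, growing zero-fills.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello world")
    os.ftruncate(fd, 5)                  # shrink: keeps the first 5 bytes
    assert os.path.getsize(path) == 5
    os.ftruncate(fd, 8)                  # grow: pads with NUL bytes
    os.lseek(fd, 0, os.SEEK_SET)
    assert os.read(fd, 8) == b"hello\x00\x00\x00"
finally:
    os.close(fd)
    os.unlink(path)
```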

++++++ reproducible.patch ++++++
--- /var/tmp/diff_new_pack.TGEvdt/_old  2019-03-06 15:52:37.060421179 +0100
+++ /var/tmp/diff_new_pack.TGEvdt/_new  2019-03-06 15:52:37.076421176 +0100
@@ -3,24 +3,35 @@
 
 https://bugzilla.opensuse.org/show_bug.cgi?id=1100677
 
-Index: annoy-1.12.0/setup.py
+Index: annoy-1.15.1/setup.py
 ===================================================================
---- annoy-1.12.0.orig/setup.py
-+++ annoy-1.12.0/setup.py
-@@ -42,15 +42,12 @@ else:
+--- annoy-1.15.1.orig/setup.py
++++ annoy-1.15.1/setup.py
+@@ -36,26 +36,10 @@ with codecs.open('README.rst', encoding=
+     long_description = readme_note + fobj.read()
  
- # Not all CPUs have march as a tuning parameter
- import platform
--cputune = ['-march=native',]
--if platform.machine() == "ppc64le":
--    cputune = ['-mcpu=native',]
- 
- if os.name != 'nt':
-     compile_args = ['-O3', '-ffast-math', '-fno-associative-math']
- else:
-     compile_args = []
--    cputune = []
+ # Various platform-dependent extras
 +cputune = []
+ extra_compile_args = []
+ extra_link_args = []
  
+-if os.environ.get('TRAVIS') == 'true':
+-    # Resolving some annoying issue
+-    extra_compile_args += ['-mno-avx']
+-
+-# Not all CPUs have march as a tuning parameter
+-cputune = ['-march=native',]
+-if platform.machine() == 'ppc64le':
+-    extra_compile_args += ['-mcpu=native',]
+-
+-if os.name != 'nt':
+-    extra_compile_args += ['-O3', '-ffast-math', '-fno-associative-math']
+-
+-# #349: something with OS X Mojave causes libstd not to be found
+-if platform.system() == 'Darwin':
+-    extra_compile_args += ['-std=c++11', '-mmacosx-version-min=10.9']
+-    extra_link_args += ['-stdlib=libc++', '-mmacosx-version-min=10.9']
+-
  setup(name='annoy',
-       version='1.12.0',
+       version='1.15.1',
+       description='Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk.',

