EBernhardson has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/365628 )

Change subject: Batch vectorized jaccard calculations
......................................................................

Batch vectorized jaccard calculations

When calculating particularly large groups, the worst of which
so far is 3500 queries with 1100 unique page ids, the jaccard
calculation attempts to allocate excessive amounts (tens of GB)
of intermediate data. Chunk the calculation into steps that
require less than 100MB of data at a time.

Change-Id: I702457b3efeb1aa1d01a5accba438b3a84b9695d
---
M mjolnir/norm_query.py
M mjolnir/test/test_norm_query.py
2 files changed, 41 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/search/MjoLniR refs/changes/28/365628/1

diff --git a/mjolnir/norm_query.py b/mjolnir/norm_query.py
index 301ad15..11b9f67 100644
--- a/mjolnir/norm_query.py
+++ b/mjolnir/norm_query.py
@@ -21,6 +21,14 @@
 import requests
 
 
+def _batch(a, b, size):
+    assert len(a) == len(b)
+    idx = 0
+    while idx < len(a):
+        yield a[idx:idx+size], b[idx:idx+size]
+        idx += size
+
+
 def _binary_sim(matrix):
     """Compute a jaccard similarity matrix.
 
@@ -40,17 +48,31 @@
     # The diagonal is offset by -1 because the identity in a similarity
     # matrix is always 1.
     r, c = np.tril_indices(matrix.shape[0], -1)
-    # Use those indices to build two matrices that contains all
-    # the rows we need to do a similarity comparison on
-    p1 = matrix[c]
-    p2 = matrix[r]
-    # Run the main jaccard calculation
-    intersection = np.logical_and(p1, p2).sum(1)
-    union = np.logical_or(p1, p2).sum(1).astype(np.float64)
-    # Build the result matrix with 1's on the diagonal
+
+    # Particularly large groups can blow out memory usage. Chunk the calculation
+    # into steps that require no more than ~100MB of memory at a time.
+    # We have 4 2d intermediate arrays in memory at a given time, plus the
+    # input and output:
+    #  p1 = max_rows * matrix.shape[1]
+    #  p2 = max_rows * matrix.shape[1]
+    #  intersection = max_rows * matrix.shape[1] * 4
+    #  union = max_rows * matrix.shape[1] * 8
+    # This adds up to:
+    #  memory usage = max_rows * matrix.shape[1] * 14
+    mem_limit = 100 * pow(2, 20)
+    max_rows = max(1, mem_limit // (14 * matrix.shape[1]))
     out = np.eye(matrix.shape[0])
-    # Insert the result of our similarity calculation at their original indices
-    out[c, r] = intersection / union
+    for c_batch, r_batch in _batch(c, r, max_rows):
+        # Use those indices to build two matrices that contain all
+        # the rows we need to do a similarity comparison on
+        p1 = matrix[c_batch]
+        p2 = matrix[r_batch]
+        # Run the main jaccard calculation
+        intersection = np.logical_and(p1, p2).sum(1)
+        union = np.logical_or(p1, p2).sum(1).astype(np.float64)
+        # Insert the result of our similarity calculation at their original indices
+        out[c_batch, r_batch] = intersection / union
     # Above only populated half of the matrix, the mirrored diagonal contains
     # only zeros. Fix that up by transposing. Adding the transposed matrix double
     # counts the diagonal so subtract that back out. We could skip this step and
@@ -76,6 +98,11 @@
     all_page_ids = list(set([page_id for row in source for page_id in row.hit_page_ids]))
     n_rows = len(source)
     n_columns = len(all_page_ids)
+    # No hits? Something odd, but ok. Return a unique
+    # group for each query.
+    if n_columns == 0:
+        return zip([row.query for row in source], range(n_rows))
+
     # Build up a matrix that has a unique query
     # for each row, and the columns are all known page ids. Cells are set
     # to True if that page id was shown for that query.
diff --git a/mjolnir/test/test_norm_query.py b/mjolnir/test/test_norm_query.py
index 3f2d3b2..b8962a7 100644
--- a/mjolnir/test/test_norm_query.py
+++ b/mjolnir/test/test_norm_query.py
@@ -45,6 +45,10 @@
     ([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]], [0, 0, 0]),
     # first item doesn't overlap enough to group with the other two
     ([[2, 3, 6, 7, 8], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]], [0, 1, 1]),
+    # One item no hits
+    ([[1, 2], []], [0, 1]),
+    # All items no hits
+    ([[], []], [0, 1]),
 ])
 def test_make_query_groups(hits, expected):
     row = namedtuple('row', ('query', 'hit_page_ids'))
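As a sanity check on the patch, the chunked computation should be independent of batch size. Below is a minimal standalone recreation of the chunked similarity loop (a sketch following the same steps, not the module itself; `binary_sim` and its signature are hypothetical):

```python
import numpy as np

def binary_sim(matrix, max_rows):
    # Standalone sketch of the chunked jaccard similarity above:
    # process the lower-triangle index pairs max_rows at a time.
    r, c = np.tril_indices(matrix.shape[0], -1)
    out = np.eye(matrix.shape[0])
    for start in range(0, len(r), max_rows):
        r_batch = r[start:start + max_rows]
        c_batch = c[start:start + max_rows]
        p1 = matrix[c_batch]
        p2 = matrix[r_batch]
        intersection = np.logical_and(p1, p2).sum(1)
        union = np.logical_or(p1, p2).sum(1).astype(np.float64)
        out[c_batch, r_batch] = intersection / union
    # Mirror the populated triangle into the other half; subtract the
    # identity that the transpose double-counts on the diagonal.
    return out + out.T - np.eye(matrix.shape[0])

m = np.array([[1, 1, 0], [1, 0, 1], [1, 1, 1]], dtype=bool)
# Any chunk size yields the same result as one big batch.
print(np.allclose(binary_sim(m, 1), binary_sim(m, 10)))  # True
```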

-- 
To view, visit https://gerrit.wikimedia.org/r/365628
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I702457b3efeb1aa1d01a5accba438b3a84b9695d
Gerrit-PatchSet: 1
Gerrit-Project: search/MjoLniR
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <ebernhard...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
