EBernhardson has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/365628 )
Change subject: Batch vectorized jaccard calculations
......................................................................

Batch vectorized jaccard calculations

When calculating particularly large groups, the worst of which so far is
3500 queries with 1100 unique page ids, the jaccard calculation attempts
to allocate excessive (tens of GB) amounts of memory. Chunk the
calculation into steps that require less than 100MB of data at a time.

Change-Id: I702457b3efeb1aa1d01a5accba438b3a84b9695d
---
M mjolnir/norm_query.py
M mjolnir/test/test_norm_query.py
2 files changed, 41 insertions(+), 10 deletions(-)


git pull ssh://gerrit.wikimedia.org:29418/search/MjoLniR refs/changes/28/365628/1

diff --git a/mjolnir/norm_query.py b/mjolnir/norm_query.py
index 301ad15..11b9f67 100644
--- a/mjolnir/norm_query.py
+++ b/mjolnir/norm_query.py
@@ -21,6 +21,14 @@
 import requests
 
 
+def _batch(a, b, size):
+    assert len(a) == len(b)
+    idx = 0
+    while idx < len(a):
+        yield a[idx:idx+size], b[idx:idx+size]
+        idx += size
+
+
 def _binary_sim(matrix):
     """Compute a jaccard similarity matrix.
 
@@ -40,17 +48,31 @@
     # The diagonal is offset by -1 because the identity in a similarity
     # matrix is always 1.
     r, c = np.tril_indices(matrix.shape[0], -1)
-    # Use those indices to build two matrices that contains all
-    # the rows we need to do a similarity comparison on
-    p1 = matrix[c]
-    p2 = matrix[r]
-    # Run the main jaccard calculation
-    intersection = np.logical_and(p1, p2).sum(1)
-    union = np.logical_or(p1, p2).sum(1).astype(np.float64)
-    # Build the result matrix with 1's on the diagonal
+
+    # Particularly large groups can blow out memory usage. Chunk the calculation
+    # into steps that require no more than ~100MB of memory at a time.
+    # We have 4 2d intermediate arrays in memory at a given time, plus the
+    # input and output:
+    #   p1 = max_rows * matrix.shape[1]
+    #   p2 = max_rows * matrix.shape[1]
+    #   intersection = max_rows * matrix.shape[1] * 4
+    #   union = max_rows * matrix.shape[1] * 8
+    # This adds up to:
+    #   memory usage = max_rows * matrix.shape[1] * 14
+    mem_limit = 100 * pow(2, 20)
+    max_rows = mem_limit // (14 * matrix.shape[1])
+    # Build the result matrix with 1's on the diagonal
     out = np.eye(matrix.shape[0])
-    # Insert the result of our similarity calculation at their original indices
-    out[c, r] = intersection / union
+    for c_batch, r_batch in _batch(c, r, max_rows):
+        # Use those indices to build two matrices that contain all
+        # the rows we need to do a similarity comparison on
+        p1 = matrix[c_batch]
+        p2 = matrix[r_batch]
+        # Run the main jaccard calculation
+        intersection = np.logical_and(p1, p2).sum(1)
+        union = np.logical_or(p1, p2).sum(1).astype(np.float64)
+        # Insert the result of our similarity calculation at their original indices
+        out[c_batch, r_batch] = intersection / union
     # Above only populated half of the matrix, the mirrored diagonal contains
     # only zeros. Fix that up by transposing. Adding the transposed matrix double
     # counts the diagonal so subtract that back out. We could skip this step and
@@ -76,6 +98,11 @@
     all_page_ids = list(set([page_id for row in source for page_id in row.hit_page_ids]))
     n_rows = len(source)
     n_columns = len(all_page_ids)
+    # No hits? something odd ... but ok. Return a unique
+    # group for each query.
+    if n_columns == 0:
+        return zip([row.query for row in source], range(n_rows))
+
     # Build up a matrix that has a unique query
     # for each row, and the columns are all known page ids. Cells are set
     # to True if that page id was shown for that query.
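For reviewers, the batching can be exercised outside the patch with a standalone sketch. Names mirror the patch, but `binary_sim` here is a simplified stand-in for the module's `_binary_sim` (it inlines the transpose fix-up from the surrounding context lines), and the toy matrix is illustrative:

```python
import numpy as np

def _batch(a, b, size):
    # Mirrors the patch's helper: walk two equal-length index arrays
    # in fixed-size chunks.
    assert len(a) == len(b)
    idx = 0
    while idx < len(a):
        yield a[idx:idx + size], b[idx:idx + size]
        idx += size

def binary_sim(matrix, max_rows):
    # Batched jaccard similarity over the lower-triangle index pairs,
    # following the structure of the patched _binary_sim.
    r, c = np.tril_indices(matrix.shape[0], -1)
    out = np.eye(matrix.shape[0])
    for c_batch, r_batch in _batch(c, r, max_rows):
        p1 = matrix[c_batch]
        p2 = matrix[r_batch]
        intersection = np.logical_and(p1, p2).sum(1)
        union = np.logical_or(p1, p2).sum(1).astype(np.float64)
        out[c_batch, r_batch] = intersection / union
    # Mirror across the diagonal; subtract the double-counted identity.
    return out + out.T - np.eye(matrix.shape[0])

# A tiny boolean query-by-page matrix; batching must not change the result.
m = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=bool)
assert np.allclose(binary_sim(m, max_rows=2), binary_sim(m, max_rows=100))
```

Any `max_rows` value yields the same similarity matrix as the unbatched path; only peak memory changes.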
diff --git a/mjolnir/test/test_norm_query.py b/mjolnir/test/test_norm_query.py
index 3f2d3b2..b8962a7 100644
--- a/mjolnir/test/test_norm_query.py
+++ b/mjolnir/test/test_norm_query.py
@@ -45,6 +45,10 @@
     ([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]], [0, 0, 0]),
     # first item doesn't overlap enough to group with the other two
     ([[2, 3, 6, 7, 8], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]], [0, 1, 1]),
+    # One item with no hits
+    ([[1, 2], []], [0, 1]),
+    # All items with no hits
+    ([[], []], [0, 1]),
 ])
 def test_make_query_groups(hits, expected):
     row = namedtuple('row', ('query', 'hit_page_ids'))

-- 
To view, visit https://gerrit.wikimedia.org/r/365628
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I702457b3efeb1aa1d01a5accba438b3a84b9695d
Gerrit-PatchSet: 1
Gerrit-Project: search/MjoLniR
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <ebernhard...@wikimedia.org>
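A back-of-the-envelope check of the commit message's worst case (3500 queries, 1100 unique page ids), using the ~14-bytes-per-cell factor that the patch comment itself estimates:

```python
queries, page_ids = 3500, 1100
pairs = queries * (queries - 1) // 2   # lower-triangle index pairs
unbatched = pairs * page_ids * 14      # bytes, per the patch's own estimate
# unbatched / 2**30 is roughly 88 GiB, matching the "tens of GB" in the
# commit message.
mem_limit = 100 * 2**20                # the 100MB cap from the patch
max_rows = mem_limit // (14 * page_ids)
# max_rows comes out to 6808 index pairs per chunk for this group.
```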