Hi there,

I want to write a program that compares DNA segments so that I can find similarities between them. The files number in the millions and the sequences themselves are long. I tried developing a program for this, but it fails. I think MinHash and LSH might be able to solve my problem. I need a program that is easy to handle.
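As far as I understand, MinHash works on sets, so each sequence first has to be turned into a set of items. A minimal sketch of that step, assuming overlapping k-mers as the set elements. The names `kmers` and `jaccard` and the choice of k are my own placeholders, not taken from any library:

```python
def kmers(seq, k=5):
    """Return the set of overlapping length-k substrings (k-mers) of seq."""
    seq = seq.upper()
    # A sequence shorter than k yields an empty set.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets (the quantity MinHash estimates)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy sequences, made up for illustration only.
s1 = kmers("ACGTACGTAC", k=4)
s2 = kmers("ACGTACGTTT", k=4)
print(jaccard(s1, s2))  # 4 shared k-mers out of 6 distinct ones
```

Computing the exact Jaccard value for every pair is not feasible for millions of files, which is why a compact signature per sequence looks attractive.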
from scipy.spatial.distance import cosine
from random import randint
import numpy as np

N = 128                  # number of hash functions (signature length)
max_val = (2**32) - 1

# Each (a, b) pair defines one hash function h(x) = (a*x + b) % prime.
# a starts at 1, because a == 0 would map every value to b.
perms = [(randint(1, max_val), randint(0, max_val)) for i in range(N)]

def minhash(s, prime=4294967311):
    '''
    Given a set `s`, pass each member of the set through all permutation
    functions, and set the `ith` position of `vec` to the `ith` permutation
    function's output if that output is smaller than `vec[i]`.
    '''
    vec = [float('inf') for i in range(N)]
    for val in s:
        # Note: hash() of a str is salted per process in Python 3 unless
        # PYTHONHASHSEED is set, so signatures are only comparable
        # within a single run.
        if not isinstance(val, int):
            val = hash(val)
        for perm_idx, perm_vals in enumerate(perms):
            a, b = perm_vals
            output = (a * val + b) % prime
            if vec[perm_idx] > output:
                vec[perm_idx] = output
    return vec
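
What I think is still missing is the comparison step. A rough sketch, assuming the signatures are plain equal-length lists like the ones `minhash` returns: the fraction of matching positions estimates the Jaccard similarity, and LSH banding puts sequences into the same bucket when they agree on at least one whole band. The names `estimate_jaccard` and `lsh_candidates` and the toy signatures are hypothetical, written only for illustration:

```python
from collections import defaultdict
from itertools import combinations


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two equal-length signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def lsh_candidates(signatures, bands):
    """Return pairs of names whose signatures agree on at least one band.

    signatures: dict mapping a name to a signature (list of numbers).
    bands: number of bands; must divide the signature length evenly.
    """
    if not signatures:
        return set()
    sig_len = len(next(iter(signatures.values())))
    if sig_len % bands != 0:
        raise ValueError("bands must divide the signature length")
    rows = sig_len // bands
    candidates = set()
    for band in range(bands):
        start = band * rows
        buckets = defaultdict(list)
        # Signatures with an identical slice for this band share a bucket.
        for name, sig in signatures.items():
            buckets[tuple(sig[start:start + rows])].append(name)
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                candidates.add(pair)
    return candidates


# Hand-written toy signatures of length 4, not real MinHash output.
sigs = {
    "seq_a": [3, 7, 1, 9],
    "seq_b": [3, 7, 4, 9],
    "seq_c": [8, 2, 5, 6],
}
print(estimate_jaccard(sigs["seq_a"], sigs["seq_b"]))
print(lsh_candidates(sigs, bands=2))
```

Only the candidate pairs would then need a full comparison, instead of every pair among millions of files. With N = 128, the number of bands trades off missed matches against false candidates, and I have not tuned it.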