I'm not the authoritative voice here, but I would also agree with your interpretation -- it's indices rather than values that I'd use. I can imagine using min-hash on values, but that would not seem to be the most natural thing to do.
(I don't understand the comment about set and get(). Vectors aren't sets, and whether it's sparse or not shouldn't decide whether you want values or indices.) On Tue, Aug 16, 2011 at 7:23 AM, 刘鎏 <[email protected]> wrote: > I think, if your input vector is a set, the ele.get() should be used, > instead, if your input vector is a sparse vector, the ele.index() would be > used. > > Pls correct me if I'm wrong. > > for (int i = 0; i < numHashFunctions; i++) { > for (Vector.Element ele : featureVector) { > > /// Shouldn't the following line say ele.index(); > int value = (int) ele.get(); > > bytesToHash[0] = (byte) (value >> 24); > bytesToHash[1] = (byte) (value >> 16); > bytesToHash[2] = (byte) (value >> 8); > bytesToHash[3] = (byte) value; > int hashIndex = hashFunction[i].hash(bytesToHash); > if (minHashValues[i] > hashIndex) { > minHashValues[i] = hashIndex; > } > } > } > > On Fri, Jun 10, 2011 at 6:53 AM, Jeff Hansen <[email protected]> wrote: > >> I'm having a little trouble understanding Mahout's minhash implementation. >> >> Please correct me if I'm wrong, but the general intent of minhash is to >> evaluate the similarity of two sparse feature vectors based on the features >> they have in common, not necessarily the value of those features (as the >> values are often 1 or 0 and 0 values simply aren't tracked in the sparse >> vector). So given a space of 10 dimensions, if Jack had features 4 and 6 >> and Jill had features 5 and 6, Jacks vector would look something like {4:1, >> 6:1} and Jill's would like like {5:1, 6:1}. Since they have 1/3 total >> features in common, their Jaccard coefficient is 1/3. Also, given K random >> hash functions, we would expect about a third of them to return a minimum >> value for each of the three keys 4, 5 and 6 and thus about a third of them >> would also return the same minimum value for {4, 6} and {5, 6} (i.e. the >> third that return a minimum hash value for the key 6). That's my basic >> English explanation of the purpose of minhash -- again, somebody please >> correct me if I'm wrong. >> >> Given that understanding, can somebody explain why Mahout's minhash >> implentation is hasing the values from the feature vectors rather than the >> keys? >> >> See the following code from MinHashMapper.java >> >> for (int i = 0; i < numHashFunctions; i++) { >> for (Vector.Element ele : featureVector) { >> >> /// Shouldn't the following line say ele.index(); >> int value = (int) ele.get(); >> >> bytesToHash[0] = (byte) (value >> 24); >> bytesToHash[1] = (byte) (value >> 16); >> bytesToHash[2] = (byte) (value >> 8); >> bytesToHash[3] = (byte) value; >> int hashIndex = hashFunction[i].hash(bytesToHash); >> if (minHashValues[i] > hashIndex) { >> minHashValues[i] = hashIndex; >> } >> } >> } >> >> The code in TestMinHashClustering also seems to written with the >> expectation >> that minhash should be hashing the values rather than the keys. Am I >> reading something wrong here? Is this the intended use? Are we supposed >> to >> be putting the feature ids into the double value fields of the feature >> vectors? >> >> Thanks, >> Jeff >> > > > > -- >
