Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Minhash Clustering
(https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering)
Edited by John Lee:
---------------------------------------------------------------------
Minhash clustering performs probabilistic dimension reduction of
high-dimensional data. The essence of the technique is to hash each item using
multiple independent hash functions such that similar items collide with
higher probability than dissimilar ones. Multiple such hash tables can then be
constructed to answer near-neighbor queries efficiently.
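To make the idea concrete, here is a minimal Python sketch of minhash signatures. It is not Mahout's implementation: the function names, the prime modulus, the seed, and the sample sets are all assumptions chosen for illustration. Each hash function is a simple linear one, h(x) = (a*x + b) mod p, and the fraction of signature positions on which two sets agree serves as a rough estimate of their Jaccard similarity.

```python
import random

def make_hash_functions(num_hashes, seed=42, prime=2_147_483_647):
    """Build num_hashes random linear hash functions h(x) = (a*x + b) mod prime.

    The prime and the seed are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(0, prime))
              for _ in range(num_hashes)]
    # Bind a and b as default arguments so each lambda keeps its own pair.
    return [lambda x, a=a, b=b: (a * x + b) % prime for a, b in params]

def minhash_signature(items, hash_functions):
    """Signature: the minimum hash value of the set under each hash function."""
    return [min(h(x) for x in items) for h in hash_functions]

def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

def jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)

hashes = make_hash_functions(200)
doc_a = set(range(0, 100))
doc_b = set(range(20, 120))      # overlaps doc_a on 80 of 120 distinct items
doc_c = set(range(1000, 1100))   # shares nothing with doc_a

sig_a, sig_b, sig_c = (minhash_signature(d, hashes) for d in (doc_a, doc_b, doc_c))
print("a vs b:", jaccard(doc_a, doc_b), estimated_similarity(sig_a, sig_b))
print("a vs c:", jaccard(doc_a, doc_c), estimated_similarity(sig_a, sig_c))
```

Linear hash families are only approximately min-wise independent, so the estimate for overlapping sets is rough; using more hash functions reduces the variance but not any bias of the family itself.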
The algorithm is described in
Broder, Andrei Z. (1997), "On the resemblance and containment of documents".
There is a MinHashDriver class, which is exercised by the TestMinHashClustering
unit test. It is not included in the standard driver.props file, but it can be
run by specifying the fully qualified class name.
h5. Running Minhash
Invocation using the command line takes the form:
bin/mahout minhash \
\--input (-i) <input vectors directory> \
\--output (-o) <output file> \
\--minClusterSize (-mcs) <Minimum points inside a cluster> \
\--minVectorSize (-mvs) <Minimum size of vector to be hashed> \
\--hashType (-ht) <Type of hash function \{linear, polynomial, murmur\}> \
\--numHashFunctions (-nh) <Number of hash functions to be used> \
\--keyGroups (-kg) <Number of key groups to be used> \
\--numReducers (-r) <Number of reduce tasks (default = 2)> \
\--debugOutput (-debug) \
\--overwrite (-ow) <if specified, it overwrites directory> \
\--help (-h) \
\--tempDir <Intermediate output directory> \
\--startPhase <startPhase> \
\--endPhase <endPhase>
You can also run minhash as:
mahout org.apache.mahout.clustering.minhash.MinHashDriver ...
Output is a file with:
[Hash Value] filename
Note that there will be duplicates, and with -nh 2, the filename will appear
twice.
For example, if run with -nh 2 and -kg 2, the file will contain:
[Hash Value#1]-[Hash Value#2] /[filename]
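The following Python sketch shows one way the output keys described above could arise from --numHashFunctions and --keyGroups. It is an assumption made for illustration, not a transcription of MinHashDriver: the sliding-window grouping with wrap-around, the function names, the file names, and the hash values are all hypothetical. Each item gets one key per hash function, each key joins keyGroups hash values with a hyphen, and keys shared by fewer than minClusterSize items are dropped.

```python
from collections import defaultdict

def cluster_keys(min_hashes, key_groups):
    """Join key_groups consecutive min-hash values (wrapping around) with '-'.

    One key is produced per hash function, so an item is emitted
    len(min_hashes) times. Assumed scheme, for illustration only.
    """
    n = len(min_hashes)
    return [
        "-".join(str(min_hashes[(i + j) % n]) for j in range(key_groups))
        for i in range(n)
    ]

def group_by_key(signatures, key_groups, min_cluster_size):
    """Collect item names under each key; keep keys with enough members."""
    clusters = defaultdict(list)
    for name, sig in signatures.items():
        for key in cluster_keys(sig, key_groups):
            clusters[key].append(name)
    return {k: v for k, v in clusters.items() if len(v) >= min_cluster_size}

# Hypothetical min-hash values for three files, with two hash functions each.
signatures = {
    "/docs/a.txt": [17, 42],
    "/docs/b.txt": [17, 42],
    "/docs/c.txt": [17, 99],
}

clusters = group_by_key(signatures, key_groups=2, min_cluster_size=2)
for key, names in sorted(clusters.items()):
    for name in names:
        print(key, name)
```

Under this assumed scheme, every surviving filename is printed once per hash function, which matches the duplicates noted above for -nh 2. Larger key groups make a shared key less likely for dissimilar items, at the cost of missing some similar pairs.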