Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Minhash Clustering (https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering)
Edited by John Lee:
---------------------------------------------------------------------
Minhash clustering performs probabilistic dimension reduction of high-dimensional data. The essence of the technique is to hash each item with multiple independent hash functions, chosen so that similar items are more likely to collide than dissimilar ones. Multiple such hash tables can then be constructed to answer near-neighbor queries efficiently. The algorithm is described in Broder, Andrei Z. (1997), "On the resemblance and containment of documents".

There is a MinHashDriver class, which is exercised by the TestMinHashClustering unit test. It is not listed in the standard driver.props file, but it can be run by specifying the full package name.

h5. Running Minhash

Invocation using the command line takes the form:

bin/mahout minhash \
  \--input <input vectors directory> \
  \--output <output> \
  \--minClusterSize <minClusterSize> \
  \--minVectorSize <minVectorSize> \
  \--hashType <hashType> \
  \--numHashFunctions <numHashFunctions> \
  \--keyGroups <keyGroups> \
  \--numReducers <numReducers> \
  \--debugOutput \
  \--overwrite \
  \--help \
  \--tempDir <tempDir> \
  \--startPhase <startPhase> \
  \--endPhase <endPhase>

Job-Specific Options:

\--input (-i) input: Path to job input directory.
\--output (-o) output: The directory pathname for output.
\--minClusterSize (-mcs) minClusterSize: Minimum points inside a cluster.
\--minVectorSize (-mvs) minVectorSize: Minimum size of vector to be hashed.
\--hashType (-ht) hashType: Type of hash function to use. Available types: linear, polynomial, murmur.
\--numHashFunctions (-nh) numHashFunctions: Number of hash functions to be used.
\--keyGroups (-kg) keyGroups: Number of key groups to be used.
\--numReducers (-r) numReducers: The number of reduce tasks. Defaults to 2.
\--debugOutput (-debug): Output the whole vectors for debugging.
\--overwrite (-ow): If present, overwrite the output directory before running job.
\--help (-h): Print out help.
\--tempDir tempDir: Intermediate output directory.
\--startPhase startPhase: First phase to run.
\--endPhase endPhase: Last phase to run.
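
The hashing idea behind Minhash clustering can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the code in MinHashDriver: the class name MinHashSketch, the linear hash family h(x) = (a*x + b) mod p, and the way keyGroups consecutive min-hash values are concatenated into a cluster key are all hypothetical choices made for this example. Items are assumed to be sets of non-negative integer feature ids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Illustrative MinHash sketch (hypothetical; not Mahout's implementation). */
final class MinHashSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1, a Mersenne prime
    private final long[] a; // per-function multipliers, in [1, PRIME - 1]
    private final long[] b; // per-function offsets, in [0, PRIME - 1]

    MinHashSketch(int numHashFunctions, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashFunctions];
        b = new long[numHashFunctions];
        for (int i = 0; i < numHashFunctions; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1);
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    /** For each hash function, keep the minimum hash over all features of the item. */
    long[] signature(int[] features) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int x : features) {
            for (int i = 0; i < a.length; i++) {
                long h = Math.floorMod(a[i] * x + b[i], PRIME);
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    /** Fraction of positions where two signatures agree; estimates Jaccard similarity. */
    static double estimateSimilarity(long[] s1, long[] s2) {
        int equal = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                equal++;
            }
        }
        return (double) equal / s1.length;
    }

    /** Assumed keying scheme: one key per position, built from keyGroups consecutive values. */
    static List<String> clusterKeys(long[] sig, int keyGroups) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < sig.length; i++) {
            StringBuilder key = new StringBuilder();
            for (int j = 0; j < keyGroups; j++) {
                key.append(sig[(i + j) % sig.length]).append('-'); // wraps around the signature
            }
            keys.add(key.toString());
        }
        return keys;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(20, 42L);
        long[] s1 = mh.signature(new int[] {1, 2, 3, 4, 5, 6, 7, 8});
        long[] s2 = mh.signature(new int[] {1, 2, 3, 4, 5, 6, 7, 9});
        System.out.println("estimated Jaccard similarity: " + estimateSimilarity(s1, s2));
        System.out.println("first cluster key: " + clusterKeys(s1, 2).get(0));
    }
}
```

In this sketch, two items become candidates for the same cluster when they share at least one key. Using more values per key makes keys more selective, while using more hash functions makes the similarity estimate less noisy.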
