[ https://issues.apache.org/jira/browse/MAHOUT-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeff Eastman resolved MAHOUT-766.
---------------------------------

       Resolution: Not A Problem
    Fix Version/s: 0.6
         Assignee: Jeff Eastman

I think the problem here is using the default distance measure (EuclideanSquared) with fuzzyk. I added

    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \

to the script and it produced clusters that differ somewhat from each other but still have a high degree of similarity in their terms and weights. Then I decreased m to 1.1 and, predictably, the clusters diverged to be more like the kmeans results. It does seem like there is a lot of sensitivity to the values of m and the range 1 < m <= 2 has a large impact on the clusters.

I'm going to resolve this as not a problem.

> fuzzy kmeans - all cluster with the same top terms
> ---------------------------------------------------
>
>                 Key: MAHOUT-766
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-766
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering, Examples
>    Affects Versions: 0.6
>         Environment: tested in OSX and linux
>            Reporter: Paulo Magalhaes
>            Assignee: Jeff Eastman
>             Fix For: 0.6
>
>
> I believe there is something wrong with fkmeans in trunk.
> I am using code from trunk (last checkout 6/30/11). To recreate is very
> simple:
> 1) change examples/bin/build-reuters.sh to use fkmeans and set -m 2
> 2) run build-reuters.sh
> 3) Dump the cluster.
> I'm doing: ../../bin/mahout clusterdump -dt sequencefile
> -s ./mahout-work/reuters-kmeans/clusters-6 -b 100 -o
> ./reuters-clusterdump.txt -d
> ./mahout-work/reuters-out-seqdir-sparse-kmeans/dictionary.file-0
> here is what the clusters look like:
> SV-15898{n=34 c=[0:0.020, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.7254762602900604
> mln => 1.2510936664951733
> dlrs => 1.1340145215097008
> 3 => 1.0643797240793276
> pct => 1.0422760712239152
> reuter => 1.0202689935247569
> its => 0.9997771992646881
> from => 0.9903731234557381
> year => 0.8855389859684145
> vs => 0.8291746545786391
> :SV-14766{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6406710289350412
> mln => 1.2174993414858022
> dlrs => 1.0937941570322955
> 3 => 1.0334420773050856
> pct => 0.991539915235039
> reuter => 0.990042452019326
> its => 0.9508638527143669
> from => 0.9403885495991262
> vs => 0.865437130369746
> year => 0.8463503194752994
> :SV-14854{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.641260962665307
> mln => 1.217806578134094
> dlrs => 1.0941157210136143
> 3 => 1.0336934328877394
> pct => 0.991895013999163
> reuter => 0.9902889592990656
> its => 0.9512076670014483
> from => 0.9407384847445094
> vs => 0.8653426311034671
> year => 0.8466407590692175
> :SV-14890{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6410352907185948
> mln => 1.21769021136256
> dlrs => 1.0939933408434481
> 3 => 1.0335977297579235
> pct => 0.991759193577722
> reuter => 0.9901951250301172
> its => 0.9510761761632947
> from => 0.9406047832581563
> vs => 0.8653814488835572
> year => 0.8465301083353372
> :SV-14972{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.640981249652196
> mln => 1.2176595452829564
> dlrs => 1.093962519439548
> 3 => 1.0335737897463568
> pct => 0.9917266257955816
> reuter => 0.9901715950801396
> its => 0.9510446208123859
> from => 0.9405723357372776
> vs => 0.8653843699725567
> year => 0.846502466267153
> :SV-15023{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6399319888551425
> mln => 1.217099157115808
> dlrs => 1.0933830369192543
> 3 => 1.033121271434882
> pct => 0.991094828319561
> reuter => 0.9897275313905611
> its => 0.9504327303592046
> from => 0.9399480272494183
> vs => 0.8655203514280634
> year => 0.8459804922897428
> :SV-15330{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6411480082558068
> mln => 1.217746071140758
> dlrs => 1.0940532425506244
> 3 => 1.0336447143638317
> pct => 0.9918269975797083
> reuter => 0.990241145450359
> its => 0.9511417993006985
> from => 0.9406712099799636
> vs => 0.8653569180999117
> year => 0.8465844425179013
> :SV-15403{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6493270418577013
> mln => 1.221708475489808
> dlrs => 1.0983489300320377
> 3 => 1.0370024996153944
> pct => 0.9967446058994232
> reuter => 0.993528974793619
> its => 0.9558988111209523
> from => 0.9454911460774864
> vs => 0.8633642497287671
> year => 0.8505083085439775
> :SV-15514{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6414524586689534
> mln => 1.2179029815366167
> dlrs => 1.094218299808865
> 3 => 1.033773769117182
> pct => 0.9920102286561391
> reuter => 0.9903676795676004
> its => 0.9513191861395162
> from => 0.9408515920762511
> vs => 0.865304353452142
> year => 0.8467337135094862
> :SV-15549{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.640632892454694
> mln => 1.2174764812983898
> dlrs => 1.0937717467869699
> 3 => 1.033424727632325
> pct => 0.99151691360307
> reuter => 0.9900253758026865
> its => 0.9508415534060888
> from => 0.9403654699584985
> vs => 0.865436402399392
> year => 0.8463303217162843
> :SV-15616{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6402745961421197
> mln => 1.217287104215781
> dlrs => 1.0935749393200054
> 3 => 1.0332709291683844
> pct => 0.9913012005612369
> reuter => 0.9898744911012118
> its => 0.9506326562835085
> from => 0.9401525895225771
> vs => 0.8654873596392523
> year => 0.8461528918952358
> :SV-15674{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6402335213893247
> mln => 1.2172651791725515
> dlrs => 1.0935522610806727
> 3 => 1.0332532137000938
> pct => 0.991276468108388
> reuter => 0.9898571070574692
> its => 0.9506087026962596
> from => 0.9401281555632803
> vs => 0.8654927058873914
> year => 0.8461324681573653
> :SV-15720{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.641454220566282
> mln => 1.2179063418879368
> dlrs => 1.0942205822099829
> 3 => 1.0337754035575257
> pct => 0.9920113271819195
> reuter => 0.9903693325123661
> its => 0.9513202705619623
> from => 0.9408530174807668
> vs => 0.8653096216062077
> year => 0.8467355860669477
> :SV-15732{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6418679366988789
> mln => 1.218118262616823
> dlrs => 1.0944441677361394
> 3 => 1.0339502052648608
> pct => 0.9922602967957669
> reuter => 0.9905406967751569
> its => 0.9515612774046113
> from => 0.941098001639954
> vs => 0.865235154416334
> year => 0.8469379811534101
> :SV-15825{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6403540331112847
> mln => 1.2173302824011656
> dlrs => 1.0936192179118565
> 3 => 1.0333054698476525
> pct => 0.9913490440255205
> reuter => 0.9899084014354236
> its => 0.9506790000021428
> from => 0.9401999656754023
> vs => 0.8654787849286104
> year => 0.8461927112339609
> :SV-15888{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.641852069569193
> mln => 1.218106579705691
> dlrs => 1.0944336674208315
> 3 => 1.0339422184421034
> pct => 0.9922506923700831
> reuter => 0.9905327937543529
> its => 0.951551949990525
> from => 0.9410880514065464
> vs => 0.8652299423273659
> year => 0.8469287549740471
> :SV-15944{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6406094746503062
> mln => 1.2174640910103491
> dlrs => 1.0937588768380255
> 3 => 1.0334146735611798
> pct => 0.9915028147402405
> reuter => 0.9900155118531778
> its => 0.9508279001565995
> from => 0.9403515526055797
> vs => 0.865439705916966
> year => 0.846318717539638
> :SV-15952{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.641608350634413
> mln => 1.2179827157677379
> dlrs => 1.094302484756082
> 3 => 1.033839606583586
> pct => 0.9921040410110572
> reuter => 0.990432219413613
> its => 0.9514099986904929
> from => 0.9409438763575203
> vs => 0.8652760331837802
> year => 0.8468099163160301
> :SV-15954{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6429205353451672
> mln => 1.2186434984636658
> dlrs => 1.0950054459143779
> 3 => 1.0343894404834142
> pct => 0.992893505149969
> reuter => 0.9909710261706427
> its => 0.9521740690117075
> from => 0.9417194634871013
> vs => 0.8650137662755684
> year => 0.8474476266423354
> :SV-16007{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6401767760282457
> mln => 1.2172339691485916
> dlrs => 1.093520432998812
> 3 => 1.0332284013507513
> pct => 0.9912422858233993
> reuter => 0.9898327402827573
> its => 0.9505755879363272
> from => 0.9400942591120444
> vs => 0.8654979916098049
> year => 0.8461038772989482
> :SV-16037{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.640610618380475
> mln => 1.2174645746382695
> dlrs => 1.0937594396319776
> 3 => 1.0334151203058977
> pct => 0.9915035014016228
> reuter => 0.9900159476830741
> its => 0.9508285640147016
> from => 0.9403522136131415
> vs => 0.8654392679742507
> year => 0.846319234572972
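
A note on the sensitivity to m mentioned in the resolution comment: the sketch below uses the textbook fuzzy c-means membership formula, u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)), to show how m changes the membership weights of a single point. It is an illustration only and is not taken from Mahout's code; the function name and the three distances are made up for the example, and all distances are assumed to be strictly positive.

```python
def fuzzy_memberships(distances, m):
    """Membership of one point in each cluster (textbook fuzzy c-means formula).

    distances: distance from the point to each centroid, all assumed > 0.
    m: fuzziness parameter, must be > 1.
    """
    if m <= 1.0:
        raise ValueError("m must be greater than 1")
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for d_i in distances:
        # Compare this cluster's distance with the distance to every cluster.
        denom = sum((d_i / d_k) ** exponent for d_k in distances)
        memberships.append(1.0 / denom)
    return memberships


# Hypothetical distances from one document to three centroids.
example_distances = [1.0, 1.2, 1.5]

soft = fuzzy_memberships(example_distances, m=2.0)   # weights spread across clusters
crisp = fuzzy_memberships(example_distances, m=1.1)  # weight concentrated on the closest

print("m=2.0:", [round(u, 3) for u in soft])
print("m=1.1:", [round(u, 3) for u in crisp])
```

With m = 2 the closest centroid receives less than half of the weight, so each centroid tends to be updated toward a similar weighted mean of the documents. With m = 1.1 nearly all of the weight goes to the closest centroid, which is closer to what kmeans does. This is consistent with the behaviour described in the comment, though it does not by itself show what happened in this run.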