[ https://issues.apache.org/jira/browse/MAHOUT-996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suneel Marthi updated MAHOUT-996:
---------------------------------
    Fix Version/s:     (was: Backlog)
                   0.9
         Assignee: Suneel Marthi  (was: Sebastian Schelter)

The recent fix for MAHOUT-1410 addresses this issue, hence marking this as 'Resolved'.

> Support NamedVectors in arff.vector job by convention
> -----------------------------------------------------
>
>                 Key: MAHOUT-996
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-996
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.7
>         Environment: OS X
>            Reporter: Andrew Harbick
>            Assignee: Suneel Marthi
>            Priority: Minor
>             Fix For: 0.9
>
>         Attachments: forillustration.patch
>
>
> If you do something like:
> MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout arff.vector --input $PWD/file.arff
> --dictOut file.bindings --output $PWD
> MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout kmeans --input $PWD/file.arff.mvc
> --clusters $PWD/output/file.clusters --output $PWD/output --numClusters 3
> --maxIter 1000 --clustering
> MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir
> $PWD/output/clusters-*-final --pointsDir $PWD/output/clusteredPoints --output
> $PWD/output/clusteranalyze.txt
> you currently don't get any information out of clusterdump that helps you
> identify which element from your source data is in which cluster.
> I wrote a patch to illustrate using an attribute (by convention) from
> the ARFF file as the name for a NamedVector. The result of clusterdump is
> much easier to use:
> VL-18589{n=6165 c=[1.376, 879.144, 3.947, 10.691, 0.874, 1.266, 16.644,
> 9.689, 2.207, 1.855] r=[0.484, 160.571, 1.959, 6.176, 0.551, 0.442, 34.125,
> 7.953, 1.988, 0.352]}
>       Weight : [props - optional]:  Point:
>       1.0: 4ee342afd04516354c000140 = [1.000, 597.000, 7.000, 7.000, 1.000,
> 1.000, 11.000, 12.000, 6.000, 2.000]
>       1.0: 4ee49257eb8b3e28c60025a2 = [1.000, 597.000, 1.000, 7.000, 1.000,
> 1.000, 8.000, 17.000, 6.000, 2.000]
>       1.0: 4ee60430ab2c714006000937 = [1.000, 597.000, 2.000, 9.000, 1.000,
> 1.000, 21.000, 21.000, 2.000, 2.000]
>       1.0: 4ef2d580ab2c71231b0019ae = [0:1.000, 1:598.000, 2:5.000,
> 3:3.000, 5:1.000, 6:4.000, 9:1.000]
>       1.0: 4eda14a30b5d3e655b0043e9 = [1.000, 599.000, 7.000, 8.000, 2.000,
> 1.000, 15.000, 7.000, 3.000, 2.000]
>       1.0: 4edba62deb8b3e27e6000614 = [0:1.000, 1:599.000, 2:1.000,
> 3:12.000, 4:1.000, 5:1.000, 6:3.000, 8:3.000, 9:2.000]
>       1.0: 4ede1ea6eb8b3e1f330050f4 = [0:1.000, 1:599.000, 2:3.000,
> 3:9.000, 4:1.000, 5:1.000, 6:14.000, 7:20.000, 9:2.000]
> ...
> I haven't done serious Java in 15 years, so the attached patch is just
> meant to illustrate the idea...
> Thanks,
> Andy
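
For readers following the proposal, the "name by convention" idea could look roughly like the sketch below. It is not the attached patch and it does not use Mahout's classes. NamedRow is a hypothetical stand-in for a named vector, and the convention that the first field of each ARFF data row is the identifier is an assumption made here for illustration; the issue text does not say which attribute the patch uses.

```java
import java.util.Arrays;

public class ArffNameSketch {

    /** Hypothetical stand-in for a named vector: a label plus numeric values. */
    static final class NamedRow {
        final String name;
        final double[] values;

        NamedRow(String name, double[] values) {
            this.name = name;
            this.values = values;
        }

        @Override
        public String toString() {
            return name + " = " + Arrays.toString(values);
        }
    }

    /**
     * Parses one comma-separated ARFF data row.
     * Assumed convention (not taken from the patch): the first field is the
     * identifier and every remaining field is numeric.
     */
    static NamedRow parseRow(String line) {
        String[] fields = line.trim().split("\\s*,\\s*");
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected an id and at least one value: " + line);
        }
        double[] values = new double[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            values[i - 1] = Double.parseDouble(fields[i]);
        }
        return new NamedRow(fields[0], values);
    }

    public static void main(String[] args) {
        // The id is borrowed from the clusterdump output quoted above; the
        // row layout itself is made up for this example.
        NamedRow row = parseRow("4ee342afd04516354c000140, 1, 597, 7");
        System.out.println(row);
    }
}
```

Carrying the identifier alongside the values is what lets a later step, such as the clusterdump output quoted above, print which source record each point came from.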