I'm a Mahout novice trying to do some semantic data clustering with
Canopy clustering on some low-dimensional SequenceFiles that I
vectorized with ad hoc Java code. (Some features are strings
vectorized by their Levenshtein distance from a constant, some are
DateTime objects vectorized as milliseconds since the Unix epoch, some
are georeferences, etc.) The results look promising, but I want
to get more detail out of the clusters than I understand how to get
from ClusterDumper alone. In particular, it seems that
CSVClusterWriter should get me what I need (for each cluster, the
center and the list of vectors ordered by distance).
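
To make the vectorization concrete, here is a minimal sketch of the kind
of per-record encoding described above. It is not my actual code: the
class name FeatureVectorSketch, the method names, and the sample values
are made up for illustration, it uses java.time.Instant as a stand-in
for a DateTime object, and it produces a plain double[] rather than a
Mahout Vector.

```java
import java.time.Instant;
import java.util.Arrays;

public class FeatureVectorSketch {

    // Levenshtein edit distance, computed with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical two-feature record: a string feature encoded as its
    // edit distance from a fixed reference string, and a timestamp
    // encoded as milliseconds since the Unix epoch.
    static double[] vectorize(String name, String reference, Instant when) {
        return new double[] {
            levenshtein(name, reference),
            when.toEpochMilli()
        };
    }

    public static void main(String[] args) {
        double[] v = vectorize("Cladonia rangiferina", "Cladonia",
                               Instant.parse("2012-01-01T00:00:00Z"));
        System.out.println(Arrays.toString(v));
    }
}
```

Note that no term dictionary is involved anywhere in an encoding like
this: each vector index is simply a column position.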

When I vectorized, I never explicitly built a Dictionary, which is, I
suppose, why I get a runtime ClassCastException when I invoke
ClusterDumper.readPoints(...). I tell the ClusterDumper run method
that the dictionary type is "sequencefile", but I have no dictionary
sequencefile to offer.

So I have these questions:
1. Am I right that the exception in the dumper is caused by not having
a Dictionary file?
2. Where can I find documentation for the correct form of a
sequencefile Dictionary, and are there any convenience methods for
building it? (I start with a CSV file for the data, together with a Map
that associates column header names with a private type name that
specifies the algorithm to be applied in the vectorization.) I can
send the vectorization code if helpful.

Thanks in advance,
--Bob

Here's the dumper code, with the point of the ClassCastException indicated:

public void test() throws Exception {
   String datasetDir = "Lichen/"; // bbg, Rubiaeceae/ or fungi/ for now
   String inputFile =  "/tmp/vectors"; //inputDir + "vectors"; // input,
   String canopyOutput = "/tmp/clusters";
   String dumperInput = canopyOutput+"/clusters-0-final";
   String dumperOutput = "/tmp/clusters.txt";
   String clusterInput = dumperInput+"/"+"part-r-00000";
   String clusterOutput = "/tmp/clusterDetail.txt";
   boolean runSequential = true;
   try {
      String[] args = {"-i", inputFile, "-o", canopyOutput, "-t1",
".00000002", "-t2", ".00000001",  "-ow"};

      CanopyDriver driver = new CanopyDriver();
      driver.run(args);
      //must need Path to the sequence file here also?
      String[] dumpArgs = {"-i", dumperInput, "-o", dumperOutput,
"-dt", "sequencefile"};
      ClusterDumper dumper = new ClusterDumper();
      dumper.run(dumpArgs);

      PrintWriter writer = new PrintWriter(new File(clusterOutput));
      Path pointsPathDir = new Path(dumperInput);
      Configuration conf=new Configuration();
      // The line below throws at runtime:
      //   java.lang.ClassCastException: org.apache.hadoop.io.Text
      //   cannot be cast to org.apache.hadoop.io.IntWritable
      // Presumably need a Dictionary to pass via -d to ClusterDumper
      Map<Integer, List<WeightedPropertyVectorWritable>> clusterIdToPoints =
         ClusterDumper.readPoints(pointsPathDir, 10000, conf);
      // TODO: iterate over Map and output with CSVClusterWriter
      // (measure, the DistanceMeasure, is not yet defined in this snippet)
      CSVClusterWriter csvClusterWriter =
         new CSVClusterWriter(writer, clusterIdToPoints, measure);

   } catch (Exception e) {
       System.out.println("test caught Exception");
       e.printStackTrace(System.out);
   }
}



-- 
Robert A. Morris

Emeritus Professor  of Computer Science
UMASS-Boston
100 Morrissey Blvd
Boston, MA 02125-3390


Filtered Push Project
Harvard University Herbaria
Harvard University

email: morris....@gmail.com
web: http://efg.cs.umb.edu/
web: http://wiki.filteredpush.org
http://www.cs.umb.edu/~ram
===
The content of this communication is made entirely on my
own behalf and in no way should be deemed to express
official positions of The University of Massachusetts at Boston or
Harvard University.