Hi,

I have some test data consisting of a number of points, written to a sequence 
file using the Clojure script below (I am equally bad at both Java and 
Clojure; since I really don't like Java, I write my scripts in Clojure 
whenever possible).

    #!./bin/clj
    (ns sensei.sequence.core)

    (require 'clojure.string)
    (require 'clojure.java.io)

    (import org.apache.hadoop.conf.Configuration)
    (import org.apache.hadoop.fs.FileSystem)
    (import org.apache.hadoop.fs.Path)
    (import org.apache.hadoop.io.SequenceFile)
    (import org.apache.hadoop.io.Text)

    (import org.apache.mahout.math.VectorWritable)
    (import org.apache.mahout.math.SequentialAccessSparseVector)

    (with-open [reader (clojure.java.io/reader *in*)]
      (let [hadoop_configuration (doto (new Configuration)
                                   (. set "fs.default.name" "hdfs://localhost:9000/"))
            hadoop_fs (FileSystem/get hadoop_configuration)
            writer (reduce
                     (fn [writer [index value]]
                       (. writer append index value)
                       writer)
                     (SequenceFile/createWriter
                       hadoop_fs
                       hadoop_configuration
                       (new Path "test/sensei")
                       Text
                       VectorWritable)
                     (map
                       (fn [[tag row_vector]]
                         (let [input_index (new Text tag)
                               input_vector (new VectorWritable)]
                           (. input_vector set row_vector)
                           [input_index input_vector]))
                       (map
                         (fn [[tag photo_list]]
                           (let [photo_map (apply hash-map photo_list)
                                 input_vector (new SequentialAccessSparseVector
                                                   (count (vals photo_map)))]
                             (loop [frequency_list (vals photo_map)]
                               (if (empty? frequency_list)
                                 [tag input_vector]
                                 (do
                                   (. input_vector set
                                      (mod (count frequency_list)
                                           (count (vals photo_map)))
                                      (Integer/parseInt (first frequency_list)))
                                   (recur (rest frequency_list)))))))
                         (reduce
                           (fn [result next_line]
                             (let [[tag photo frequency]
                                   (clojure.string/split next_line #" ")]
                               (update-in result [tag]
                                 #(if (nil? %)
                                    [photo frequency]
                                    (conj % photo frequency)))))
                           {}
                           (line-seq reader)))))]
        ;; close the writer so the sequence file is actually flushed to HDFS
        (. writer close)))

Basically, the script receives input (from stdin) in this format:

    tag_uri image_uri count

e.g.

    http://flickr.com/photos/tags/ísland http://flickr.com/photos/13980928@N03/6001200971 0
    http://flickr.com/photos/tags/ísland http://flickr.com/photos/21207178@N07/5441742937 0
    http://flickr.com/photos/tags/ísland http://flickr.com/photos/25845846@N06/3033371575 0
    http://flickr.com/photos/tags/ísland http://flickr.com/photos/30366924@N08/5772100510 0
    http://flickr.com/photos/tags/ísland http://flickr.com/photos/31343451@N00/5957189406 0
    http://flickr.com/photos/tags/ísland http://flickr.com/photos/36662563@N00/4815218552 1
    http://flickr.com/photos/tags/ísland http://flickr.com/photos/38583880@N00/5686968462 0
    http://flickr.com/photos/tags/ísland http://flickr.com/photos/43335486@N00/5794673203 0
    http://flickr.com/photos/tags/ísland http://flickr.com/photos/46857830@N03/5651576112 0
    http://flickr.com/photos/tags/ísland http://flickr.com/photos/99996011@N00/5396566822 0

The script then turns them into a sequence file in which each entry represents 
one point (10 dimensions in this example), with the key set to the tag_uri 
<http://flickr.com/photos/tags/ísland> and the value set to the point described 
by the frequency vector (0 0 0 0 0 1 0 0 0 0).
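In other words, stripped of the Hadoop parts, the grouping step is meant to do something like the following pure-Clojure sketch (`lines->frequency-vectors` is just an illustrative name; this simplified version keeps the counts in input order rather than reproducing the mod-based index arithmetic from the script above):

```clojure
;; Pure-Clojure sketch of the grouping step: each input line is
;; "tag_uri image_uri count"; collect the counts per tag.
(require 'clojure.string)

(defn lines->frequency-vectors
  "Illustrative helper: turns input lines into {tag [count ...]}."
  [lines]
  (reduce
    (fn [result line]
      (let [[tag _photo frequency] (clojure.string/split line #" ")]
        ;; (fnil conj []) starts a fresh vector the first time a tag is seen
        (update-in result [tag] (fnil conj []) (Integer/parseInt frequency))))
    {}
    lines))

;; Example with two tags:
(lines->frequency-vectors ["t1 p1 0" "t1 p2 1" "t2 p3 5"])
;; => {"t1" [0 1], "t2" [5]}
```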

I then use a script (in two different variations) to submit the data as a 
clustering job; however, I am getting an error that I don't know how to fix. 
It seems that something is wrong with the initial clusters.

Script variation 1:

    #!./bin/clj

    (ns sensei.clustering.fkmeans)

    (import org.apache.hadoop.conf.Configuration)
    (import org.apache.hadoop.fs.Path)

    (import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
    (import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
    (import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)

    (let [hadoop_configuration (doto (new Configuration)
                                 (. set "fs.default.name" "hdfs://localhost:9000/"))
          driver (new FuzzyKMeansDriver)]
      (. driver setConf hadoop_configuration)
      (. driver
         run
         (into-array String ["--input" "test/sensei"
                             "--output" "test/clusters"
                             "--clusters" "test/clusters/clusters-0"
                             "--clustering"
                             "--overwrite"
                             "--emitMostLikely" "false"
                             "--numClusters" "3"
                             "--maxIter" "10"
                             "--m" "5"])))

Script variation 2:

    #!./bin/clj

    (ns sensei.clustering.fkmeans)

    (import org.apache.hadoop.conf.Configuration)
    (import org.apache.hadoop.fs.Path)

    (import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
    (import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
    (import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)

    (let [hadoop_configuration (doto (new Configuration)
                                 (. set "fs.default.name" "hdfs://127.0.0.1:9000/"))
          input_path (new Path "test/sensei")
          output_path (new Path "test/clusters")
          clusters_in_path (new Path "test/clusters/cluster-0")]
      (FuzzyKMeansDriver/run
        hadoop_configuration
        input_path
        (RandomSeedGenerator/buildRandom
          hadoop_configuration
          input_path
          clusters_in_path
          (int 2)
          (new EuclideanDistanceMeasure))
        output_path
        (new EuclideanDistanceMeasure)
        (double 0.5)
        (int 10)
        (float 5.0)
        true
        false
        (double 0.0)
        false))  ;; runSequential

I am getting the same error with both variations:

    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
    11/08/25 15:20:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new compressor
    11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new decompressor
    11/08/25 15:20:17 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    11/08/25 15:20:17 INFO input.FileInputFormat: Total input paths to process : 1
    11/08/25 15:20:17 INFO mapred.JobClient: Running job: job_local_0001
    11/08/25 15:20:17 INFO mapred.MapTask: io.sort.mb = 100
    11/08/25 15:20:17 INFO mapred.MapTask: data buffer = 79691776/99614720
    11/08/25 15:20:17 INFO mapred.MapTask: record buffer = 262144/327680
    11/08/25 15:20:17 WARN mapred.LocalJobRunner: job_local_0001
    java.lang.IllegalStateException: No clusters found. Check your -c path.
            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansMapper.setup(FuzzyKMeansMapper.java:62)
            at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
            at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
            at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
    11/08/25 15:20:18 INFO mapred.JobClient:  map 0% reduce 0%
    11/08/25 15:20:18 INFO mapred.JobClient: Job complete: job_local_0001
    11/08/25 15:20:18 INFO mapred.JobClient: Counters: 0
    Exception in thread "main" java.lang.RuntimeException: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
            at clojure.lang.Util.runtimeException(Util.java:153)
            at clojure.lang.Compiler.eval(Compiler.java:6417)
            at clojure.lang.Compiler.load(Compiler.java:6843)
            at clojure.lang.Compiler.loadFile(Compiler.java:6804)
            at clojure.main$load_script.invoke(main.clj:282)
            at clojure.main$script_opt.invoke(main.clj:342)
            at clojure.main$main.doInvoke(main.clj:426)
            at clojure.lang.RestFn.invoke(RestFn.java:436)
            at clojure.lang.Var.invoke(Var.java:409)
            at clojure.lang.AFn.applyToHelper(AFn.java:167)
            at clojure.lang.Var.applyTo(Var.java:518)
            at clojure.main.main(main.java:37)
    Caused by: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.runIteration(FuzzyKMeansDriver.java:252)
            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersMR(FuzzyKMeansDriver.java:421)
            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:345)
            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
            at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
            at clojure.lang.Compiler.eval(Compiler.java:6406)
            ... 10 more

Note that the 2nd variation takes a runSequential flag (the last argument); if 
I set it to true, I get:

    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
    11/09/07 14:32:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new compressor
    11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new decompressor
    Exception in thread "main" java.lang.IllegalStateException: Clusters is empty!
            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersSeq(FuzzyKMeansDriver.java:361)
            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:343)
            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
            at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
            at clojure.lang.Compiler.eval(Compiler.java:6465)
            at clojure.lang.Compiler.load(Compiler.java:6902)
            at clojure.lang.Compiler.loadFile(Compiler.java:6863)
            at clojure.main$load_script.invoke(main.clj:282)
            at clojure.main$script_opt.invoke(main.clj:342)
            at clojure.main$main.doInvoke(main.clj:426)
            at clojure.lang.RestFn.invoke(RestFn.java:436)
            at clojure.lang.Var.invoke(Var.java:409)
            at clojure.lang.AFn.applyToHelper(AFn.java:167)
            at clojure.lang.Var.applyTo(Var.java:518)
            at clojure.main.main(main.java:37)

Now, if I cluster the data using the CLI tool, it completes without error:

    $ bin/mahout fkmeans --input test/sensei --output test/clusters \
        --clusters test/clusters/clusters-0 --clustering --overwrite \
        --emitMostLikely false --numClusters 10 --maxIter 10 --m 5

However, even with the --clustering option present, I am not seeing any points 
in the cluster dump generated with this command:

    $ ./bin/mahout clusterdump --seqFileDir test/clusters/clusters-1 \
        --pointsDir test/clusters/clusteredPoints --output sensei.txt

And yeah, the command completed without any error too.

I've been stuck on this problem over and over again for months, and I still 
can't get the clustering done properly :(

Best wishes,
Jeffrey04
