I've also tried r787776 on Hadoop 0.19.1, and I get a NoClassDefFoundError for com/google/gson/reflect/TypeToken. I'm pretty sure this is the same error I was seeing when trying r793689 against Hadoop 0.20.0.

I've checked the mahout-*-examples.job file, and its lib directory does contain gson-1.3.jar, which in turn contains TypeToken.class at com/google/gson/reflect, so I'm not sure what's happening.
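
For anyone wanting to repeat that check, a small script along these lines would do it. This is only a sketch: the function name find_class is made up for illustration, and it assumes the .job file is an ordinary zip archive with the dependency jars nested inside it (for example under lib/).

```python
import io
import zipfile


def find_class(job_path, class_name):
    """Return the locations inside a .job/.jar archive that contain class_name.

    class_name may use dots or slashes, e.g. "com.google.gson.reflect.TypeToken".
    find_class is a hypothetical helper, not part of Mahout or Hadoop.
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    with zipfile.ZipFile(job_path) as job:
        for name in job.namelist():
            if name == entry:
                # The class sits directly in the outer archive.
                hits.append(name)
            elif name.endswith(".jar"):
                # Nested jar (for example lib/gson-1.3.jar): open it in memory.
                with zipfile.ZipFile(io.BytesIO(job.read(name))) as nested:
                    if entry in nested.namelist():
                        hits.append(name + "!/" + entry)
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_class.py <path-to-job-file> <class-name>
    for hit in find_class(sys.argv[1], sys.argv[2]):
        print(hit)
```

A hit only confirms that the class is packaged in the archive. It says nothing about which JVM (the client or a task) failed to load it, so it narrows the problem down rather than explaining it.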

On 14 Jul 2009, at 13:23, Paul Ingles wrote:

I noticed it was using 0.20.0 this morning and gave it a go. I think it failed at the clustering phase with a NoClassDefFoundError for the Gson classes, but I don't remember exactly.

I'm running from an earlier revision against 0.19 at the moment, but will try 0.20 again when it's finished and let you know how it goes.

Thanks again,
Paul

On 14 Jul 2009, at 12:58, Grant Ingersoll wrote:

Try Hadoop 0.20.0, which is what trunk is now on. I will update the docs.


On Jul 13, 2009, at 7:02 PM, Paul Ingles wrote:

Hi,

I've been going over the kmeans stuff the last few days to try and understand how it works, and how I might extend it to work with the data I'm looking to process. It's taken me a while to get a basic understanding of things, and I really appreciate having lists like this around for support.

I need to be able to label the vectors: each vector holds (for a document) a set of similarity scores across a number of attributes. I did some searching around payloads (after coming across the term in some comments) but couldn't see how to add a payload to the Vector. I then stumbled on MAHOUT-65 (https://issues.apache.org/jira/browse/MAHOUT-65), which mentions the addition of the setName method to Vector. I've tried building trunk, and although there were a few test failures for other (seemingly unrelated) examples, I continued and managed to get the mahout-examples jar/job files built to give it a whirl.

When I run the following:

$ hadoop jar examples/target/mahout-examples-0.2-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

I see it run "Preparing Input" and "Running Canopy to get initial clusters", and then finally it starts "Running KMeans". Shortly afterwards, though, it breaks with the following trace:

---snip---
Running KMeans
09/07/13 23:49:34 INFO kmeans.KMeansDriver: Input: output/data Clusters In: output/canopies Out: output Distance: org.apache.mahout.utils.EuclideanDistanceMeasure
09/07/13 23:49:34 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: 1 Input Vectors: org.apache.mahout.matrix.SparseVector
09/07/13 23:49:34 INFO kmeans.KMeansDriver: Iteration 0
09/07/13 23:49:34 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/07/13 23:49:34 INFO mapred.FileInputFormat: Total input paths to process : 2
09/07/13 23:49:34 INFO mapred.JobClient: Running job: job_200907132019_0040
09/07/13 23:49:35 INFO mapred.JobClient:  map 0% reduce 0%
09/07/13 23:49:42 INFO mapred.JobClient:  map 50% reduce 0%
09/07/13 23:49:43 INFO mapred.JobClient:  map 100% reduce 0%
09/07/13 23:49:49 INFO mapred.JobClient:  map 100% reduce 100%
09/07/13 23:49:50 INFO mapred.JobClient: Job complete: job_200907132019_0040
09/07/13 23:49:50 INFO mapred.JobClient: Counters: 16
09/07/13 23:49:50 INFO mapred.JobClient:   File Systems
09/07/13 23:49:50 INFO mapred.JobClient:     HDFS bytes read=465629
09/07/13 23:49:50 INFO mapred.JobClient:     HDFS bytes written=5631
09/07/13 23:49:50 INFO mapred.JobClient:     Local bytes read=7806
09/07/13 23:49:50 INFO mapred.JobClient:     Local bytes written=15674
09/07/13 23:49:50 INFO mapred.JobClient:   Job Counters
09/07/13 23:49:50 INFO mapred.JobClient:     Launched reduce tasks=1
09/07/13 23:49:50 INFO mapred.JobClient:     Launched map tasks=2
09/07/13 23:49:50 INFO mapred.JobClient:     Data-local map tasks=2
09/07/13 23:49:50 INFO mapred.JobClient:   Map-Reduce Framework
09/07/13 23:49:50 INFO mapred.JobClient:     Reduce input groups=7
09/07/13 23:49:50 INFO mapred.JobClient:     Combine output records=10
09/07/13 23:49:50 INFO mapred.JobClient:     Map input records=600
09/07/13 23:49:50 INFO mapred.JobClient:     Reduce output records=7
09/07/13 23:49:50 INFO mapred.JobClient:     Map output bytes=465600
09/07/13 23:49:50 INFO mapred.JobClient:     Map input bytes=448580
09/07/13 23:49:50 INFO mapred.JobClient:     Combine input records=600
09/07/13 23:49:50 INFO mapred.JobClient:     Map output records=600
09/07/13 23:49:50 INFO mapred.JobClient:     Reduce input records=10
09/07/13 23:49:50 WARN kmeans.KMeansDriver: java.io.IOException: Cannot open filename /user/paul/output/clusters-0/_logs
java.io.IOException: Cannot open filename /user/paul/output/clusters-0/_logs
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1394)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1385)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:338)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:171)
        at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:304)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:241)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.runJob(KMeansDriver.java:194)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:100)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:56)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
---snip---
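
Reading the trace, the failure happens when KMeansDriver.isConverged tries to open clusters-0/_logs through SequenceFile.Reader, so one possible explanation is that the convergence check walks everything in the output directory rather than just the cluster files. Hadoop's own FileInputFormat skips names that begin with "_" or "."; the sketch below shows that filtering idea in isolation. It is an illustration under those assumptions, not the Mahout code: is_hidden and cluster_part_files are hypothetical helpers, and part-00000 is an assumed output file name.

```python
import posixpath


def is_hidden(path):
    """True if the last path component starts with '_' or '.'.

    This mirrors the naming convention Hadoop uses for side files
    such as _logs; is_hidden itself is a hypothetical helper.
    """
    name = posixpath.basename(path.rstrip("/"))
    return name.startswith("_") or name.startswith(".")


def cluster_part_files(paths):
    """Keep only the paths that should be read as cluster output."""
    return [p for p in paths if not is_hidden(p)]


# Example listing of an iteration's output directory.
# part-00000 is an assumed name, not taken from the log above.
listing = [
    "/user/paul/output/clusters-0/_logs",
    "/user/paul/output/clusters-0/part-00000",
]

# Only the non-hidden entry would be handed to the SequenceFile reader.
print(cluster_part_files(listing))
```

In Java the same idea would normally be expressed as a PathFilter passed to FileSystem.listStatus, but whether that is the right fix for KMeansDriver is a guess from the trace alone.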

This is against revision 793689, running on my development Mac Pro (pseudo-distributed single node) with Hadoop 0.19.1.

It's a bit late to be digging through what's going on, but I'll try to take a look tomorrow. I'm really excited about giving kmeans a whirl on the document processing I'm playing with. In the meantime, I was wondering whether anyone else had seen the same thing, or knew a way to accomplish something similar with the released version (or could perhaps point me to a past good revision).

Thanks again,
Paul

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search


