Let me try that today.

Date: Fri, 20 Dec 2013 21:55:44 -0500
From: chris.maw...@gmail.com
To: user@hadoop.apache.org
Subject: Re: libjar and Mahout
In your hadoop command I see a space in the part ...-core-0.9-SNAPSHOT.jar /:/apps/mahout/trunk just after .jar. Should it not be ...-core-0.9-SNAPSHOT.jar:/apps/mahout/trunk?

Chris

On 12/20/2013 2:44 PM, Sameer Tilak wrote:

Hi All,

I am running Hadoop 1.0.3 -- probably will upgrade mid-next year. We are using Apache Pig to build our data pipeline and are planning to use Apache Mahout for data analysis.

javac -d /apps/analytics/ -classpath .:/apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT.jar:/users/p529444/software/hadoop-1.0.3/hadoop-core-1.0.3.jar:/apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT-job.jar:/apps/mahout/trunk/math/target/mahout-math-0.9-SNAPSHOT.jar:/users/p529444/software/hadoop-1.0.3/hadoop-tools-1.0.3.jar:/users/p529444/software/hadoop-1.0.3/lib/commons-logging-1.1.1.jar SimpleKMeansClustering.java

jar -cf myanalytics.jar myanalytics/

hadoop jar /apps/analytics/myanalytics.jar myanalytics.SimpleKMeansClustering -libjars /apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT.jar /:/apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT-job.jar:/apps/mahout/trunk/math/target/mahout-math-0.9-SNAPSHOT.jar

I call the following method in my SimpleKMeansClustering class:

KMeansDriver.run(conf, new Path("/scratch/dummyvector.seq"), new Path("/scratch/dummyvector-initclusters/part-randomSeed/"), new Path("/scratch/dummyvectoroutput"), new EuclideanDistanceMeasure(), 0.001, 10, true, 1.0, false);

Unfortunately I get the following error; I think the jars are somehow not made available in the distributed cache. I use Vectors to represent my data and write them to a sequence file, and I then use the driver to analyze that data in MapReduce mode. Locally all the required jar files are available, but somehow in MapReduce mode they are not. Any help with this would be great!
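Beyond the stray space Chris points out, note that GenericOptionsParser's -libjars option takes a single comma-separated list of jars (unlike a Java -classpath, which is colon-separated), and that list must reach the parser as one shell argument with no embedded spaces. A tiny illustrative sketch of building such an argument (the jar paths are simply the ones from this thread):

```java
import java.util.List;

public class LibJarsArg {
    // -libjars wants ONE comma-separated argument; an embedded space
    // splits it into two shell words, and everything after the space is
    // never seen by the option parser.
    static String libJars(List<String> jars) {
        return String.join(",", jars);
    }

    public static void main(String[] args) {
        String arg = libJars(List.of(
                "/apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT.jar",
                "/apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT-job.jar",
                "/apps/mahout/trunk/math/target/mahout-math-0.9-SNAPSHOT.jar"));
        System.out.println("-libjars " + arg);
    }
}
```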
13/12/19 16:59:02 INFO kmeans.KMeansDriver: Input: /scratch/dummyvector.seq Clusters In: /scratch/dummyvector-initclusters/part-randomSeed Out: /scratch/dummyvectoroutput Distance: org.apache.mahout.common.distance.EuclideanDistanceMeasure
13/12/19 16:59:02 INFO kmeans.KMeansDriver: convergence: 0.001 max Iterations: 10
13/12/19 16:59:02 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/19 16:59:02 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/12/19 16:59:02 INFO compress.CodecPool: Got brand-new decompressor
13/12/19 16:59:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/12/19 16:59:02 INFO input.FileInputFormat: Total input paths to process : 1
13/12/19 16:59:03 INFO mapred.JobClient: Running job: job_201311111627_0310
13/12/19 16:59:04 INFO mapred.JobClient: map 0% reduce 0%
13/12/19 16:59:19 INFO mapred.JobClient: Task Id : attempt_201311111627_0310_m_000000_0, Status : FAILED
Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
    at org.apache.hadoop.io.WritableName.getClass(WritableName.java:71)
    at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:1671)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1613)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:50)

To resolve this, I came across this article: http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/

It says: "Include the JAR in the “-libjars” command line option of the `hadoop jar …` command. The jar will be placed in distributed cache and will be made available to all of the job’s task attempts."

For the hadoop command-line options (the article's method 1) to work, the main class should implement Tool and call ToolRunner.run(). Therefore I changed the class as follows:

public class SimpleKMeansClustering extends Configured implements Tool {
    // Code....

    public int run(String[] args) throws Exception {
        // Configuration conf = new Configuration();
        Configuration conf = getConf();
        FileSystem fs = FileSystem.get(conf);
        Job job = new Job(conf, "SimpleKMeansClustering");
        // to accept the hdfs input and output dir at run time
        FileInputFormat.addInputPath(job, new Path("/scratch/dummyvector.seq"));
        FileOutputFormat.setOutputPath(job, new Path("/scratch/dummyvectoroutput"));
        SimpleKMeansClustering smkc = new SimpleKMeansClustering();
        System.out.println("SimpleKMeansClustering::main -- Will call SequenceFile.Writer \n");
        populateData();
        writePointsToFile("/scratch/dummyvector.seq", fs, conf);
        readPointsFromFile(fs, conf);
        runKmeansDriver(conf);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String args[]) throws Exception {
        int res = ToolRunner.run(new SimpleKMeansClustering(), args);
        System.exit(res);
    }
}

I am having some issues with the new and old API. Can someone please point me in the correct direction? I now get the following compile errors:
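The ClassNotFoundException above is thrown in the task JVM, so the Mahout jars really are missing from the task classpath, not the client classpath. Besides -libjars (which only works once GenericOptionsParser/Tool is wired up), Hadoop 1.x also lets you add jars to the task classpath programmatically via DistributedCache. A hedged sketch, assuming the jars have already been uploaded to HDFS (the /libs paths below are hypothetical); this must be called on the job's Configuration before submission:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class TaskClasspath {
    // Programmatic alternative to -libjars on Hadoop 1.x: each jar is
    // shipped through the distributed cache and prepended to the
    // classpath of every task attempt.
    static void addMahoutJars(Configuration conf) throws IOException {
        DistributedCache.addFileToClassPath(
                new Path("/libs/mahout-core-0.9-SNAPSHOT.jar"), conf);
        DistributedCache.addFileToClassPath(
                new Path("/libs/mahout-math-0.9-SNAPSHOT.jar"), conf);
    }
}
```

This avoids depending on the command line entirely, at the cost of hard-coding the HDFS jar locations.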
SimpleKMeansClustering.java:148: error: method addInputPath in class FileInputFormat<K,V> cannot be applied to given types;
        FileInputFormat.addInputPath(job, new Path("/scratch/dummyvector.seq"));
                       ^
  required: JobConf,Path
  found: Job,Path
  reason: actual argument Job cannot be converted to JobConf by method invocation conversion
  where K,V are type-variables:
    K extends Object declared in class FileInputFormat
    V extends Object declared in class FileInputFormat
SimpleKMeansClustering.java:149: error: method setOutputPath in class FileOutputFormat<K,V> cannot be applied to given types;
        FileOutputFormat.setOutputPath(job, new Path("/scratch/dummyvectoroutput"));
                        ^
  required: JobConf,Path
  found: Job,Path
  reason: actual argument Job cannot be converted to JobConf by method invocation conversion
  where K,V are type-variables:
    K extends Object declared in class FileOutputFormat
    V extends Object declared in class FileOutputFormat
2 errors
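The "required: JobConf,Path / found: Job,Path" messages mean the old-API classes are being imported: org.apache.hadoop.mapred.FileInputFormat and FileOutputFormat take a JobConf, while the new-API equivalents under org.apache.hadoop.mapreduce.lib take a Job. A sketch of the likely import fix, matching the Job-based code in the message above:

```java
// Use the new (org.apache.hadoop.mapreduce) API consistently: these
// FileInputFormat/FileOutputFormat overloads accept a Job, not a JobConf.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// ... then, inside run(String[] args), the original calls compile as-is:
// Job job = new Job(conf, "SimpleKMeansClustering");
// FileInputFormat.addInputPath(job, new Path("/scratch/dummyvector.seq"));
// FileOutputFormat.setOutputPath(job, new Path("/scratch/dummyvectoroutput"));
```

Mixing the two packages in one class is the usual cause of exactly these two errors; the org.apache.hadoop.mapred imports should be removed rather than kept alongside.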