Re: libjar and Mahout

Chris Mawata Fri, 20 Dec 2013 18:57:15 -0800

In your hadoop command I see a space in the part
...-core-0.9-SNAPSHOT.jar /:/apps/mahout/trunk
just after .jar. Should it not be
...-core-0.9-SNAPSHOT.jar:/apps/mahout/trunk
Chris

On 12/20/2013 2:44 PM, Sameer Tilak wrote:

Hi All,
I am running Hadoop 1.0.3 and will probably upgrade in the middle of next year. We are using Apache Pig to build our data pipeline and are planning to use Apache Mahout for data analysis.
javac -d /apps/analytics/ -classpath .:/apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT.jar:/users/p529444/software/hadoop-1.0.3/hadoop-core-1.0.3.jar:/apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT-job.jar:/apps/mahout/trunk/math/target/mahout-math-0.9-SNAPSHOT.jar:/users/p529444/software/hadoop-1.0.3/hadoop-tools-1.0.3.jar:/users/p529444/software/hadoop-1.0.3/lib/commons-logging-1.1.1.jar SimpleKMeansClustering.java
jar -cf myanalytics.jar myanalytics/
hadoop jar /apps/analytics/myanalytics.jar myanalytics.SimpleKMeansClustering -libjars /apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT.jar /:/apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT-job.jar:/apps/mahout/trunk/math/target/mahout-math-0.9-SNAPSHOT.jar
I call the following method in my SimpleKMeansClustering class:
KMeansDriver.run(conf,
                 new Path("/scratch/dummyvector.seq"),
                 new Path("/scratch/dummyvector-initclusters/part-randomSeed/"),
                 new Path("/scratch/dummyvectoroutput"),
                 new EuclideanDistanceMeasure(), 0.001, 10,
                 true, 1.0, false);
Unfortunately I get the following error. I think the jars are somehow not made available in the distributed cache. I use Vectors to represent my data and write them to a sequence file. I then use that driver to analyze the file in MapReduce mode. I think all the required jar files are available locally, but somehow they are not available in MapReduce mode. Any help with this would be great!
13/12/19 16:59:02 INFO kmeans.KMeansDriver: Input: /scratch/dummyvector.seq Clusters In: /scratch/dummyvector-initclusters/part-randomSeed Out: /scratch/dummyvectoroutput Distance: org.apache.mahout.common.distance.EuclideanDistanceMeasure
13/12/19 16:59:02 INFO kmeans.KMeansDriver: convergence: 0.001 maxIterations: 10
13/12/19 16:59:02 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/19 16:59:02 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/12/19 16:59:02 INFO compress.CodecPool: Got brand-new decompressor
13/12/19 16:59:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/12/19 16:59:02 INFO input.FileInputFormat: Total input paths to process : 1
13/12/19 16:59:03 INFO mapred.JobClient: Running job: job_201311111627_0310
13/12/19 16:59:04 INFO mapred.JobClient:  map 0% reduce 0%
13/12/19 16:59:19 INFO mapred.JobClient: Task Id : attempt_201311111627_0310_m_000000_0, Status : FAILED
Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
    at org.apache.hadoop.io.WritableName.getClass(WritableName.java:71)
    at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:1671)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1613)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:50)
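A quick local check that may help with a ClassNotFoundException like this one is to confirm which of the jars actually contains the missing class. The sketch below uses only the JDK's java.util.jar API. The class name JarClassCheck and the method containsClass are made up for illustration. It only inspects jar files on the local disk and says nothing about what ends up on the classpath of a map task.

```java
import java.io.IOException;
import java.util.jar.JarFile;

public class JarClassCheck {

    /**
     * Returns true if the jar at jarPath has an entry for the given
     * fully qualified class name, e.g. "org.apache.mahout.math.Vector".
     */
    public static boolean containsClass(String jarPath, String className) throws IOException {
        // Class files are stored in a jar under their package path.
        String entryName = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getJarEntry(entryName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: JarClassCheck <class name> <jar> [<jar> ...]");
            return;
        }
        String className = args[0];
        // Report, jar by jar, whether the class is present.
        for (int i = 1; i < args.length; i++) {
            System.out.println(args[i] + " -> " + containsClass(args[i], className));
        }
    }
}
```

It could be run as, for example, `java JarClassCheck org.apache.mahout.math.Vector` followed by the paths of the three Mahout jars from the hadoop command.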
To resolve this, I came across this article:
http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
The article says: "Include the JAR in the “-libjars” command line option of the `hadoop jar …` command. The jar will be placed in distributed cache <http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#DistributedCache> and will be made available to all of the job’s task attempts."
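One detail worth checking here: Hadoop's generic options help describes -libjars as a comma-separated list of jar files, whereas the hadoop command earlier in this message joins the jars with colons, as a classpath would. The sketch below is plain Java with no Hadoop dependency and only illustrates converting a colon-separated list into the comma-separated form. The class name LibJarsArg and the method toCommaSeparated are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class LibJarsArg {

    /**
     * Turns a colon-separated list of jar paths (classpath style) into a
     * comma-separated list. Empty segments and surrounding spaces are dropped.
     */
    public static String toCommaSeparated(String colonSeparated) {
        List<String> jars = new ArrayList<>();
        for (String part : colonSeparated.split(":")) {
            String jar = part.trim();
            if (!jar.isEmpty()) {
                jars.add(jar);
            }
        }
        return String.join(",", jars);
    }

    public static void main(String[] args) {
        // The three Mahout jars from the hadoop command, joined classpath style.
        String classpathStyle =
              "/apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT.jar:"
            + "/apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT-job.jar:"
            + "/apps/mahout/trunk/math/target/mahout-math-0.9-SNAPSHOT.jar";
        System.out.println("-libjars " + toCommaSeparated(classpathStyle));
    }
}
```

The printed value is what would go after the main class name and before any application arguments on the `hadoop jar` command line.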
For the hadoop command line options and method 1 to work, the main class should implement Tool and call ToolRunner.run(). Therefore I changed the class as follows:
I was getting an error that

public class SimpleKMeansClustering extends Configured implements Tool {
Code....

    public int run(String[] args) throws Exception
    {
        //      Configuration conf = new Configuration();
        Configuration conf = getConf();
        FileSystem fs = FileSystem.get(conf);
        Job job = new Job(conf, "SimpleKMeansClustering");

        // to accept the hdfs input and output dir at run time
        FileInputFormat.addInputPath(job, new Path("/scratch/dummyvector.seq"));
        FileOutputFormat.setOutputPath(job, new Path("/scratch/dummyvectoroutput"));
        SimpleKMeansClustering smkc = new SimpleKMeansClustering();
        System.out.println("SimpleKMeansClustering::main -- Will call SequenceFile.Writer \n");
        populateData();
        writePointsToFile("/scratch/dummyvector.seq", fs, conf);
        readPointsFromFile(fs, conf);
        runKmeansDriver(conf);

        return job.waitForCompletion(true) ? 0 : 1;
    }
    public static void main(String args[]) throws Exception {

        int res = ToolRunner.run(new SimpleKMeansClustering(), args);
        System.exit(res);
    }
}
I am having some issues with the new and old API. Can someone please point me in the correct direction?
SimpleKMeansClustering.java:148: error: method addInputPath in class FileInputFormat<K,V> cannot be applied to given types;
    FileInputFormat.addInputPath(job, new Path("/scratch/dummyvector.seq"));
                   ^
  required: JobConf,Path
  found: Job,Path
  reason: actual argument Job cannot be converted to JobConf by method invocation conversion
  where K,V are type-variables:
    K extends Object declared in class FileInputFormat
    V extends Object declared in class FileInputFormat
SimpleKMeansClustering.java:149: error: method setOutputPath in class FileOutputFormat<K,V> cannot be applied to given types;
    FileOutputFormat.setOutputPath(job, new Path("/scratch/dummyvectoroutput"));
                    ^
  required: JobConf,Path
  found: Job,Path
  reason: actual argument Job cannot be converted to JobConf by method invocation conversion
  where K,V are type-variables:
    K extends Object declared in class FileOutputFormat
    V extends Object declared in class FileOutputFormat
2 errors
