How can I convert my SequenceFile<Text,VectorWritable> to SequenceFile<IntWritable,VectorWritable>? Is there any other way I can parse my documents directory to get vectors and then get similar documents?
As far as I know, RowSimilarityJob would give me similar rows (in text terms, similar documents). Am I correct?

-----Original Message-----
From: Sebastian Schelter [mailto:[email protected]]
Sent: Friday, October 29, 2010 3:45 PM
To: [email protected]
Subject: Re: RowSimilarityjob

+user -dev

The input files need to be SequenceFile<IntWritable,VectorWritable>.

RowSimilarityJob is intended to become a method invokable on DistributedRowMatrix as soon as that is ported to the new hadoop api.

--sebastian

On 29.10.2010 08:33, Divya wrote:
> Hi,
>
> What will be the input to RowSimilarityJob?
>
> When I passed the tfidf-vectors files as the input parameter, I got the following error:
>
> Oct 29, 2010 2:21:35 PM org.apache.hadoop.mapred.LocalJobRunner$Job run
> WARNING: job_local_0001
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>     at org.apache.mahout.math.hadoop.similarity.RowSimilarityJob$RowWeightMapper.map(RowSimilarityJob.java:1)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> Oct 29, 2010 2:21:36 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> INFO: Job complete: job_local_0001
> Oct 29, 2010 2:21:36 PM org.apache.hadoop.mapred.Counters log
> INFO: Counters: 0
> Oct 29, 2010 2:21:36 PM org.apache.hadoop.metrics.jvm.JvmMetrics init
> INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> Oct 29, 2010 2:21:36 PM org.apache.hadoop.mapred.JobClient configureCommandLineOptions
> WARNING: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> Oct 29, 2010 2:21:36 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
> INFO: Total input paths to process : 0
> Oct 29, 2010 2:21:36 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> INFO: Running job: job_local_0002
> Oct 29, 2010 2:21:36 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
> INFO: Total input paths to process : 0
> Oct 29, 2010 2:21:36 PM org.apache.hadoop.mapred.LocalJobRunner$Job run
> WARNING: job_local_0002
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>     at java.util.ArrayList.get(ArrayList.java:322)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
> Oct 29, 2010 2:21:37 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> INFO: map 0% reduce 0%
> Oct 29, 2010 2:21:37 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> INFO: Job complete: job_local_0002
> Oct 29, 2010 2:21:37 PM org.apache.hadoop.mapred.Counters log
> INFO: Counters: 0
> Oct 29, 2010 2:21:37 PM org.apache.hadoop.metrics.jvm.JvmMetrics init
> INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> Oct 29, 2010 2:21:38 PM org.apache.hadoop.mapred.JobClient configureCommandLineOptions
> WARNING: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: temp/pairwiseSimilarity
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
>     at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
>     at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>     at org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.run(RowSimilarityJob.java:174)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>     at org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.main(RowSimilarityJob.java:86)
>
> It's creating the temp/weights directory, but it is empty, and it's not creating pairwiseSimilarity at all, so I can figure out the other part of the error.
>
> But why java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable? I am unable to find out :(
>
> I am wondering whether my input is correct or not.
>
> Regards,
> Divya
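
The thread does not say how the Text keys should become IntWritable keys. One possible approach, sketched here under assumptions, is to give each row a sequential integer id and keep a dictionary from id back to the original document key, so that similar rows can later be traced to documents. In the sketch, plain Java collections stand in for the SequenceFile reader and writer (String for Text, Integer for IntWritable, double[] for VectorWritable). The names RekeyRows, Rekeyed and rekey are hypothetical and are not part of Mahout or Hadoop.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RekeyRows {

    /** Rows indexed by int id, plus the id -> original key dictionary. */
    public static final class Rekeyed {
        public final Map<Integer, double[]> rows = new LinkedHashMap<>();
        public final List<String> idToKey = new ArrayList<>();
    }

    /**
     * Assigns ids 0, 1, 2, ... in iteration order.
     * Assumption: sequential ids are acceptable as row keys; the thread
     * only states that the keys must be IntWritable.
     */
    public static Rekeyed rekey(Map<String, double[]> textKeyed) {
        Rekeyed out = new Rekeyed();
        for (Map.Entry<String, double[]> e : textKeyed.entrySet()) {
            int id = out.idToKey.size();     // next unused row id
            out.idToKey.add(e.getKey());     // remember which document this was
            out.rows.put(id, e.getValue());  // same vector, new int key
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up document keys and vectors, for illustration only.
        Map<String, double[]> docs = new LinkedHashMap<>();
        docs.put("/docs/a.txt", new double[] {0.5, 0.0, 1.2});
        docs.put("/docs/b.txt", new double[] {0.0, 0.7, 0.3});

        Rekeyed r = rekey(docs);
        for (Integer id : r.rows.keySet()) {
            System.out.println(id + " <- " + r.idToKey.get(id));
        }
    }
}
```

In a real conversion the loop body would read each Text/VectorWritable pair from the existing file and write an IntWritable/VectorWritable pair to a new one, storing the dictionary alongside it so row ids can be mapped back to documents.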
