Hi Sebastian,

I am currently using your code with NamedVectors in my input. In the output, however, the names seem to be missing. Would there be a way to include them?

Thanks,
Kris
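(Archive note: a quick way to check whether the vectors in the job output still carry their names is to test for NamedVector while reading the sequence file. A minimal diagnostic sketch, assuming the usual IntWritable/VectorWritable output format; the class name and path argument are examples, and whether RowSimilarityJob preserves the names is exactly the open question here:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class NameCheck {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // e.g. one part file from the job's output directory
    Path path = new Path(args[0]);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      IntWritable key = new IntWritable();
      VectorWritable value = new VectorWritable();
      while (reader.next(key, value)) {
        Vector v = value.get();
        // a NamedVector keeps its name through (de)serialization
        String name = (v instanceof NamedVector)
            ? ((NamedVector) v).getName() : "<unnamed>";
        System.out.println(key.get() + " -> " + name);
      }
    } finally {
      reader.close();
    }
  }
}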
2010/6/29 Sebastian Schelter <[email protected]>

Hi Kris,

I'm glad I could help you and it's really cool that you are testing my patches on real data. I'm looking forward to hearing more!

-sebastian

On 29.06.2010 11:25, Kris Jack wrote:

Hi Sebastian,

You really are very kind! I have taken your code and run it to print out the contents of the output file. There are indeed only 37,952 results, so that gives me more confidence in the vector dumper. I'm not sure why there was a memory problem, though, seeing as it seems to have output the results correctly. Now I just have to match them up with my original Lucene ids and see how it is performing. I'll keep you posted with the results.

Thanks,
Kris

2010/6/28 Sebastian Schelter <[email protected]>

Hi Kris,

Unfortunately I'm not familiar with the VectorDumper code (and a quick look didn't help either), so I can't help you with the OutOfMemoryError.

It could be possible that only 37,952 results are found for an input of 500,000 vectors; it really depends on the actual data. If you're sure that there should be more results, you could provide me with a sample input file and I'll try to find out why there aren't more results.

I wrote a small class for you that dumps the output file of the job to the console (I tested it with the output of my unit tests); maybe that can help us find the source of the problem.

-sebastian

import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.TasteHadoopUtils;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.math.VectorWritable;

public class MatrixReader extends AbstractJob {

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new MatrixReader(), args);
  }

  @Override
  public int run(String[] args) throws Exception {

    addInputOption();

    Map<String,String> parsedArgs = parseArguments(args);
    if (parsedArgs == null) {
      return -1;
    }

    Configuration conf = getConf();
    FileSystem fs = FileSystem.get(conf);

    // take the first part file of the job output
    Path vectorFile = fs.listStatus(getInputPath(),
        TasteHadoopUtils.PARTS_FILTER)[0].getPath();

    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, vectorFile, conf);
      IntWritable key = new IntWritable();
      VectorWritable value = new VectorWritable();

      // print each row as "row: index,value;index,value;..."
      while (reader.next(key, value)) {
        int row = key.get();
        System.out.print(row + ": ");
        Iterator<Element> elementsIterator = value.get().iterateNonZero();
        String separator = "";
        while (elementsIterator.hasNext()) {
          Element element = elementsIterator.next();
          System.out.print(separator + element.index() + "," + element.get());
          separator = ";";
        }
        System.out.print("\n");
      }
    } finally {
      if (reader != null) {
        reader.close();
      }
    }
    return 0;
  }
}
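(Archive note: since MatrixReader extends AbstractJob and calls addInputOption(), it accepts the standard --input flag. Once the class is compiled onto the classpath, an invocation might look roughly like the line below; the jar name is hypothetical and the path is just an example:)

$ hadoop jar my-mahout-tools.jar MatrixReader --input /home/kris/similarRows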
On 28.06.2010 17:18, Kris Jack wrote:

Hi,

I am now using the version of org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that Sebastian has written and that has been added to the trunk. Thanks again for that! I can generate an output file that should contain a list of documents with their top 100 most similar documents. I am having problems, however, in converting the output file into a readable format using mahout's vectordump:

$ ./mahout vectordump --seqFile similarRows --output results.out --printKey
no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
Input Path: /home/kris/similarRows
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:59)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
    at org.apache.mahout.utils.vectors.SequenceFileVectorIterable$SeqFileIterator.hasNext(SequenceFileVectorIterable.java:77)
    at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:138)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:174)

What is this doing that takes up so much memory? A file is produced with 37,952 readable rows, but I'm expecting more like 500,000 results, since I have that number of documents. Should I be using something else to read the output file of the RowSimilarityJob?

Thanks,
Kris

2010/6/18 Sebastian Schelter <[email protected]>

Hi Kris,

maybe you want to give the patch from https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not tested it with larger data yet, but I would be happy to get some feedback on it, and maybe it helps you with your use case.

-sebastian

On 18.06.2010 18:46, Kris Jack wrote:

Thanks Ted,

I got that working. Unfortunately, the matrix multiplication job is taking far longer than I hoped. With just over 10 million documents, 10 mappers and 10 reducers, I can't get it to complete the job in under 48 hours.

Perhaps you have an idea for speeding it up? I have already been quite ruthless with making the vectors sparse. I did not include terms that appeared in over 1% of the corpus and only kept terms that appeared at least 50 times. Is it normal that the matrix multiplication map-reduce task should take so long with this quantity of data and these resources, or do you think that my system is not configured properly?

Thanks,
Kris
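(Archive note: the pruning Kris describes, dropping terms that appear in more than 1% of the corpus or fewer than 50 times, boils down to a frequency predicate applied while the vectors are built. A minimal sketch that just restates his stated settings; the class and method are illustrative, not Mahout API, and "appeared at least 50 times" is read here as a document-frequency count:)

final class TermPruning {

  /** Keep a term only if it is frequent enough to matter but not so common that it carries little signal. */
  static boolean keepTerm(int documentFrequency, int numDocuments) {
    return documentFrequency >= 50                  // seen at least 50 times
        && documentFrequency <= numDocuments / 100; // in at most 1% of documents
  }
}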
2010/6/15 Ted Dunning <[email protected]>

Thresholds are generally dangerous. It is usually preferable to specify the sparseness you want (1%, 0.2%, whatever), sort the results in descending score order using Hadoop's built-in capabilities, and just drop the rest.

On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <[email protected]> wrote:

I was wondering if there was an interesting way to do this with the current mahout code, such as requesting that the Vector accumulator return only elements that have values greater than a given threshold, sorting the vector by value rather than key, or something else?

--
Dr Kris Jack
http://www.mendeley.com/profiles/kris-jack/
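(Archive note: a sketch of the "sort descending and drop the rest" idea Ted describes above: keep a fixed number of top-scoring elements per vector instead of applying a value threshold. The helper class is hypothetical, not Mahout API; it assumes a Mahout Vector and a small fixed n:)

import java.util.Comparator;
import java.util.Iterator;
import java.util.PriorityQueue;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.Vector.Element;

public class TopNElements {

  /** Returns a sparse copy of v that keeps only its n highest-valued non-zero elements. */
  static Vector topN(Vector v, int n) {
    // min-heap of size n: the smallest kept element sits on top and is evicted first
    PriorityQueue<double[]> heap = new PriorityQueue<double[]>(n,
        new Comparator<double[]>() {
          public int compare(double[] a, double[] b) {
            return Double.compare(a[1], b[1]);
          }
        });
    Iterator<Element> elements = v.iterateNonZero();
    while (elements.hasNext()) {
      Element e = elements.next();
      if (heap.size() < n) {
        heap.add(new double[] { e.index(), e.get() });
      } else if (e.get() > heap.peek()[1]) {
        heap.poll();
        heap.add(new double[] { e.index(), e.get() });
      }
    }
    Vector result = new RandomAccessSparseVector(v.size(), n);
    for (double[] entry : heap) {
      result.set((int) entry[0], entry[1]);
    }
    return result;
  }
}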
