Hi Kris,

I'm glad I could help you, and it's really cool that you are testing my
patches on real data. I'm looking forward to hearing more!
-sebastian

Am 29.06.2010 11:25, schrieb Kris Jack:
> Hi Sebastian,
>
> You really are very kind! I have taken your code and run it to print out
> the contents of the output file. There are indeed only 37,952 results, so
> that gives me more confidence in the vector dumper. I'm not sure why there
> was a memory problem, though, seeing as it seems to have output the
> results correctly. Now I just have to match them up with my original
> Lucene ids and see how it is performing. I'll keep you posted with the
> results.
>
> Thanks,
> Kris
>
> 2010/6/28 Sebastian Schelter <[email protected]>
>
>> Hi Kris,
>>
>> Unfortunately I'm not familiar with the VectorDumper code (and a quick
>> look didn't help either), so I can't help you with the OutOfMemoryError.
>>
>> It could be possible that only 37,952 results are found for an input of
>> 500,000 vectors; it really depends on the actual data. If you're sure
>> that there should be more results, you could provide me with a sample
>> input file and I'll try to find out why there aren't more results.
>>
>> I wrote a small class for you that dumps the output file of the job to
>> the console (I tested it with the output of my unit tests); maybe that
>> can help us find the source of the problem.
>>
>> -sebastian
>>
>> public class MatrixReader extends AbstractJob {
>>
>>   public static void main(String[] args) throws Exception {
>>     ToolRunner.run(new MatrixReader(), args);
>>   }
>>
>>   @Override
>>   public int run(String[] args) throws Exception {
>>
>>     addInputOption();
>>
>>     Map<String,String> parsedArgs = parseArguments(args);
>>     if (parsedArgs == null) {
>>       return -1;
>>     }
>>
>>     Configuration conf = getConf();
>>     FileSystem fs = FileSystem.get(conf);
>>
>>     Path vectorFile = fs.listStatus(getInputPath(),
>>         TasteHadoopUtils.PARTS_FILTER)[0].getPath();
>>
>>     SequenceFile.Reader reader = null;
>>     try {
>>       reader = new SequenceFile.Reader(fs, vectorFile, conf);
>>       IntWritable key = new IntWritable();
>>       VectorWritable value = new VectorWritable();
>>
>>       // print each row as "rowId: index,value;index,value;..."
>>       while (reader.next(key, value)) {
>>         int row = key.get();
>>         System.out.print(row + ": ");
>>         Iterator<Element> elementsIterator = value.get().iterateNonZero();
>>         String separator = "";
>>         while (elementsIterator.hasNext()) {
>>           Element element = elementsIterator.next();
>>           System.out.print(separator + element.index() + "," + element.get());
>>           separator = ";";
>>         }
>>         System.out.print("\n");
>>       }
>>     } finally {
>>       if (reader != null) {
>>         reader.close();  // only close if the reader was actually opened
>>       }
>>     }
>>     return 0;
>>   }
>> }
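A note on the id-matching step Kris mentions above: the rows in the job's
output are keyed by integer index, so getting back to Lucene ids means
joining against whatever mapping was kept when the input vectors were
created. Below is a minimal sketch, assuming (hypothetically) that the
mapping was written as a SequenceFile of IntWritable row ids to Text
Lucene ids; the class name and the path "docIndex/part-00000" are
placeholders, not part of Mahout.

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RowIdToLuceneId {

  public static Map<Integer,String> readIdMapping(Path mappingFile)
      throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Map<Integer,String> rowToLuceneId = new HashMap<Integer,String>();
    // stream through the (hypothetical) id mapping file once
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, mappingFile, conf);
    try {
      IntWritable row = new IntWritable();
      Text luceneId = new Text();
      while (reader.next(row, luceneId)) {
        rowToLuceneId.put(row.get(), luceneId.toString());
      }
    } finally {
      reader.close();
    }
    return rowToLuceneId;
  }

  public static void main(String[] args) throws Exception {
    // "docIndex/part-00000" is a placeholder path
    Map<Integer,String> mapping = readIdMapping(new Path("docIndex/part-00000"));
    System.out.println(mapping.size() + " row ids mapped");
  }
}

Looking up rowToLuceneId.get(key.get()) would then replace the raw row id
when printing each row in MatrixReader above.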
>>
>> Am 28.06.2010 17:18, schrieb Kris Jack:
>>> Hi,
>>>
>>> I am now using the version of
>>> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that Sebastian
>>> has written and has been added to the trunk. Thanks again for that! I
>>> can generate an output file that should contain a list of documents
>>> with their top 100 most similar documents. I am having problems,
>>> however, in converting the output file into a readable format using
>>> Mahout's vectordump:
>>>
>>> $ ./mahout vectordump --seqFile similarRows --output results.out --printKey
>>>
>>> no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
>>> Input Path: /home/kris/similarRows
>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>   at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:59)
>>>   at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
>>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
>>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
>>>   at org.apache.mahout.utils.vectors.SequenceFileVectorIterable$SeqFileIterator.hasNext(SequenceFileVectorIterable.java:77)
>>>   at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:138)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:174)
>>>
>>> What is this doing that takes up so much memory? A file is produced
>>> with 37,952 readable rows, but I'm expecting more like 500,000 results,
>>> since I have this number of documents. Should I be using something else
>>> to read the output file of the RowSimilarityJob?
>>>
>>> Thanks,
>>> Kris
>>>
>>> 2010/6/18 Sebastian Schelter <[email protected]>
>>>
>>>> Hi Kris,
>>>>
>>>> maybe you want to give the patch from
>>>> https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not
>>>> tested it with larger data yet, but I would be happy to get some
>>>> feedback for it and maybe it helps you with your use case.
>>>>
>>>> -sebastian
>>>>
>>>> Am 18.06.2010 18:46, schrieb Kris Jack:
>>>>> Thanks Ted,
>>>>>
>>>>> I got that working. Unfortunately, the matrix multiplication job is
>>>>> taking far longer than I hoped. With just over 10 million documents,
>>>>> 10 mappers and 10 reducers, I can't get it to complete the job in
>>>>> under 48 hours.
>>>>>
>>>>> Perhaps you have an idea for speeding it up? I have already been
>>>>> quite ruthless with making the vectors sparse. I did not include
>>>>> terms that appeared in over 1% of the corpus and only kept terms that
>>>>> appeared at least 50 times. Is it normal that the matrix
>>>>> multiplication map reduce task should take so long to process with
>>>>> this quantity of data and resources available, or do you think that
>>>>> my system is not configured properly?
>>>>>
>>>>> Thanks,
>>>>> Kris
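Why the 1% document-frequency cut Kris describes matters so much: in a
pairwise similarity (or A-transpose-times-A) job, a term occurring in df
documents contributes on the order of df * (df - 1) / 2 cooccurring
document pairs, so the cost is dominated by the most frequent surviving
terms. A back-of-the-envelope sketch using the numbers from the thread
(plain arithmetic, not Mahout code):

public class CooccurrenceCost {

  public static void main(String[] args) {
    long numDocs = 10000000L;     // just over 10 million documents
    double maxDfFraction = 0.01;  // terms in over 1% of the corpus are dropped

    // a term at the df cap still pairs up roughly df^2 / 2 documents,
    // so even after pruning, a single frequent term is very expensive
    long dfCap = (long) (numDocs * maxDfFraction);  // 100,000 documents
    long pairsAtCap = dfCap * (dfCap - 1L) / 2L;    // ~5 * 10^9 pairs

    System.out.println("df cap: " + dfCap);
    System.out.println("pairs from one term at the cap: " + pairsAtCap);
  }
}

Lowering the cap further directly shrinks this quadratic factor, which is
usually a bigger lever than adding mappers or reducers.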
>>>>>
>>>>> 2010/6/15 Ted Dunning <[email protected]>
>>>>>
>>>>>> Thresholds are generally dangerous. It is usually preferable to
>>>>>> specify the sparseness you want (1%, 0.2%, whatever), sort the
>>>>>> results in descending score order using Hadoop's built-in
>>>>>> capabilities, and just drop the rest.
>>>>>>
>>>>>> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <[email protected]> wrote:
>>>>>>
>>>>>>> I was wondering if there was an interesting way to do this with the
>>>>>>> current Mahout code, such as requesting that the Vector accumulator
>>>>>>> returns only elements that have values greater than a given
>>>>>>> threshold, sorting the vector by value rather than key, or
>>>>>>> something else?
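To make Ted's "sort in descending score order and just drop the rest"
concrete: below is a minimal sketch of keeping a fixed number of
top-scoring neighbours per document with a bounded min-heap, instead of
thresholding on score. SimilarDoc, topN and the cutoff of 2 in main are
illustrative only, not part of the Mahout API.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopNSelector {

  static class SimilarDoc {
    final int docId;
    final double score;
    SimilarDoc(int docId, double score) {
      this.docId = docId;
      this.score = score;
    }
  }

  static final Comparator<SimilarDoc> BY_SCORE = new Comparator<SimilarDoc>() {
    public int compare(SimilarDoc a, SimilarDoc b) {
      return Double.compare(a.score, b.score);
    }
  };

  /** Keeps the n highest-scoring documents, returned in descending score order. */
  static List<SimilarDoc> topN(Iterable<SimilarDoc> candidates, int n) {
    // min-heap on score: the head is the weakest of the current top n,
    // so it is the one to evict when a better candidate arrives
    PriorityQueue<SimilarDoc> heap = new PriorityQueue<SimilarDoc>(n, BY_SCORE);
    for (SimilarDoc candidate : candidates) {
      if (heap.size() < n) {
        heap.offer(candidate);
      } else if (BY_SCORE.compare(candidate, heap.peek()) > 0) {
        heap.poll();
        heap.offer(candidate);
      }
    }
    List<SimilarDoc> result = new ArrayList<SimilarDoc>(heap);
    Collections.sort(result, Collections.reverseOrder(BY_SCORE));
    return result;
  }

  public static void main(String[] args) {
    List<SimilarDoc> candidates = new ArrayList<SimilarDoc>();
    candidates.add(new SimilarDoc(1, 0.9));
    candidates.add(new SimilarDoc(2, 0.4));
    candidates.add(new SimilarDoc(3, 0.7));
    for (SimilarDoc doc : topN(candidates, 2)) {  // keep the 2 best
      System.out.println(doc.docId + ": " + doc.score);
    }
  }
}

The min-heap bounds memory at n entries per row and costs O(log n) per
candidate, which is why a fixed top-n is safer than a score threshold
whose result size is unpredictable.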
