Hi Kris,

I'm glad I could help you, and it's really cool that you are testing my
patches on real data. I'm looking forward to hearing more!
-sebastian

Am 29.06.2010 11:25, schrieb Kris Jack:
> Hi Sebastian,
>
> You really are very kind! I have taken your code and run it to print out
> the contents of the output file. There are indeed only 37,952 results, so
> that gives me more confidence in the vector dumper. I'm not sure why there
> was a memory problem, though, seeing as it seems to have output the
> results correctly. Now I just have to match them up with my original
> Lucene ids and see how it is performing. I'll keep you posted with the
> results.
>
> Thanks,
> Kris
>
> 2010/6/28 Sebastian Schelter <[email protected]>
>
>> Hi Kris,
>>
>> Unfortunately I'm not familiar with the VectorDumper code (and a quick
>> look didn't help either), so I can't help you with the OutOfMemoryError.
>>
>> It could be possible that only 37,952 results are found for an input of
>> 500,000 vectors; it really depends on the actual data. If you're sure
>> that there should be more results, you could provide me with a sample
>> input file and I'll try to find out why there aren't more results.
>>
>> I wrote a small class for you that dumps the output file of the job to
>> the console (I tested it with the output of my unit tests); maybe that
>> can help us find the source of the problem.
>>
>> -sebastian
>>
>> public class MatrixReader extends AbstractJob {
>>
>>   public static void main(String[] args) throws Exception {
>>     ToolRunner.run(new MatrixReader(), args);
>>   }
>>
>>   @Override
>>   public int run(String[] args) throws Exception {
>>
>>     addInputOption();
>>
>>     Map<String,String> parsedArgs = parseArguments(args);
>>     if (parsedArgs == null) {
>>       return -1;
>>     }
>>
>>     Configuration conf = getConf();
>>     FileSystem fs = FileSystem.get(conf);
>>
>>     Path vectorFile = fs.listStatus(getInputPath(),
>>         TasteHadoopUtils.PARTS_FILTER)[0].getPath();
>>
>>     SequenceFile.Reader reader = null;
>>     try {
>>       reader = new SequenceFile.Reader(fs, vectorFile, conf);
>>       IntWritable key = new IntWritable();
>>       VectorWritable value = new VectorWritable();
>>
>>       // print each row as "rowId: index,value;index,value;..."
>>       while (reader.next(key, value)) {
>>         int row = key.get();
>>         System.out.print(row + ": ");
>>         Iterator<Element> elementsIterator = value.get().iterateNonZero();
>>         String separator = "";
>>         while (elementsIterator.hasNext()) {
>>           Element element = elementsIterator.next();
>>           System.out.print(separator + element.index() + "," + element.get());
>>           separator = ";";
>>         }
>>         System.out.print("\n");
>>       }
>>     } finally {
>>       if (reader != null) {
>>         reader.close();  // only close if the reader was actually opened
>>       }
>>     }
>>     return 0;
>>   }
>> }
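A note on the id-matching step Kris mentions above: the rows in the job's
output are keyed by integer index, so getting back to Lucene ids means
joining against whatever mapping was kept when the input vectors were
created. Below is a minimal sketch, assuming (hypothetically) that the
mapping was written as a SequenceFile of IntWritable row ids to Text
Lucene ids; the class name and the path "docIndex/part-00000" are
placeholders, not part of Mahout.

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RowIdToLuceneId {

  public static Map<Integer,String> readIdMapping(Path mappingFile)
      throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Map<Integer,String> rowToLuceneId = new HashMap<Integer,String>();
    // stream through the (hypothetical) id mapping file once
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, mappingFile, conf);
    try {
      IntWritable row = new IntWritable();
      Text luceneId = new Text();
      while (reader.next(row, luceneId)) {
        rowToLuceneId.put(row.get(), luceneId.toString());
      }
    } finally {
      reader.close();
    }
    return rowToLuceneId;
  }

  public static void main(String[] args) throws Exception {
    // "docIndex/part-00000" is a placeholder path
    Map<Integer,String> mapping = readIdMapping(new Path("docIndex/part-00000"));
    System.out.println(mapping.size() + " row ids mapped");
  }
}

Looking up rowToLuceneId.get(key.get()) would then replace the raw row id
when printing each row in MatrixReader above.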
>>
>> Am 28.06.2010 17:18, schrieb Kris Jack:
>>> Hi,
>>>
>>> I am now using the version of
>>> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that Sebastian
>>> has written and has been added to the trunk. Thanks again for that! I
>>> can generate an output file that should contain a list of documents
>>> with their top 100 most similar documents. I am having problems,
>>> however, in converting the output file into a readable format using
>>> Mahout's vectordump:
>>>
>>> $ ./mahout vectordump --seqFile similarRows --output results.out --printKey
>>>
>>> no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
>>> Input Path: /home/kris/similarRows
>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>   at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:59)
>>>   at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
>>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
>>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
>>>   at org.apache.mahout.utils.vectors.SequenceFileVectorIterable$SeqFileIterator.hasNext(SequenceFileVectorIterable.java:77)
>>>   at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:138)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:174)
>>>
>>> What is this doing that takes up so much memory? A file is produced
>>> with 37,952 readable rows, but I'm expecting more like 500,000 results,
>>> since I have this number of documents. Should I be using something else
>>> to read the output file of the RowSimilarityJob?
>>>
>>> Thanks,
>>> Kris
>>>
>>> 2010/6/18 Sebastian Schelter <[email protected]>
>>>
>>>> Hi Kris,
>>>>
>>>> maybe you want to give the patch from
>>>> https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not
>>>> tested it with larger data yet, but I would be happy to get some
>>>> feedback for it and maybe it helps you with your use case.
>>>>
>>>> -sebastian
>>>>
>>>> Am 18.06.2010 18:46, schrieb Kris Jack:
>>>>> Thanks Ted,
>>>>>
>>>>> I got that working. Unfortunately, the matrix multiplication job is
>>>>> taking far longer than I hoped. With just over 10 million documents,
>>>>> 10 mappers and 10 reducers, I can't get it to complete the job in
>>>>> under 48 hours.
>>>>>
>>>>> Perhaps you have an idea for speeding it up? I have already been
>>>>> quite ruthless with making the vectors sparse. I did not include
>>>>> terms that appeared in over 1% of the corpus and only kept terms that
>>>>> appeared at least 50 times. Is it normal that the matrix
>>>>> multiplication map reduce task should take so long to process with
>>>>> this quantity of data and resources available, or do you think that
>>>>> my system is not configured properly?
>>>>>
>>>>> Thanks,
>>>>> Kris
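Why the 1% document-frequency cut Kris describes matters so much: in a
pairwise similarity (or A-transpose-times-A) job, a term occurring in df
documents contributes on the order of df * (df - 1) / 2 cooccurring
document pairs, so the cost is dominated by the most frequent surviving
terms. A back-of-the-envelope sketch using the numbers from the thread
(plain arithmetic, not Mahout code):

public class CooccurrenceCost {

  public static void main(String[] args) {
    long numDocs = 10000000L;     // just over 10 million documents
    double maxDfFraction = 0.01;  // terms in over 1% of the corpus are dropped

    // a term at the df cap still pairs up roughly df^2 / 2 documents,
    // so even after pruning, a single frequent term is very expensive
    long dfCap = (long) (numDocs * maxDfFraction);  // 100,000 documents
    long pairsAtCap = dfCap * (dfCap - 1L) / 2L;    // ~5 * 10^9 pairs

    System.out.println("df cap: " + dfCap);
    System.out.println("pairs from one term at the cap: " + pairsAtCap);
  }
}

Lowering the cap further directly shrinks this quadratic factor, which is
usually a bigger lever than adding mappers or reducers.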
>>>>>
>>>>> 2010/6/15 Ted Dunning <[email protected]>
>>>>>
>>>>>> Thresholds are generally dangerous. It is usually preferable to
>>>>>> specify the sparseness you want (1%, 0.2%, whatever), sort the
>>>>>> results in descending score order using Hadoop's built-in
>>>>>> capabilities, and just drop the rest.
>>>>>>
>>>>>> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <[email protected]> wrote:
>>>>>>
>>>>>>> I was wondering if there was an interesting way to do this with the
>>>>>>> current Mahout code, such as requesting that the Vector accumulator
>>>>>>> returns only elements that have values greater than a given
>>>>>>> threshold, sorting the vector by value rather than key, or
>>>>>>> something else?
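To make Ted's "sort in descending score order and just drop the rest"
concrete: below is a minimal sketch of keeping a fixed number of
top-scoring neighbours per document with a bounded min-heap, instead of
thresholding on score. SimilarDoc, topN and the cutoff of 2 in main are
illustrative only, not part of the Mahout API.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopNSelector {

  static class SimilarDoc {
    final int docId;
    final double score;
    SimilarDoc(int docId, double score) {
      this.docId = docId;
      this.score = score;
    }
  }

  static final Comparator<SimilarDoc> BY_SCORE = new Comparator<SimilarDoc>() {
    public int compare(SimilarDoc a, SimilarDoc b) {
      return Double.compare(a.score, b.score);
    }
  };

  /** Keeps the n highest-scoring documents, returned in descending score order. */
  static List<SimilarDoc> topN(Iterable<SimilarDoc> candidates, int n) {
    // min-heap on score: the head is the weakest of the current top n,
    // so it is the one to evict when a better candidate arrives
    PriorityQueue<SimilarDoc> heap = new PriorityQueue<SimilarDoc>(n, BY_SCORE);
    for (SimilarDoc candidate : candidates) {
      if (heap.size() < n) {
        heap.offer(candidate);
      } else if (BY_SCORE.compare(candidate, heap.peek()) > 0) {
        heap.poll();
        heap.offer(candidate);
      }
    }
    List<SimilarDoc> result = new ArrayList<SimilarDoc>(heap);
    Collections.sort(result, Collections.reverseOrder(BY_SCORE));
    return result;
  }

  public static void main(String[] args) {
    List<SimilarDoc> candidates = new ArrayList<SimilarDoc>();
    candidates.add(new SimilarDoc(1, 0.9));
    candidates.add(new SimilarDoc(2, 0.4));
    candidates.add(new SimilarDoc(3, 0.7));
    for (SimilarDoc doc : topN(candidates, 2)) {  // keep the 2 best
      System.out.println(doc.docId + ": " + doc.score);
    }
  }
}

The min-heap bounds memory at n entries per row and costs O(log n) per
candidate, which is why a fixed top-n is safer than a score threshold
whose result size is unpredictable.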
