Hi Sebastian,

I am currently using your code with NamedVectors in my input. In the output, however, the names seem to be missing. Would there be a way to include them?

Thanks,
Kris
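(Archive note: a quick way to check whether the vectors in the job output still carry their names is to test for NamedVector while reading the sequence file. A minimal diagnostic sketch, assuming the usual IntWritable/VectorWritable output format; the class name and path argument are examples, and whether RowSimilarityJob preserves the names is exactly the open question here:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class NameCheck {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // e.g. one part file from the job's output directory
    Path path = new Path(args[0]);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      IntWritable key = new IntWritable();
      VectorWritable value = new VectorWritable();
      while (reader.next(key, value)) {
        Vector v = value.get();
        // a NamedVector keeps its name through (de)serialization
        String name = (v instanceof NamedVector)
            ? ((NamedVector) v).getName() : "<unnamed>";
        System.out.println(key.get() + " -> " + name);
      }
    } finally {
      reader.close();
    }
  }
}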
2010/6/29 Sebastian Schelter <[email protected]>

Hi Kris,

I'm glad I could help you and it's really cool that you are testing my patches on real data. I'm looking forward to hearing more!

-sebastian

On 29.06.2010 11:25, Kris Jack wrote:

Hi Sebastian,

You really are very kind! I have taken your code and run it to print out the contents of the output file. There are indeed only 37,952 results, so that gives me more confidence in the vector dumper. I'm not sure why there was a memory problem, though, seeing as it seems to have output the results correctly. Now I just have to match them up with my original Lucene ids and see how it is performing. I'll keep you posted with the results.

Thanks,
Kris

2010/6/28 Sebastian Schelter <[email protected]>

Hi Kris,

Unfortunately I'm not familiar with the VectorDumper code (and a quick look didn't help either), so I can't help you with the OutOfMemoryError.

It could be possible that only 37,952 results are found for an input of 500,000 vectors; it really depends on the actual data. If you're sure that there should be more results, you could provide me with a sample input file and I'll try to find out why there aren't more results.

I wrote a small class for you that dumps the output file of the job to the console (I tested it with the output of my unit tests); maybe that can help us find the source of the problem.

-sebastian

import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.TasteHadoopUtils;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.math.VectorWritable;

public class MatrixReader extends AbstractJob {

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new MatrixReader(), args);
  }

  @Override
  public int run(String[] args) throws Exception {

    addInputOption();

    Map<String,String> parsedArgs = parseArguments(args);
    if (parsedArgs == null) {
      return -1;
    }

    Configuration conf = getConf();
    FileSystem fs = FileSystem.get(conf);

    // take the first part file of the job output
    Path vectorFile = fs.listStatus(getInputPath(),
        TasteHadoopUtils.PARTS_FILTER)[0].getPath();

    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, vectorFile, conf);
      IntWritable key = new IntWritable();
      VectorWritable value = new VectorWritable();

      // print each row as "row: index,value;index,value;..."
      while (reader.next(key, value)) {
        int row = key.get();
        System.out.print(row + ": ");
        Iterator<Element> elementsIterator = value.get().iterateNonZero();
        String separator = "";
        while (elementsIterator.hasNext()) {
          Element element = elementsIterator.next();
          System.out.print(separator + element.index() + "," + element.get());
          separator = ";";
        }
        System.out.print("\n");
      }
    } finally {
      if (reader != null) {
        reader.close();
      }
    }
    return 0;
  }
}
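(Archive note: since MatrixReader extends AbstractJob and calls addInputOption(), it accepts the standard --input flag. Once the class is compiled onto the classpath, an invocation might look roughly like the line below; the jar name is hypothetical and the path is just an example:)

$ hadoop jar my-mahout-tools.jar MatrixReader --input /home/kris/similarRows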
On 28.06.2010 17:18, Kris Jack wrote:

Hi,

I am now using the version of org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that Sebastian has written and that has been added to the trunk. Thanks again for that! I can generate an output file that should contain a list of documents with their top 100 most similar documents. I am having problems, however, in converting the output file into a readable format using mahout's vectordump:

$ ./mahout vectordump --seqFile similarRows --output results.out --printKey
no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
Input Path: /home/kris/similarRows
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:59)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
    at org.apache.mahout.utils.vectors.SequenceFileVectorIterable$SeqFileIterator.hasNext(SequenceFileVectorIterable.java:77)
    at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:138)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:174)

What is this doing that takes up so much memory? A file is produced with 37,952 readable rows, but I'm expecting more like 500,000 results, since I have that number of documents. Should I be using something else to read the output file of the RowSimilarityJob?

Thanks,
Kris

2010/6/18 Sebastian Schelter <[email protected]>

Hi Kris,

maybe you want to give the patch from https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not tested it with larger data yet, but I would be happy to get some feedback on it, and maybe it helps you with your use case.

-sebastian

On 18.06.2010 18:46, Kris Jack wrote:

Thanks Ted,

I got that working. Unfortunately, the matrix multiplication job is taking far longer than I hoped. With just over 10 million documents, 10 mappers and 10 reducers, I can't get it to complete the job in under 48 hours.

Perhaps you have an idea for speeding it up? I have already been quite ruthless with making the vectors sparse. I did not include terms that appeared in over 1% of the corpus and only kept terms that appeared at least 50 times. Is it normal that the matrix multiplication map-reduce task should take so long with this quantity of data and these resources, or do you think that my system is not configured properly?

Thanks,
Kris
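(Archive note: the pruning Kris describes, dropping terms that appear in more than 1% of the corpus or fewer than 50 times, boils down to a frequency predicate applied while the vectors are built. A minimal sketch that just restates his stated settings; the class and method are illustrative, not Mahout API, and "appeared at least 50 times" is read here as a document-frequency count:)

final class TermPruning {

  /** Keep a term only if it is frequent enough to matter but not so common that it carries little signal. */
  static boolean keepTerm(int documentFrequency, int numDocuments) {
    return documentFrequency >= 50                  // seen at least 50 times
        && documentFrequency <= numDocuments / 100; // in at most 1% of documents
  }
}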
2010/6/15 Ted Dunning <[email protected]>

Thresholds are generally dangerous. It is usually preferable to specify the sparseness you want (1%, 0.2%, whatever), sort the results in descending score order using Hadoop's built-in capabilities, and just drop the rest.

On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <[email protected]> wrote:

I was wondering if there was an interesting way to do this with the current mahout code, such as requesting that the Vector accumulator return only elements that have values greater than a given threshold, sorting the vector by value rather than key, or something else?

--
Dr Kris Jack
http://www.mendeley.com/profiles/kris-jack/
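(Archive note: a sketch of the "sort descending and drop the rest" idea Ted describes above: keep a fixed number of top-scoring elements per vector instead of applying a value threshold. The helper class is hypothetical, not Mahout API; it assumes a Mahout Vector and a small fixed n:)

import java.util.Comparator;
import java.util.Iterator;
import java.util.PriorityQueue;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.Vector.Element;

public class TopNElements {

  /** Returns a sparse copy of v that keeps only its n highest-valued non-zero elements. */
  static Vector topN(Vector v, int n) {
    // min-heap of size n: the smallest kept element sits on top and is evicted first
    PriorityQueue<double[]> heap = new PriorityQueue<double[]>(n,
        new Comparator<double[]>() {
          public int compare(double[] a, double[] b) {
            return Double.compare(a[1], b[1]);
          }
        });
    Iterator<Element> elements = v.iterateNonZero();
    while (elements.hasNext()) {
      Element e = elements.next();
      if (heap.size() < n) {
        heap.add(new double[] { e.index(), e.get() });
      } else if (e.get() > heap.peek()[1]) {
        heap.poll();
        heap.add(new double[] { e.index(), e.get() });
      }
    }
    Vector result = new RandomAccessSparseVector(v.size(), n);
    for (double[] entry : heap) {
      result.set((int) entry[0], entry[1]);
    }
    return result;
  }
}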
