Hi Grant,

thanks for your answers - it seems to work with a 4GB heap, but it's fairly slow. I'd be interested in seeing whether we could make this process distributed; it runs as a standalone job right now and is therefore a bottleneck...
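For what it's worth, the kind of thing I have in mind is a simple map-only pass over the corpus. This is just a rough, untested sketch, nothing that exists in Mahout today; it assumes the raw text has already been pulled out of the index into lines of the form docId<TAB>text, and the class names (TfVectorSketch, TfMapper) are made up:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TfVectorSketch {

  // Map-only job: each input line is "docId<TAB>text"; the mapper tokenizes
  // the text and emits "term:count" pairs for that document, i.e. a crude TF
  // vector per document, so every mapper works on its own slice of the corpus.
  public static class TfMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length < 2) {
        return; // skip malformed lines
      }
      Map<String, Integer> tf = new HashMap<String, Integer>();
      for (String tok : parts[1].toLowerCase().split("\\W+")) {
        if (tok.length() == 0) {
          continue;
        }
        Integer c = tf.get(tok);
        tf.put(tok, c == null ? 1 : c + 1);
      }
      StringBuilder sb = new StringBuilder();
      for (Map.Entry<String, Integer> e : tf.entrySet()) {
        sb.append(e.getKey()).append(':').append(e.getValue()).append(' ');
      }
      ctx.write(new Text(parts[0]), new Text(sb.toString().trim()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "tf-vector-sketch");
    job.setJarByClass(TfVectorSketch.class);
    job.setMapperClass(TfMapper.class);
    job.setNumReduceTasks(0); // map-only; plain TF needs no global state
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

That obviously sidesteps the Lucene index and the dictionary entirely, so it may not map cleanly onto the current Driver - it's only meant to illustrate the shape of the job.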
Are there any attempts right now to implement it in an M/R fashion?

Thanks,
Florian

On Mon, Jul 20, 2009 at 5:49 PM, Grant Ingersoll <[email protected]> wrote:

> On Jul 20, 2009, at 2:40 PM, Florian Leibert wrote:
>
>> Hi,
>> I'm trying to create vectors with Mahout as explained in
>> http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text,
>> however I keep running out of heap. My heap is set to 2 GB already and I use
>> these parameters:
>> "java org.apache.mahout.utils.vectors.Driver --dir /LUCENE/ind --output
>> /user/florian/index-vectors-01 --field content --dictOut
>> /user/florian/index-dict-01 --weight TF".
>
> Hmm, 6GB isn't all that large, but the primary memory usage is going to be
> due to the CachedTermInfo, which loads all the terms into memory. This is
> an interface that can be implemented in other, slower, ways, but we'll have
> to change the Driver program to allow for that.
>
> How many unique terms do you have in the content field?
>
> You have java -Xmx2000M set as the heap size?
>
>> My index currently is about 6 GB large. Is there any way to compute the
>> vectors in a distributed manner?
>
> There will be, but there isn't yet, I suspect.
>
>> What's the largest index someone has created vectors from?
>
> It's pretty new code, I've only tested it on relatively small indexes (a few
> hundred MBs) so far, but the only gating issue memory-wise is the
> CachedTermInfo.
>
> Sorry I don't have better answers, but I am willing to help improve. I
> will try to use some bigger indexes soon.
>
> -Grant
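PS: regarding the unique-terms question, I plan to run something along these lines against the index to get the count - plain Lucene TermEnum iteration, untested, with the index path and field name taken from my Driver invocation above:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class CountContentTerms {
  public static void main(String[] args) throws Exception {
    // Same index path and field as in my Driver invocation.
    IndexReader reader = IndexReader.open("/LUCENE/ind");
    // Position the enum at the first term of the "content" field.
    TermEnum terms = reader.terms(new Term("content", ""));
    long count = 0;
    try {
      do {
        Term t = terms.term();
        if (t == null || !t.field().equals("content")) {
          break; // walked past the last term of the "content" field
        }
        count++;
      } while (terms.next());
    } finally {
      terms.close();
      reader.close();
    }
    System.out.println("unique terms in 'content': " + count);
  }
}

That number should also give a rough sense of how large the CachedTermInfo dictionary gets once all the terms are loaded into memory.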
