Problems with ItemBasedRecommender with Lucene

Thomas Rewig Wed, 16 Sep 2009 06:49:07 -0700

Hello,

I built a "real-time ItemBasedRecommender" with Lucene, based on a user's history and a (sparse) item similarity matrix. Some time ago Ted Dunning recommended this approach to me on the Mahout mailing list for creating an ItemBasedRecommender:

"It is actually very easy to do. The output of the recommendation off-line process is generally a sparse matrix of item-item links. Each line of this sparse matrix can be considered a document in creating a Lucene index. You will have to use a correct analyzer and a line by line document segmenter, but that is trivial. Then recommendation is a simple query step."
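
A minimal plain-Java sketch of the quoted idea, without Lucene: each row of the item-item matrix is treated as one "document", and recommending means ranking the rows by how many of the user's history items they link to. The class name RowQuerySketch, the per-item weights and the sample data are illustrative assumptions; Lucene's real scoring applies its own similarity factors on top of this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; no Lucene classes are used.
class RowQuerySketch {

    // rows:    main item -> items linked to it (one matrix row = one "document").
    // history: item the user interacted with -> weight (an assumed boost).
    // Returns up to topN main items, best match first.
    static List<String> recommend(Map<String, Set<String>> rows,
                                  Map<String, Double> history, int topN) {
        final Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> row : rows.entrySet()) {
            double score = 0.0;
            for (Map.Entry<String, Double> h : history.entrySet()) {
                if (row.getValue().contains(h.getKey())) {
                    score += h.getValue(); // each matching link adds its weight
                }
            }
            if (score > 0.0) {
                scores.put(row.getKey(), score);
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        // Highest score first; ties broken alphabetically for a stable order.
        ranked.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(topN, ranked.size())));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> rows = new LinkedHashMap<>();
        rows.put("X", new HashSet<>(Arrays.asList("A", "B")));
        rows.put("Y", new HashSet<>(Arrays.asList("B", "C")));
        rows.put("Z", new HashSet<>(Arrays.asList("D")));

        Map<String, Double> history = new LinkedHashMap<>();
        history.put("A", 2.0);
        history.put("B", 1.0);

        System.out.println(recommend(rows, history, 2));
    }
}
```

With this sample data, row X matches both history items and row Y only one, so X ranks ahead of Y, while Z does not match at all and is left out.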

For 100,000 items this works fine, but for 1 million items the indexing fails and I have no idea how to avoid this. Maybe you can give me a hint.

First I create an item-item similarity matrix with Mahout's Taste, and in the second step I index it. The matrix is sparse because only item-item relations with a high correlation are saved.
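
A rough illustration of this thresholding step, as a plain-Java sketch rather than the Taste code actually used: only pairs whose similarity reaches a cut-off are kept, so each item ends up with a short row of strongly related items. The class name SparseSimilaritySketch, the threshold of 0.5 and the sample values are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: keeps only strongly correlated item pairs.
class SparseSimilaritySketch {

    // For each item, returns the other items whose similarity is >= threshold,
    // together with the similarity value. Weak pairs are simply not stored.
    static Map<String, Map<String, Double>> sparsify(
            String[] items, double[][] similarity, double threshold) {
        Map<String, Map<String, Double>> rows = new LinkedHashMap<>();
        for (int i = 0; i < items.length; i++) {
            Map<String, Double> row = new LinkedHashMap<>();
            for (int j = 0; j < items.length; j++) {
                if (i != j && similarity[i][j] >= threshold) {
                    row.put(items[j], similarity[i][j]);
                }
            }
            rows.put(items[i], row);
        }
        return rows;
    }

    public static void main(String[] args) {
        String[] items = {"A", "B", "C"};
        double[][] sim = {
            {1.0, 0.9, 0.1},
            {0.9, 1.0, 0.7},
            {0.1, 0.7, 1.0}
        };
        // The weak A-C pair (0.1) is dropped; the other pairs are kept.
        System.out.println(sparsify(items, sim, 0.5));
    }
}
```

Each resulting row (a main item plus its retained neighbours) corresponds to one document in the indexing step.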


Here are the code snippets for this indexing:

       CachedRowSetImpl rowSetMainItemList = null;   // Mapping of Items
       ArrayList<String> listBelongingItems = null;  // Belonging and highest correlating Items for a MainItem

       Document aDocument = null;
       Field aField = null;
       Field aField1 = null;

       Analyzer aAnalyzer = new StandardAnalyzer();
       IndexWriter aWriter = new IndexWriter(this.indexDirectory, aAnalyzer, true,
               IndexWriter.MaxFieldLength.UNLIMITED);
       aWriter.setRAMBufferSizeMB(48);
       rowSetMainItemList = getRowSetItemList(); // get all Items
       aField1 = new Field("Item1", "", Field.Store.YES, Field.Index.ANALYZED); // reuse this field
       while (rowSetMainItemList.next()) {
           aDocument = new Document();
           aField1.setValue(rowSetMainItemList.getString(1));
           aDocument.add(aField1);
           // get the most similar Items for an Item
           listBelongingItems = getRowSetBelongingItems(rowSetMainItemList.getString(1));
           Iterator<String> itrBelongingItems = listBelongingItems.iterator();
           while (itrBelongingItems.hasNext()) {
               String strBelongingItem = (String) itrBelongingItems.next();
               // No reuse of Field possible because of different field names:
               aField = new Field(strBelongingItem, "1", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
               aDocument.add(aField);
           }
           aWriter.addDocument(aDocument);
       }
       aWriter.optimize();
       aWriter.close();

       aAnalyzer.close();

Actually, the field for each BelongingItem would have to be boosted with the MainItem-BelongingItem correlation value to get accurate recommendations, but then the index would be about 80 GByte for 6 million items; without boosts it is only about 2 GByte. As long as only relevant correlations are saved in the similarity matrix, the recommendation quality will be good enough.

The item recommendation for a user is a simple BooleanQuery of TermQuerys boosted by the user history. Here I search for the documents that correspond most closely to the user history: I look at which documents have the most fields with the name of a BelongingItem set (with value 1) and recommend the "key" value that was set in aField1("Item"...).

Anyway, as I mentioned, it worked for 100,000 items. But with 1 million items the indexing crashes after a while with:


Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
       at java.util.HashMap.resize(HashMap.java:462)
       at java.util.HashMap.addEntry(HashMap.java:755)
       at java.util.HashMap.put(HashMap.java:385)
       at java.util.HashSet.add(HashSet.java:200)
       at org.apache.lucene.index.DocInverter.flush(DocInverter.java:66)
       at org.apache.lucene.index.DocFieldConsumers.flush(DocFieldConsumers.java:75)
       at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:60)
       at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:574)
       at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3540)
       at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3450)
       at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1937)
       at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1895)

If I increase the Java heap space, I get an "OutOfMemoryError: PermGen space" exception instead. If I increase the PermGen space with -XX:MaxPermSize=1024m, the Java heap space is again the limiting factor. I can increase both to the maximum of my system (20 GByte of RAM are available), but this doesn't solve the problem.

During indexing the RAM consumption grows steadily until it crashes. It does not matter whether I index the data in segments, opening and closing the IndexWriter each time, or whether I optimize the index periodically: the RAM consumption keeps growing.

I think the problem is that I can't reuse the field aField in my approach, and it seems the GC doesn't collect it. Extrapolated, that's 600 million unique fields...


I'm using lucene 2.4.1 and java version "1.6.0_16".

Does anyone have an idea how to avoid the growing memory consumption? Or does somebody know another approach for a "real-time item-based recommender" with Lucene?


Regards

Thomas
