On Sep 16, 2009, at 9:48 AM, Thomas Rewig wrote:
Hello,
I built a "real-time ItemBasedRecommender" based on a user's history
and a (sparse) item-similarity matrix with Lucene. Some time ago Ted
Dunning recommended this approach to me on the Mahout mailing list
for creating an ItemBasedRecommender:
"It is actually very easy to do. The output of the recommendation
off-line process is generally a sparse matrix of item-item links.
Each line of this sparse matrix can be considered a document in
creating a Lucene index. You will have to use a correct analyzer and
a line by line document segmenter, but that is trivial. Then
recommendation is a simple query step."
For 100,000 items this works fine - but for 1 million items the
indexing fails, and I have no idea how to avoid this. Maybe you can
give me a hint.
First I create an item-item similarity matrix with Mahout's Taste,
and in a second step I index it. The matrix is sparse because only
item-item relations with a high correlation are saved.
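Roughly, that first step looks like this (a simplified sketch against
Taste's newer API with numeric item IDs; "ratings.csv", the 0.7
threshold and saveRelation() are just placeholders):

import java.io.File;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

DataModel model = new FileDataModel(new File("ratings.csv")); // placeholder preference data
ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);

// Naive O(n^2) pair loop, just to illustrate; in practice the candidate
// pairs are restricted before any similarity is computed.
LongPrimitiveIterator outer = model.getItemIDs();
while (outer.hasNext()) {
    long itemA = outer.nextLong();
    LongPrimitiveIterator inner = model.getItemIDs();
    while (inner.hasNext()) {
        long itemB = inner.nextLong();
        if (itemB <= itemA) {
            continue; // visit each pair only once
        }
        double sim = similarity.itemSimilarity(itemA, itemB);
        if (!Double.isNaN(sim) && sim > 0.7) {
            saveRelation(itemA, itemB, sim); // keep only the high correlations
        }
    }
}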
Here is the code snippet for this indexing:
CachedRowSetImpl rowSetMainItemList = null; // mapping of items
ArrayList<String> listBelongingItems = null; // the highest-correlating items for a main item
Document aDocument = null;
Field aField = null;
Field aField1 = null;

Analyzer aAnalyzer = new StandardAnalyzer();
IndexWriter aWriter = new IndexWriter(this.indexDirectory, aAnalyzer, true,
        IndexWriter.MaxFieldLength.UNLIMITED);
aWriter.setRAMBufferSizeMB(48);

rowSetMainItemList = getRowSetItemList(); // get all items
aField1 = new Field("Item1", "", Field.Store.YES, Field.Index.ANALYZED); // reuse this field

while (rowSetMainItemList.next()) {
    aDocument = new Document();
    aField1.setValue(rowSetMainItemList.getString(1));
    aDocument.add(aField1);

    // get the most similar items for this item
    listBelongingItems = getRowSetBelongingItems(rowSetMainItemList.getString(1));
    Iterator<String> itrBelongingItems = listBelongingItems.iterator();
    while (itrBelongingItems.hasNext()) {
        String strBelongingItem = itrBelongingItems.next();
        // no reuse of the Field possible because of the differing field names:
        aField = new Field(strBelongingItem, "1",
                Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
        aDocument.add(aField);
    }
    aWriter.addDocument(aDocument);
}
aWriter.optimize();
aWriter.close();
aAnalyzer.close();
Actually, the field of each belonging item would have to be boosted
with the main-item/belonging-item correlation value to get accurate
recommendations, but then the index would be about 80 GB for
6 million items... without the boosts it is only about 2 GB.
Under the condition that only relevant correlations are saved in the
similarity matrix, though, the recommendation quality should be good
enough.
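The boost itself would be set per field, roughly like this (sketch;
correlationValue stands for the entry from the similarity matrix).
Note that index-time boosts are folded into the norms, so this needs
Field.Index.ANALYZED instead of ANALYZED_NO_NORMS - which is exactly
what makes the index so much larger:

// inside the inner loop above, instead of the unboosted field:
aField = new Field(strBelongingItem, "1", Field.Store.NO, Field.Index.ANALYZED);
aField.setBoost((float) correlationValue); // NO_NORMS would silently drop this boost
aDocument.add(aField);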
The item recommendation for a user is then a simple BooleanQuery made
of TermQuerys boosted by the user's history. I search for the
documents with the largest overlap with the user history, i.e. the
documents in which the most fields named after a belonging item are
set (with value "1"), and recommend the "key" value that was stored
in aField1 ("Item1").
Anyway, as I mentioned, this works for 100,000 items. But with
1 million items the indexing crashes after a while with:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.resize(HashMap.java:462)
at java.util.HashMap.addEntry(HashMap.java:755)
at java.util.HashMap.put(HashMap.java:385)
at java.util.HashSet.add(HashSet.java:200)
at org.apache.lucene.index.DocInverter.flush(DocInverter.java:66)
at org.apache.lucene.index.DocFieldConsumers.flush(DocFieldConsumers.java:75)
at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:60)
at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:574)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3540)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3450)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1937)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1895)
If I increase the Java heap space, I get an "OutOfMemoryError:
PermGen space" exception instead. If I increase the PermGen space
with -XX:MaxPermSize=1024m, the Java heap space becomes the limiting
factor again.
I can increase both to the maximum of my system - 20 GB of RAM are
available - but this doesn't solve the problem. During indexing the
memory consumption grows steadily until the process crashes. It does
not matter whether I index the data in segments, opening and closing
the IndexWriter each time, or whether I optimize the index
periodically - the memory consumption keeps growing ...
I think the problem is that I can't reuse the field aField in this
approach, and it seems the GC doesn't collect the old instances.
Extrapolated, that's 600 million unique fields...
I'm using Lucene 2.4.1 and Java version "1.6.0_16".
Does anyone have an idea how to avoid the growing memory consumption?
Or does somebody know another approach for a "real-time item-based
recommender" with Lucene?
You might want to ask on mahout-user, but I'm guessing Ted didn't mean
a new field for every item-item pair, but instead to represent the
related items as tokens and then create the corresponding queries
(payloads may be useful here, or function queries). That, to me, is
the only way you would achieve the sparseness savings you are after.
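E.g., very roughly (untested; the "item"/"related" field names and
the surrounding variables are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// One document per item; ALL related items go into a single field as
// tokens, so the number of distinct fields stays constant no matter
// how many items there are.
Document doc = new Document();
doc.add(new Field("item", mainItemId, Field.Store.YES, Field.Index.NOT_ANALYZED));

StringBuilder related = new StringBuilder();
for (String rel : belongingItems) {
    related.append(rel).append(' ');
}
// index with WhitespaceAnalyzer so the IDs survive tokenization unchanged
doc.add(new Field("related", related.toString(), Field.Store.NO, Field.Index.ANALYZED));

// Query side: one TermQuery per history item, all against that single field.
BooleanQuery q = new BooleanQuery();
for (String historyItem : userHistory) {
    q.add(new TermQuery(new Term("related", historyItem)), BooleanClause.Occur.SHOULD);
}

The per-pair correlation values could then go into payloads on the
tokens and be scored with BoostingTermQuery instead of plain
TermQuery.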
-Grant
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search