Re: Problems with ItemBasedRecommender with Lucene
Oh, I overlooked the simplest way to do that. You're right, tokens are the key to this problem. It works pretty well. It would be perfect if I used payloads. I read your advice at http://www.lucidimagination.com/blog/category/payloads/. You store the payloads with your PayloadAnalyzer this way:

    //Store both position and offset information
    Field text = new Field("body", DOCS[i], Field.Store.NO, Field.Index.ANALYZED);

Is there a chance to use Field.Index.ANALYZED_NO_NORMS, because otherwise my index would be much too big, or are norms necessary for payloads? You use Lucene 2.9 -- is there a way to do this with Lucene 2.4.1? I can't find e.g. the "PayloadEncoder", or do I have to wait for the release?

Regards
Thomas

> You might want to ask on mahout-user, but I'm guessing Ted didn't mean a new
> field for every item-item pair, but instead to represent them as tokens and
> then create the corresponding queries (seems like payloads may be useful, or
> function queries). That to me is the only way you would achieve the sparseness
> savings you are after.
>
> -Grant
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene: http://www.lucidimagination.com/search

>> Hello,
>>
>> I built a "real time ItemBasedRecommender" based on a user's history and a
>> (sparse) item similarity matrix with Lucene. Some time ago Ted Dunning
>> recommended this approach to me on the mahout mailing list for creating an
>> ItemBasedRecommender:
>>
>> "It is actually very easy to do. The output of the recommendation off-line
>> process is generally a sparse matrix of item-item links. Each line of this
>> sparse matrix can be considered a document in creating a Lucene index. You
>> will have to use a correct analyzer and a line by line document segmenter,
>> but that is trivial. Then recommendation is a simple query step."
>>
>> So for 10 items it works fine -- but for 1 million items the indexing fails
>> and I have no idea how to avoid this. Maybe you can give me a hint. First I
>> create an item-item similarity matrix with Mahout's Taste, and in the second
>> step I index it. The matrix is sparse because only item-item relations with
>> a high correlation are saved.
>> Here are the code snippets for this indexing:
>>
>>     CachedRowSetImpl rowSetMainItemList = null; // Mapping of Items
>>     ArrayList listBelongingItems = null; // Belonging and highest correlating Items for a MainItem
>>     Document aDocument = null;
>>     Field aField = null;
>>     Field aField1 = null;
>>     Analyzer aAnalyzer = new StandardAnalyzer();
>>     IndexWriter aWriter = new IndexWriter(this.indexDirectory, aAnalyzer, true,
>>         IndexWriter.MaxFieldLength.UNLIMITED);
>>     aWriter.setRAMBufferSizeMB(48);
>>
>>     rowSetMainItemList = getRowSetItemList(); // get all Items
>>     aField1 = new Field("Item1", "", Field.Store.YES, Field.Index.ANALYZED); // reuse this field
>>     while (rowSetMainItemList.next()) {
>>         aDocument = new Document();
>>         aField1.setValue(rowSetMainItemList.getString(1));
>>         aDocument.add(aField1);
>>         // get the most similar Items for an Item
>>         listBelongingItems = getRowSetBelongingItems(rowSetMainItemList.getString(1));
>>         Iterator itrBelongingItems = listBelongingItems.iterator();
>>         while (itrBelongingItems.hasNext()) {
>>             String strBelongingItem = (String) itrBelongingItems.next();
>>             // No reuse of Field possible because of different field names:
>>             aField = new Field(strBelongingItem, "1", Field.Store.NO,
>>                 Field.Index.ANALYZED_NO_NORMS);
>>             aDocument.add(aField);
>>         }
>>         aWriter.addDocument(aDocument);
>>     }
>>     aWriter.optimize();
>>     aWriter.close();
>>     aAnalyzer.close();
>>
>> Actually the fields of the BelongingItems would have to be boosted with the
>> MainItem-BelongingItem correlation value to get accurate recommendations,
>> but then the index would be about 80 GByte for 6 million items... without it,
>> it will only be about 2 GByte. But under the condition that only relevant
>> correlations are saved in the similarity matrix, the recommendation quality
>> will be good enough.
>>
>> The item recommendation for a user is a simple BooleanQuery of TermQuerys
>> boosted by the user history. Here I search for the documents with the
>> largest correspondence to the user history: I look at which documents have
>> the most fields set (with value 1) whose names match a BelongingItem, and
>> recommend the "key" value that was set in aField1 ("Item1").
>>
>> Anyway, as I mentioned, it works for 10 items. But if there are 1 million
>> items, the indexing crashes after a while with:
>>
>>     Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>         at java.util.HashMap.resize(HashMap.
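The crash is most likely driven by the one-field-per-item layout: millions of distinct field names are expensive, and the token-based representation Grant suggested (and Thomas confirms worked) keeps the field count constant. For reference, a minimal sketch of that indexing, assuming Lucene 2.4-era APIs; the field names and the source of relatedIds are illustrative, with the ids expected to come from the similarity matrix:

    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // One document per main item: a stored "item" key plus a single "related"
    // field holding all related item ids as whitespace-separated tokens
    // (index with WhitespaceAnalyzer so each id becomes one token).
    void indexItem(IndexWriter writer, String itemId, List<String> relatedIds)
            throws Exception {
        Document doc = new Document();
        doc.add(new Field("item", itemId, Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        StringBuilder related = new StringBuilder();
        for (String id : relatedIds) {
            related.append(id).append(' ');
        }
        doc.add(new Field("related", related.toString(), Field.Store.NO,
                Field.Index.ANALYZED_NO_NORMS));
        writer.addDocument(doc);
    }

The recommendation query then becomes a BooleanQuery of TermQuerys against the single "related" field, one clause per item in the user's history, so the number of distinct field names stays constant no matter how many items there are.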
Re: Problems with ItemBasedRecommender with Lucene
On Sep 17, 2009, at 5:06 AM, Thomas Rewig wrote:

> Oh, I overlooked the simplest way to do that. You're right, tokens are the
> key to this problem. It works pretty well. It would be perfect if I used
> payloads. I read your advice at
> http://www.lucidimagination.com/blog/category/payloads/. You store the
> payloads with your PayloadAnalyzer this way:
>
>     //Store both position and offset information
>     Field text = new Field("body", DOCS[i], Field.Store.NO, Field.Index.ANALYZED);
>
> Is there a chance to use Field.Index.ANALYZED_NO_NORMS?

I don't see why not.

> Otherwise my index would be much too big. Or are norms necessary for
> payloads? You use Lucene 2.9 -- is there a way to do this with Lucene 2.4.1?
> I can't find e.g. the "PayloadEncoder", or do I have to wait for the release?

I'd bet that patch wouldn't be too hard to backport, since it lives in
contrib/analyzers. All it does anyway is give a generic notion to adding a
payload based on a data type. Payloads are in 2.4.1, and all they are is a
byte array, so it should be easy enough to write a simple TokenFilter that
does what you want.

> Regards
> Thomas
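Since payloads in 2.4.1 are just a byte array, the TokenFilter Grant mentions might look roughly like the sketch below. This is an assumption-laden illustration, not the contrib code: the "itemId|weight" token format, the '|' delimiter, and the class name are all invented here.

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Payload;

    // Splits tokens of the form "itemId|weight", keeps the id as the term
    // text, and stores the weight as a 4-byte float payload.
    public class WeightPayloadFilter extends TokenFilter {
        public WeightPayloadFilter(TokenStream input) { super(input); }

        public Token next(Token reusableToken) throws IOException {
            Token token = input.next(reusableToken);
            if (token == null) return null;
            String text = token.term();
            int sep = text.indexOf('|');
            if (sep >= 0) {
                int bits = Float.floatToIntBits(
                        Float.parseFloat(text.substring(sep + 1)));
                byte[] bytes = new byte[] {
                        (byte) (bits >>> 24), (byte) (bits >>> 16),
                        (byte) (bits >>> 8), (byte) bits };
                token.setPayload(new Payload(bytes));
                token.setTermBuffer(text.substring(0, sep));
            }
            return token;
        }
    }

At query time, BoostingTermQuery (in org.apache.lucene.search.payloads as of 2.4) together with a Similarity that overrides scorePayload can fold the stored weight back into the score.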
Re: Displaying search result data - stored fields vs external source
Hello,

I would also prefer to store the content in the index because, as Erick points out, this leads to a simpler design, but also because it allows me to preserve the relevance sort. If you store only the item id in the index, then when extracting all the other required data from, say, a database, you will probably execute a "select * from item where id in (id_1, id_2, id_3, ...)", which will probably not retain your relevance sort. So unless you sort by a business field or apply some kind of convoluted sort strategy which maps back to your original Lucene result set, you will have lost your ranking.

Cheers,
savvas

2009/9/15 Erick Erickson:

> Categorically I store everything in the index unless/until I *know* it
> doesn't work. With some things, it's easy to know from the outset, like if I
> have 20T of data to store.
>
> First, storing fields has minimal impact on the search speed; the stored
> text isn't interleaved with the search tokens, so they're pretty much
> disjoint.
>
> Second, any scheme storing data separately is inherently more complex
> and difficult to maintain. From the eXtreme Programming folks: "Do the
> simplest thing that could possibly work".
>
> Third, there isn't much work in trying it and seeing. I mean, you have to
> write the retrieval code, and if you encapsulate fetching the data you can
> switch it out later pretty easily if it comes to that. So you don't lose
> much at all by "just trying it".
>
> HTH
> Erick
>
> On Tue, Sep 15, 2009 at 4:19 AM, Joel Halbert wrote:
>
>> Hi,
>>
>> When using Lucene I always consider two approaches to displaying search
>> result data to users:
>>
>> 1. Store any fields that we index and display to users in the Lucene
>> documents themselves. When we perform a search, simply retrieve the data
>> to be displayed from the Lucene documents themselves.
>>
>> or
>>
>> 2. Index fields in Lucene but reference data to be displayed from
>> another source, such as a database. So, when searching, I would search
>> for documents and then use a (stored) reference key on the documents to
>> look up the display fields from another source, e.g. a database.
>>
>> With regards to the number and size of stored fields, I am looking at
>> indexing and displaying approximately 4 relatively small fields for each
>> document (e.g. name, age, short description, URL -- approx 500 bytes in
>> total). For any query, about 10 hits will be displayed to the user. There
>> are approximately 10 million documents to index and search.
>>
>> I am interested in the differences between the two approaches with regards
>> to:
>>
>> 1) Indexing time performance (how long it might take to index with and
>> without stored fields)
>> 2) Search time performance (total time taken to search for matching
>> documents and then display fields to users)
>>
>> I am less interested in differences arising from
>> maintainability/increased storage requirements.
>>
>> I would be interested to see what others think of each approach.
>>
>> Cheers,
>> Joel
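One practical detail in favor of the first approach: display data comes back in relevance order for free, because each stored document is fetched by its hit's doc id. A minimal sketch against the Lucene 2.4-era API; the field names are illustrative:

    // scoreDocs is already sorted by relevance; reading the stored fields
    // through searcher.doc() preserves that order.
    TopDocs topDocs = searcher.search(query, null, 10);
    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
        Document doc = searcher.doc(topDocs.scoreDocs[i].doc);
        System.out.println(doc.get("name") + " - " + doc.get("url"));
    }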
Re: Counting search results
Hello,

I have tried your method, but it doesn't work: set will be null after applying

    BitSet set = filter.bits(reader);

I haven't found any reason for this. Additionally, the bits method is deprecated and it is recommended to use getDocIdSet instead. But that set only provides an iterator; no random-access checks are possible. Are there any other possibilities to improve speed?

Mathias

On 15.09.2009 17:13, Simon Willnauer wrote:

> Hmm, so if you want to use the Filter to narrow down the search results,
> you could use it in the while loop like this:
>
>     BitSet set = filter.bits(reader);
>     int numDocs = 0;
>     TermDocs termDocs = reader.termDocs(new Term("myField", "myTerm"));
>     while (termDocs.next()) {
>         if (set.get(termDocs.doc()))
>             numDocs++;
>     }
>
> Would that help?
>
> simon
>
> On Tue, Sep 15, 2009 at 5:01 PM, Mathias Bank <mathias.b...@gmail.com> wrote:
>
>> Hello,
>>
>> This seems to be a similar solution to:
>>
>>     Term t = new Term(fieldname, term);
>>     int count = searcher.docFreq(t);
>>
>> The problem is that in this situation it is not possible to apply a
>> Filter object. If I don't want to use this filter object, I would have
>> to use a complex search query, which is -- again -- very slow. So,
>> unfortunately, your solution does not help.
>>
>> Mathias
>>
>> 2009/9/15 Simon Willnauer <simon.willna...@googlemail.com>:
>>
>>> Did you try:
>>>
>>>     int numDocs = 0;
>>>     TermDocs termDocs = reader.termDocs(new Term("myField", "myTerm"));
>>>     while (termDocs.next()) { numDocs++; }
>>>
>>> simon
>>>
>>> On Tue, Sep 15, 2009 at 2:19 PM, Mathias Bank <mathias.b...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm trying to find the number of documents for a specific term to
>>>> create text statistics. I'm not interested in ordering the results or
>>>> even retrieving the first result. I just need the number of results.
>>>>
>>>> Currently, I'm trying to do this using the Lucene searcher class:
>>>>
>>>>     IndexSearcher searcher = new IndexSearcher(reader);
>>>>     String queryString = fieldname + ":" + term;
>>>>     QueryParser parser = new QueryParser(fieldname, new GermanAnalyzer());
>>>>     TopDocs d = searcher.search(parser.parse(queryString), filter, 1);
>>>>     int count = d.totalHits;
>>>>
>>>> The problem is that there is a large (optimized) index with > 8 million
>>>> entries. One search could return a large number of search results
>>>> (> 1 million). Currently these searches take more than 15 seconds.
>>>>
>>>> The question is: is there any way to get the number of search results
>>>> faster? I think it could be optimized by not using a Weight
>>>> object (order is not interesting), but I haven't seen a way to do this.
>>>>
>>>> I hope someone has already solved this problem.
>>>>
>>>> Mathias
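Since both a TermDocs enumeration and a DocIdSetIterator walk doc ids in increasing order, one way to count filtered matches without running a scored search is to leapfrog the two. A sketch against the Lucene 2.9 iterator API (in 2.4.x the methods are next()/skipTo() rather than nextDoc()/advance()); if the DocIdSet returned by the filter happens to be an OpenBitSet, its get(int) method can be used directly instead:

    // Count docs that contain the term AND are allowed by the filter.
    TermDocs termDocs = reader.termDocs(new Term("myField", "myTerm"));
    DocIdSetIterator filterIt = filter.getDocIdSet(reader).iterator();
    int count = 0;
    int filterDoc = filterIt.nextDoc();
    while (termDocs.next() && filterDoc != DocIdSetIterator.NO_MORE_DOCS) {
        int doc = termDocs.doc();
        if (filterDoc < doc) {
            filterDoc = filterIt.advance(doc); // skip the filter forward
        }
        if (filterDoc == doc) {
            count++;
        }
    }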
How to perform a phrase "begins with" query?
Hi all,

Since you can't (and it doesn't make sense to) use wildcards in phrase queries, how do you construct a query to get results for phrases that begin with a certain set of terms? Here are some theoretical examples...

Example 1 - I have an index where each document contains the contents of a short story. I want to return each document that begins with the words "Once upon a time". I know this is not valid Lucene syntax, but what I would like to do is query for "Once upon a time"*

Example 2 - I have an index where each document contains numbered test results, say test 1 through test 5000. I want to return each document where the test starts with the number 5. So the query here would be (again, I know this isn't valid) something like "test 5"*

How can this be accomplished?

Thanks
Paul
Re: How to perform a phrase "begins with" query?
> Since you can't (and it doesn't make sense to) use wildcards in phrase
> queries, how do you construct a query to get results for phrases that begin
> with a certain set of terms? Here are some theoretical examples...
>
> Example 1 - I have an index where each document contains the contents of a
> short story. I want to return each document that begins with the words
> "Once upon a time". I know this is not valid Lucene syntax, but what I
> would like to do is query for "Once upon a time"*

You are trying to retrieve documents that begin with "Once upon a time", right? You want your phrase at the beginning of the document. You can retrieve them programmatically using the SpanQuery family. I am not sure about the value of (int end) in the SpanFirstQuery constructor, but it will be something like this:

    SpanQuery s1 = new SpanTermQuery(new Term("story", "once"));
    SpanQuery s2 = new SpanTermQuery(new Term("story", "upon"));
    SpanQuery s3 = new SpanTermQuery(new Term("story", "time"));
    SpanQuery s4 = new SpanNearQuery(new SpanQuery[] {s1, s2, s3}, 0, true);
    SpanQuery s5 = new SpanFirstQuery(s4, 3);

Note that you need to use the analyzed text of the terms in this approach.

Hope this helps.
RE: New "Stream closed" exception with Java 6 - solved
: It turns out that the cause of the exceptions is in fact adding an item
: twice - so you were correct right at the start :-) I ran a test where I

Glad to see it all worked out.

: Just a minor point: isn't Lucene in a position to detect the duplicate
: insertion attempt and flag it with something less vague than "Stream
: closed"? :-)

Not really... adding a document multiple times is a perfectly legal use case. Adding a document with a Reader-based field where the reader is already closed... that's not legal. (And Lucene doesn't really have any way of knowing if the Reader is closed because *it* closed it.)

-Hoss
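In other words, the failure mode looks roughly like this sketch (field name and content are illustrative):

    // A Field built on a Reader is consumed during the first addDocument
    // call, and Lucene closes the stream when it is done with it.
    Document doc = new Document();
    doc.add(new Field("body", new StringReader("some text")));
    writer.addDocument(doc); // ok: reads and then closes the Reader
    writer.addDocument(doc); // fails: the same Reader is already closed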
Re: How to perform a phrase "begins with" query?
> Since you can't (and it doesn't make sense to) use wildcards in phrase
> queries,

You can with this:
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/misc/src/java/org/apache/lucene/queryParser/complexPhrase/

Discussion here: http://tinyurl.com/lrnage

Cheers,
Mark
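Usage is essentially the same as the standard QueryParser; a sketch based on the contrib code linked above (the field name and searcher setup are illustrative):

    // The complex-phrase parser accepts wildcards inside quoted phrases.
    ComplexPhraseQueryParser parser =
        new ComplexPhraseQueryParser("story", new StandardAnalyzer());
    Query q = parser.parse("\"once upon a t*\"");
    TopDocs hits = searcher.search(q, null, 10);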
Re: Combining hits from multiple documents into a single hit
Assuming I understand you correctly, then...

1. properties only exist as part of a single article (no articles share a complex property)
2. you don't have any need to ever return searches on properties; they exist just to aid in searching for articles.

If that's correct, then the idea I would try is to index only one document per article, with all of the text included, and use payloads to annotate which text is secured by which property. Then use SpanQueries to search for your docs, and in a custom HitCollector check the matching spans for each doc to get the corresponding property, and test that triple against your security mechanism -- if any fail, skip that doc.

It's not something I've ever tried (or thought through very hard), but based on other comments I've seen from people about payloads, it sounds like it should work pretty well and give you decent scores.

: [I originally posted this to the Lucene.net mailing list, but it was
: suggested that I might have more luck here]
:
: I am trying to get a particular search to work and it is proving
: problematic. The actual source data is quite complex but can be summarised
: by the following example:
:
: I have articles that are indexed so that they can be searched. Each article
: also has multiple properties associated with it which are also indexed and
: searchable. When users search, they can get hits in either the main article
: or the associated properties. Regardless of where a hit is achieved, the
: article is returned as a search hit (i.e. the properties are never a hit in
: their own right).
:
: Now for the complexity:
:
: Each property has security on it, which means that for any given user, they
: may or may not be able to see the property. If a user cannot see a
: property, they obviously do not get a search hit in it. This security check
: is proprietary and cannot be done using the typical mechanism of storing a
: role in the index alongside the other fields in the document.
:
: I currently have an index that contains the articles and properties indexed
: separately (i.e. an article is indexed as a document, and each property has
: its own document). When a search happens, a hit in article A or a hit in
: any of the properties of article A should be classed as a hit for article A
: alone, with the scores combined.
:
: Whether or not a user can see a property is not based on the property
: itself, but on the value of the property. I therefore cannot put the extra
: security conditions into the query upfront, as I don't know the value to
: filter by.
:
: As an example:
:
:     +---------+------------+------------+
:     | Article | Property 1 | Property 2 |
:     +---------+------------+------------+
:     |    A    |     X      |     J      |
:     |    B    |     Y      |     K      |
:     |    C    |     Z      |     L      |
:     +---------+------------+------------+
:
: If a user can see everything, then searching for "B and Y" will return a
: single search result for article B.
:
: If another user cannot see a property whose value contains Y, then
: searching for "B and Y" will return no hits.
:
: I have no way of knowing what values a user can and cannot see upfront. The
: only way to tell is to perform the security check (currently done at the
: time of filtering a hit from a field in the document), which I obviously
: cannot do for every possible data value for each user.
:
: To achieve this originally, Lucene v1.3 was modified by changing
: BooleanQuery to have a custom Scorer that could apply the logic of the
: security check and the combination of two hits in different documents being
: classed as a hit in a single document. I am trying to upgrade this version
: to the latest (v2.3.2 - I am using Lucene.Net), but ideally without having
: to modify Lucene in any way.
:
: An additional problem occurs if I do an AND search. If an article contains
: the word foo and one of its properties contains the word bar, then
: searching for "foo AND bar" will return the article as a hit. My current
: code deals with this inside the custom Scorer.
:
: Any ideas how/if this can be done?
:
: I am thinking along the lines of using a custom HitCollector and passing
: that into the search, but when doing the boolean search "foo AND bar",
: execution never reaches my HitCollector because the ConjunctionScorer
: filters out all of the results from the sub-queries before getting there.
:
: Thanks,
:
: Adrian

-Hoss
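For the per-hit vetting part of this, a custom HitCollector in the 2.3-era API might look like the sketch below. checkSecurity() is a hypothetical stand-in for the proprietary mechanism, and the span/payload inspection that maps a hit back to a specific property is omitted:

    // Collect only the documents that pass the external security check.
    final List<Integer> allowedDocs = new ArrayList<Integer>();
    searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            if (checkSecurity(doc)) { // hypothetical proprietary check
                allowedDocs.add(doc);
            }
        }
    });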
Re: Filtering question/advice
FWIW: a test case with multiple asserts is more useful if you clarify where it fails... i.e.: show us the failure message, or put a comment on the line of the assert that fails.

I didn't run your test case, but skimming it, a few things jumped out at me that might explain whatever problem you are seeing...

:    Field uw1 = new Field("uw-refernce", "hello", Field.Store.NO,
:        Field.Index.ANALYZED);
:    Field uw2 = new Field("uw-refernce", "bye", Field.Store.NO,
:        Field.Index.ANALYZED);
...
:    layerDocumentA = new Document();
:    layerDocumentA.add(uw1);
:    layerDocumentA.add(uw1);

...did you really mean to add uw1 twice? Or did you mean to add uw2 as well (it's never used)?

: public void testUWBCanSeeResultIfSearchTermMatchesOnSomethingElse()
:     throws Exception {
...
:    UnderwriterReferenceFilter filter = new UnderwriterReferenceFilter();

...you never set any properties on this Filter before you use it. Reading its implementation, that should cause an IllegalArgumentException.

-Hoss