Replying to myself: after an IRC chat the problem got clarified. The scale problem is not the amount of data to match against, but the number of Queries registered in the system, against which each new Document needs to be matched.

Assuming we can store the Queries as Lucene Queries in the grid as instances (you'll need to figure out some way to serialize them, but that should be easy since you can track how you create them), you don't index the Document in the usual Lucene index; instead you create an instance of org.apache.lucene.index.memory.MemoryIndex and index the Document into that.

There is a full example in the javadocs:
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/memory/MemoryIndex.html

Keep in mind this is not included in the Lucene core jar; you'll need the additional dependency "lucene-memory".

I guess the rest is trivial: wrap it into a Map/Reduce task and have it fire against all stored queries in the grid.

Sorry for the confusion I had in the first answer.

Cheers,
Sanne

On 19 June 2012 20:40, Sanne Grinovero <sa...@infinispan.org> wrote:
> Hi Ales,
>
> there are several strategies, what might work best depends on several
> factors, not least on how many queries, index size, how much memory we
> can dedicate for query caches, and what the ratio of updates is.
>
> A Lucene Query produces a sparse BitSet, you can think of it as an
> ordered list of matching ids, and a common use case is to wrap this
> BitSet as a Filter so that it can be cached, reused and applied as
> mask on other queries.
>
> Assuming your set of predefined queries is rather limited, you can
> cache all these BitSets, and when you deal with a specific document,
> you search for it by "primary key" in the index (which is a very
> efficient query), so you get what identifier it has (as index in the
> bitset), and then you just look which queries are having a match.
>
> The good news is that reusing those BitSets is very efficient, the bad news
> is that you have to rebuild some part of each BitSet (average of 10%
> with default configurations) every time an index update is applied.
> As a consequence, if what you need to do is list which queries match
> for every document you *insert* - compared to just reads -
> this is going to be an expensive approach.
>
> Are you going to need this both for a Map/Reduce Query and a Lucene
> Query, or are you just implying that both approaches would be fine for
> you?
>
> Do you have a practical example of such a Query? I'm wondering if
> you're looking for features like MoreLikeThis or tagging suggestions,
> which can be implemented more efficiently in different ways.
>
> Sanne
>
> On 19 June 2012 18:58, Ales Justin <ales.jus...@gmail.com> wrote:
>> @Sanne, Vladimir: a think-task for you two :)
>>
>> With CapeDwarf we need the following feature -- just the opposite from query
>> results.
>> A user has a document, and a set of pre-defined queries.
>> Now we need to see which queries match the given document.
>>
>> A dummy impl is to iterate over queries and find the ones that match.
>> But, this is of course not scalable.
>>
>> Any idea / suggestion on how to prepare Infinispan Query together with
>> Distributed Execution framework to handle such feature?
>>
>> -Ales
>>
>>
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev@lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
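
The MemoryIndex suggestion at the top of this message boils down to: build a throwaway index holding only the new Document, then run every registered query against it. The following is a minimal plain-Java sketch of that shape, not Lucene code. SingleDocIndex, StoredQuery and ReverseMatch are hypothetical names invented for illustration, and the tokenizer is a naive stand-in for a Lucene Analyzer. With Lucene, MemoryIndex (its addField and search methods) and Query instances would play these roles, and the loop in matchingQueryIds is the part one would presumably distribute as a Map/Reduce task over the grid.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a single-document, in-memory index. */
final class SingleDocIndex {
    private final Map<String, Set<String>> termsByField = new HashMap<>();

    /** Lowercases the text and splits it on non-alphanumeric characters. */
    void addField(String field, String text) {
        Set<String> terms = termsByField.computeIfAbsent(field, f -> new HashSet<>());
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
    }

    boolean hasTerm(String field, String term) {
        Set<String> terms = termsByField.get(field);
        return terms != null && terms.contains(term.toLowerCase(Locale.ROOT));
    }
}

/** Hypothetical stand-in for a stored query (a Lucene Query in the real setup). */
interface StoredQuery {
    boolean matches(SingleDocIndex doc);

    /** Matches when the field contains the given term. */
    static StoredQuery term(String field, String term) {
        return doc -> doc.hasTerm(field, term);
    }

    /** Matches only when every clause matches (boolean AND). */
    static StoredQuery allOf(StoredQuery... clauses) {
        return doc -> {
            for (StoredQuery clause : clauses) {
                if (!clause.matches(doc)) {
                    return false;
                }
            }
            return true;
        };
    }
}

final class ReverseMatch {
    /** Runs every registered query against the one-document index. */
    static List<String> matchingQueryIds(Map<String, StoredQuery> registered,
                                         SingleDocIndex doc) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, StoredQuery> entry : registered.entrySet()) {
            if (entry.getValue().matches(doc)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    /** Registers three example queries and matches one example document. */
    static List<String> demo() {
        Map<String, StoredQuery> registered = new LinkedHashMap<>();
        registered.put("q-grid", StoredQuery.term("content", "grid"));
        registered.put("q-hadoop", StoredQuery.term("content", "hadoop"));
        registered.put("q-lucene-and-index", StoredQuery.allOf(
                StoredQuery.term("content", "lucene"),
                StoredQuery.term("content", "index")));

        SingleDocIndex doc = new SingleDocIndex();
        doc.addField("content", "Storing Lucene queries in the grid next to the index");
        return matchingQueryIds(registered, doc);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [q-grid, q-lucene-and-index]
    }
}
```

The cost per inserted Document is one pass over the registered queries against a tiny index, so nothing has to be rebuilt when the main index changes; the work grows with the number of queries, which is why spreading that loop across the nodes holding the queries is attractive.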