On Thursday 17 March 2005 09:20, David Spencer wrote:
> Robert Watkins wrote:
> 
> > The reason your suggestion is not practical is scalability. In a production
> > environment you might have, for example, 10,000 stored queries and 10 new
> > documents a minute. That's a fair bit of load on the system for only one
> > aspect of a much larger search application.
> 
> Fun, interesting question - maybe you could elaborate on the 
> requirements a bit.
> 
> I believe the basic kind of use case here is something like "Google 
> alerts", only more time sensitive - a large number of queries sitting 
> around waiting to execute against incoming news feeds...
> 
> So:
> 
> Number of stored queries: could be 10,000 (thus a "large" number)
> Incoming docs: 10/min (thus not a huge load)
> 
> Questions:
> 
> - How complicated are the queries? Are they essentially a list of words 
> ANDed together, or are they generalized queries against multiple fields 
> with things like fuzzy term expansion and phrase matches allowed?
> 
> - How big are the incoming documents?
> 
> =====
> 
> I believe the general goal is to avoid looping through all 10k queries 
> every time a doc comes in.
> 
> These approaches come to mind:
> 
> [a] Store queries in memory, and batch incoming documents into a 
> RAMDirectory, and do the dumb loop through all 10k queries every minute, 
> and then destroy the RAMDirectory. Good news: pure Lucene and easy to 
> code; bad news: it has the dreaded loop through all 10k queries.
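
To make [a] concrete, it would look roughly like the sketch below
(untested, written against the 1.4 style API from memory; the class
name, the "body" field and the notification hook are just placeholders):

import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class BatchMatcher {

  // the stored queries, already parsed into Query objects
  private final List storedQueries;

  public BatchMatcher(List storedQueries) {
    this.storedQueries = storedQueries;
  }

  // called once a minute with the texts that arrived since the last call
  public void matchBatch(List texts) throws Exception {
    // index the batch into a throwaway in-memory index
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    for (Iterator it = texts.iterator(); it.hasNext();) {
      Document doc = new Document();
      doc.add(Field.Text("body", (String) it.next()));
      writer.addDocument(doc);
    }
    writer.close();

    // the dreaded loop: run every stored query against the small index
    IndexSearcher searcher = new IndexSearcher(dir);
    for (Iterator it = storedQueries.iterator(); it.hasNext();) {
      Query query = (Query) it.next();
      Hits hits = searcher.search(query);
      if (hits.length() > 0) {
        // notify whoever registered this query
      }
    }
    searcher.close();
    // the RAMDirectory simply goes out of scope; nothing on disk to clean up
  }
}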

If the order of the queries can be chosen by the implementation, the
speed of this loop can be improved by trying to order the queries such
that the next query has terms in common with the previous one(s).
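
One crude way to get some of that ordering, without extra bookkeeping,
is to sort the stored Query objects once by their string form, so that
queries sharing leading terms end up next to each other in the loop.
This is only a rough proxy for real term overlap, but it is cheap; the
class and method names here are made up:

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class QueryOrdering {

  // sort once, when the stored queries are loaded or updated
  public static void orderForTermLocality(List storedQueries) {
    Collections.sort(storedQueries, new Comparator() {
      public int compare(Object a, Object b) {
        // Query.toString() lists the query's terms, so lexicographic
        // order clusters queries that begin with the same terms
        return a.toString().compareTo(b.toString());
      }
    });
  }
}

Grouping by actual term overlap would need the terms extracted from each
query, but even this cheap sort may let consecutive queries reuse some
of the same term data.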

Regards,
Paul Elschot

