If you'd consider using a MemoryIndex for this, I'd recommend also having a look at nux.xom.pool.FullTextUtil and nux.xom.pool.FullTextPool, which add smart caching for indexes, queries and results on top of a MemoryIndex. With some luck this (or some variant of it) could help speed up your use cases, at least as far as I gather.

[It's part of the Nux download]

Wolfgang.

Snippet from the javadoc:

/**
 * Thread-safe XQuery/XPath fulltext search utilities; implemented with the
 * Lucene engine and a custom high-performance adapter for
 * on-the-fly main memory indexing with smart caching for indexes, queries and results.
 * <p>
 * Complementing the standard XPath string and regular
 * expression matching functionality, Lucene has a powerful query syntax with support
 * for word stemming, fuzzy searches, similarity searches, approximate searches,
 * boolean operators, wildcards, grouping, range searches, term boosting, etc.
 * For details see the <a target="_blank"
 * href="http://lucene.apache.org/java/docs/queryparsersyntax.html">Lucene Query
 * Syntax and Examples</a>.
 * Also see {@link org.apache.lucene.index.memory.MemoryIndex}
 * and {@link PatternAnalyzer} for detailed documentation.
 * <p>
 * Example Java usage:
 * <pre>
 * Analyzer analyzer = PatternAnalyzer.DEFAULT_ANALYZER;
 * float score = FullTextUtil.match(
 *    "Readings about Salmons and other select Alaska fishing Manuals",
 *    "+salmon~ +fish* manual~",
 *    analyzer, analyzer);
 * if (score &gt; 0.0f) {
 *     // query matches text
 * } else {
 *     // query does not match text
 * }
 * </pre>


On Jan 4, 2006, at 6:03 AM, karl wettin wrote:

Hello list,

I wrote a search agent thingy for Lucene. It was built to handle a huge number of agents.

Rather than one query per agent to find out if the new document is interesting or not, agent trigger queries are stored in an index that is queried with the tokens of a new document.

Since it uses the index a bit backwards the agent trigger queries are somewhat limited:

At least one token in an OR or FUZZY OR per agent field must match the new document.
Any NOT token in an agent must not match the new document.

It is fairly easy to add more query types, but they are limited to single-token, non-wildcard types since the query is created from the new document's tokens.
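To make the "backwards" matching concrete, here is a minimal sketch that uses a plain in-memory map instead of a Lucene index. It is not the actual agent code: AgentMatchSketch, Agent, or(), not() and match() are hypothetical names, and FUZZY OR is left out. Agent OR tokens are indexed by field and token, the new document's tokens are used to look up candidate agents, and each candidate is then checked against the two rules above.

```java
import java.util.*;

/** Hypothetical sketch of reverse agent matching; a HashMap stands in for the Lucene index. */
final class AgentMatchSketch {

    /** One agent: per field, OR tokens (at least one must match) and NOT tokens (none may match). */
    public static final class Agent {
        final String id;
        final Map<String, Set<String>> anyOf = new HashMap<>();
        final Map<String, Set<String>> noneOf = new HashMap<>();

        public Agent(String id) { this.id = id; }

        public Agent or(String field, String... tokens) {
            anyOf.computeIfAbsent(field, f -> new HashSet<>()).addAll(Arrays.asList(tokens));
            return this;
        }

        public Agent not(String field, String... tokens) {
            noneOf.computeIfAbsent(field, f -> new HashSet<>()).addAll(Arrays.asList(tokens));
            return this;
        }
    }

    // Inverted index: "field:token" -> agents that list this token in an OR clause.
    private final Map<String, Set<Agent>> index = new HashMap<>();

    /** Indexes the agent's OR tokens. An agent without any OR clause is never found. */
    public void add(Agent agent) {
        for (Map.Entry<String, Set<String>> e : agent.anyOf.entrySet()) {
            for (String token : e.getValue()) {
                index.computeIfAbsent(e.getKey() + ":" + token, k -> new LinkedHashSet<>()).add(agent);
            }
        }
    }

    /** Returns the ids of all agents triggered by a document given as field -> tokens. */
    public Set<String> match(Map<String, Set<String>> doc) {
        // Step 1: the document's tokens are the "query"; collect every agent they hit.
        Set<Agent> candidates = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> e : doc.entrySet()) {
            for (String token : e.getValue()) {
                Set<Agent> hits = index.get(e.getKey() + ":" + token);
                if (hits != null) candidates.addAll(hits);
            }
        }
        // Step 2: keep only candidates that satisfy every OR field and no NOT token.
        Set<String> triggered = new TreeSet<>();
        for (Agent agent : candidates) {
            if (satisfies(agent, doc)) triggered.add(agent.id);
        }
        return triggered;
    }

    private static boolean satisfies(Agent agent, Map<String, Set<String>> doc) {
        for (Map.Entry<String, Set<String>> e : agent.anyOf.entrySet()) {
            Set<String> docTokens = doc.getOrDefault(e.getKey(), Collections.emptySet());
            if (Collections.disjoint(e.getValue(), docTokens)) return false; // no OR token matched
        }
        for (Map.Entry<String, Set<String>> e : agent.noneOf.entrySet()) {
            Set<String> docTokens = doc.getOrDefault(e.getKey(), Collections.emptySet());
            if (!Collections.disjoint(e.getValue(), docTokens)) return false; // a NOT token matched
        }
        return true;
    }
}
```

Because the lookup key is a single field and token pair, the sketch shows why only single-token, non-wildcard triggers fit this scheme.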

Agents are clustered by the fields each agent requires, and each cluster is stored in its own index. When a new document is sent to the AgentManager it creates one query per possible cluster. I'm not sure this actually speeds things up, just a gut feeling.
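A rough sketch of the clustering idea, under the assumption that a cluster is "possible" for a document when the document contains every field the cluster's agents require. AgentClusterSketch and its methods are hypothetical names, and a list of agent ids stands in for the per-cluster index.

```java
import java.util.*;

/** Hypothetical sketch: agents grouped by the set of fields they require. */
final class AgentClusterSketch {

    // Cluster key (set of required field names) -> ids of the agents in that cluster.
    // In the design described above, each cluster would be a separate index instead of a list.
    private final Map<Set<String>, List<String>> clusters = new HashMap<>();

    public void add(String agentId, Set<String> requiredFields) {
        clusters.computeIfAbsent(new TreeSet<>(requiredFields), k -> new ArrayList<>()).add(agentId);
    }

    /** Clusters worth querying: those whose required fields all occur in the document. */
    public List<Set<String>> possibleClusters(Set<String> docFields) {
        List<Set<String>> result = new ArrayList<>();
        for (Set<String> required : clusters.keySet()) {
            if (docFields.containsAll(required)) result.add(required);
        }
        return result;
    }

    public List<String> agentsIn(Set<String> requiredFields) {
        return clusters.getOrDefault(requiredFields, Collections.emptyList());
    }
}
```

A document with only a "name" field would then skip every cluster that also requires "category", which is where any speedup would have to come from.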

Example agents in pseudo trigger query language:

Possible agent:

AND (OR ("category","media"))
AND (OR ("name", "hotel") OR ("name","rowanda"))
AND (NOT("name", "paradise"))

Impossible agent:

AND (OR ("category","media"))
AND (("name", "hotel") AND ("name","rowanda"))
AND (NOT("name", "paradise"))

In effect the agents can't trigger on AND queries of the same field.

One could of course place a more complex query on the new document as the agent triggers, or use some classifier or whatever, if speed is not a big deal. The agent triggers could then be built from the original query. I probably won't implement such a thing myself.

Should I post the code to the sandbox when I've tested it? Are there any restrictions on the code if I do that?

--
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



