Re: [Performance] Streaming main memory indexing of single strings

Wolfgang Hoschek Sat, 16 Apr 2005 10:17:39 -0700

On Apr 16, 2005, at 2:58 AM, Erik Hatcher wrote:

On Apr 15, 2005, at 9:50 PM, Wolfgang Hoschek wrote:
So, all the text analyzed is in a given field... that means that anything in the Query not associated with that field has no bearing on whether the text matches or not, correct?
Right, it has no bearing. A query wouldn't specify any fields, it just uses the implicit default field name.
Cool. My questions regarding how to deal with field names are obviously more an implementation detail under the covers of the match() method than how you want to use it. In a general sense, though, it's necessary to deal with the default field name, queries that have non-default-field terms, and the analysis process.

Right, I'd just like to first assess rough overall efficiency before tying up some loose ends.

(: An XQuery that finds all books authored by James that have something to do with "fish", sorted by relevance :)
declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
declare variable $query := "fish*~"; (: any arbitrary fuzzy lucene query goes here :)
Note that "fish*~" is not a valid query expression :)

Perhaps the Lucene QueryParser should throw an exception then. Currently 1.4.3 accepts the expression as is without grumbling...

(I love how XQuery uses smiley emoticons for comments) BTW, I have a strong vested interest in seeing a fast and scalable XQuery engine in the open source world. I've toyed with eXist some - it was not stable or scalable enough for my needs. Lots of Wolfgangs in the XQuery world :)

If you're looking for an XML DB for managing and querying large persistent data volumes, Nux/Saxon will disappoint you. If, on the other hand, you're looking for a very fast XQuery engine inserted into a processing pipeline working with many small to medium-sized XML documents (such as messages in a scalable message queue or network router), then you might be pleased.

for $book in /books/book[author="James" and lucene:match(string(.), $query) > 0.0]
let $score := lucene:match(string($book), $query)
order by $score descending
return (<score>{$score}</score>, $book)
Could you avoid calling match() twice here?

That's no problem for two reasons: 1) The XQuery optimizer rewrites the query into an optimized expression tree, eliminating redundancies, etc. 2) If for some reason this isn't feasible or legal, there's a smart cache between the XQuery engine and the Lucene invocation that returns results in O(1) for Lucene queries that have already been seen/processed before. It caches (queryString, result), plus parsed Lucene queries, plus the Lucene index data structure for any given string text (which currently is a simple RAMDirectory but could be whatever data structure we come up with as part of the exercise - class StringIndex or some such). This works so well that I have to disable the cache to avoid getting astronomically good figures on artificial benchmarks.
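
A minimal sketch of the first caching layer described above, in modern Java. It is not the Nux code: the class name MatchCache, the scorer callback, and the misses() counter are hypothetical names introduced for illustration, and the scorer merely stands in for the real Lucene invocation. The sketch memoizes only the score per (text, queryString) pair; caching of parsed queries and of the per-string index is left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical memoizing wrapper; not part of Nux or Lucene. */
final class MatchCache {
    // Key is the (text, queryString) pair; List gives value-based equals/hashCode.
    private final Map<List<String>, Float> results = new HashMap<>();
    // Assumption: the scorer stands in for the actual Lucene match and never returns null.
    private final BiFunction<String, String, Float> scorer;
    private int misses = 0;

    MatchCache(BiFunction<String, String, Float> scorer) {
        this.scorer = scorer;
    }

    /** Returns the cached score if this pair was seen before, else computes and stores it. */
    float match(String text, String query) {
        List<String> key = Arrays.asList(text, query);
        Float cached = results.get(key);
        if (cached == null) {
            misses++;
            cached = scorer.apply(text, query);
            results.put(key, cached);
        }
        return cached;
    }

    /** Number of times the underlying scorer actually ran. */
    int misses() {
        return misses;
    }
}
```

With such a wrapper, calling match() twice with the same arguments, as the XQuery above does, would run the underlying scorer only once. The sketch is unbounded and not thread-safe; a real cache would need an eviction policy.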

some skeleton:
        private static final String FIELD_NAME = "content"; // or whatever - it doesn't matter

        public Query parseQuery(String expression) throws ParseException {
                QueryParser parser = new QueryParser(FIELD_NAME, analyzer);
                return parser.parse(expression);
        }

        private Document createDocument(String content) {
                Document doc = new Document();
                doc.add(Field.UnStored(FIELD_NAME, content));
                return doc;
        }
This skeleton code doesn't really apply to the custom IndexReader implementation. There is a method to return a document from IndexReader, which I did not implement yet in my sample - it'd be trivial though. I don't think you'd need to get a Lucene Document object back in your use case, but for completeness I will add that to my implementation.

Right, it was just to outline that the value of FIELD_NAME doesn't really matter.

There is still some missing trickery in my StringIndexReader - it does not currently handle phrase queries as an implementation of termPositions() is needed.

Wolfgang - will you take what I've done the extra mile and implement what's left (frequency and term position)? I might not revisit this very soon.
I'm not sure I'll be able to pull it off, but I'll see what I can do. If someone more competent would like to help out, let me know... Thanks for all the help anyway, Erik and co, it is greatly appreciated!
If you can build an XQuery engine, you can hack in some basic Java data structures that keep track of word positions and frequency :)

There's a learning curve ahead of me, not having worked at that low a level with Lucene before :-) Mark Harwood sent me some good but somewhat unfinished code he wrote previously for similar scenarios. I'll look into merging his pieces and your skeleton.
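
To make the missing pieces discussed above (term frequency and term positions for a single string) concrete, here is a minimal sketch in modern Java. It is neither Erik's StringIndexReader nor Mark Harwood's code, and it uses no Lucene API: the class SingleStringIndex and its methods are hypothetical, and lowercasing plus splitting on non-alphanumeric characters is an assumed stand-in for a real Analyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: maps each term of one string to its token positions. */
final class SingleStringIndex {
    private final Map<String, List<Integer>> positions = new HashMap<>();

    SingleStringIndex(String text) {
        // Assumption: this crude tokenization replaces a proper Lucene Analyzer.
        int pos = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue; // a leading separator yields an empty first token
            positions.computeIfAbsent(token, k -> new ArrayList<>()).add(pos++);
        }
    }

    /** Term frequency: how often the term occurs in the indexed string. */
    int freq(String term) {
        return positions(term).size();
    }

    /** Zero-based token positions of the term, in ascending order. */
    List<Integer> positions(String term) {
        List<Integer> p = positions.get(term);
        return p == null ? Collections.emptyList() : Collections.unmodifiableList(p);
    }

    /** True if the terms occur at consecutive positions, as a phrase query requires. */
    boolean containsPhrase(String... terms) {
        if (terms.length == 0) return false;
        for (int start : positions(terms[0])) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                ok = positions(terms[i]).contains(start + i);
            }
            if (ok) return true;
        }
        return false;
    }
}
```

The position lists are what an implementation of termPositions() would have to expose, and their sizes give the frequencies. The phrase check here is a linear scan chosen for brevity, not speed.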

By now I'm quite confident this can be done reasonably efficiently. BTW, I have some small performance patches for FastCharStream and in various other places, but I'll hold off proposing those until our exercise is done and the real merits/drawbacks of those patches can be better assessed.

I'll tinker with it some more for fun in the near future, but anyone else is welcome to flesh out the missing pieces.


Thanks again for kindly helping out!
Wolfgang.


-----------------------------------------------------------------------
Wolfgang Hoschek                  |   email: [EMAIL PROTECTED]
Distributed Systems Department    |   phone: (415)-533-7610
Berkeley Laboratory               |   http://dsd.lbl.gov/~hoschek/
-----------------------------------------------------------------------
