Re: [Performance] Streaming main memory indexing of single strings

Erik Hatcher Sat, 16 Apr 2005 02:58:54 -0700

On Apr 15, 2005, at 9:50 PM, Wolfgang Hoschek wrote:

So, all the text analyzed is in a given field... that means that anything in the Query not associated with that field has no bearing on whether the text matches or not, correct?
Right, it has no bearing. A query wouldn't specify any fields; it just uses the implicit default field name.

Cool. My questions regarding how to deal with field names are obviously more an implementation detail under the covers of the match() method than how you want to use it. In a general sense, though, it's necessary to deal with the default field name, queries that have non-default-field terms, and the analysis process.

(: An XQuery that finds all books authored by James that have something to do with "fish", sorted by relevance :)
declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
declare variable $query := "fish*~"; (: any arbitrary fuzzy lucene query goes here :)

Note that "fish*~" is not a valid query expression :) (I love how XQuery uses smiley emoticons for comments.) BTW, I have a strong vested interest in seeing a fast and scalable XQuery engine in the open source world. I've toyed with eXist some, but it was not stable or scalable enough for my needs. Lots of Wolfgangs in the XQuery world :)

for $book in /books/book[author="James" and lucene:match(string(.), $query) > 0.0]
let $score := lucene:match(string($book), $query)
order by $score descending
return (<score>{$score}</score>, $book)


Could you avoid calling match() twice here?

some skeleton:

        private static final String FIELD_NAME = "content"; // or whatever - it doesn't matter

        public Query parseQuery(String expression) throws ParseException {
                QueryParser parser = new QueryParser(FIELD_NAME, analyzer);
                return parser.parse(expression);
        }

        private Document createDocument(String content) {
                Document doc = new Document();
                doc.add(Field.UnStored(FIELD_NAME, content));
                return doc;
        }

This skeleton code doesn't really apply to the custom IndexReader implementation. There is a method to return a document from IndexReader, which I did not implement yet in my sample - it'd be trivial though. I don't think you'd need to get a Lucene Document object back in your use case, but for completeness I will add that to my implementation.

There is still some missing trickery in my StringIndexReader - it does not currently handle phrase queries as an implementation of termPositions() is needed.

Wolfgang - will you take what I've done the extra mile and implement what's left (frequency and term position)? I might not revisit this very soon.
I'm not sure I'll be able to pull it off, but I'll see what I can do. If someone more competent would like to help out, let me know... Thanks for all the help anyway, Erik and co, it is greatly appreciated!

If you can build an XQuery engine, you can hack in some basic Java data structures that keep track of word positions and frequency :)
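To make that a bit more concrete, here is a minimal sketch of the kind of structure I mean, written in present-day Java and independent of Lucene. The class name SingleStringPositions and its methods are hypothetical, and it assumes the tokens have already been run through whatever analyzer you use. Wiring it into a custom IndexReader (for example behind termPositions()) is not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Lucene class): maps each term of ONE string to its token positions. */
final class SingleStringPositions {
    // term -> ascending list of token positions within the single analyzed string
    private final Map<String, List<Integer>> positions = new HashMap<>();

    /** Assumption: tokens are already analyzed and arrive in document order. */
    SingleStringPositions(List<String> tokens) {
        for (int pos = 0; pos < tokens.size(); pos++) {
            positions.computeIfAbsent(tokens.get(pos), k -> new ArrayList<>()).add(pos);
        }
    }

    /** Term frequency: number of occurrences of the term in the string. */
    int freq(String term) {
        return positions(term).size();
    }

    /** Token positions of the term in ascending order; empty if the term is absent. */
    List<Integer> positions(String term) {
        return positions.getOrDefault(term, Collections.emptyList());
    }

    /** Exact phrase check: every term must sit one position after the previous one. */
    boolean containsPhrase(List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start : positions(phrase.get(0))) {
            boolean match = true;
            for (int i = 1; i < phrase.size() && match; i++) {
                match = positions(phrase.get(i)).contains(start + i);
            }
            if (match) {
                return true;
            }
        }
        return false;
    }
}
```

The frequency falls out of the position list for free, which is why one map is enough here. The phrase check is a linear scan and ignores slop; it is only meant to show why positions are needed for phrase queries at all.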

I'll tinker with it some more for fun in the near future, but anyone else is welcome to flesh out the missing pieces.

        Erik

