Re: [Performance] Streaming main memory indexing of single strings
Applied!! Erik

On May 3, 2005, at 1:31 PM, Wolfgang Hoschek wrote:

Here's a performance patch for MemoryIndex.MemoryIndexReader that caches the norms for a given field, avoiding repeated recomputation of the norms. Recall that, depending on the query, norms() can be called over and over again with mostly the same parameters. Thus, replace public byte[] norms(String fieldName) with the following code:

/** performance hack: cache norms to avoid repeated expensive calculations */
private byte[] cachedNorms;
private String cachedFieldName;
private Similarity cachedSimilarity;

public byte[] norms(String fieldName) {
  byte[] norms = cachedNorms;
  Similarity sim = getSimilarity();
  if (fieldName != cachedFieldName || sim != cachedSimilarity) { // not cached?
    Info info = getInfo(fieldName);
    int numTokens = info != null ? info.numTokens : 0;
    float n = sim.lengthNorm(fieldName, numTokens);
    byte norm = Similarity.encodeNorm(n);
    norms = new byte[] {norm};
    cachedNorms = norms;
    cachedFieldName = fieldName;
    cachedSimilarity = sim;
    if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName + ":" + n + ":" + norm + ":" + numTokens);
  }
  return norms;
}

The effect can be substantial when measured with the profiler, so it's worth it.

Wolfgang.

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [Performance] Streaming main memory indexing of single strings
Here's a performance patch for MemoryIndex.MemoryIndexReader that caches the norms for a given field, avoiding repeated recomputation of the norms. Recall that, depending on the query, norms() can be called over and over again with mostly the same parameters. Thus, replace public byte[] norms(String fieldName) with the following code:

/** performance hack: cache norms to avoid repeated expensive calculations */
private byte[] cachedNorms;
private String cachedFieldName;
private Similarity cachedSimilarity;

public byte[] norms(String fieldName) {
  byte[] norms = cachedNorms;
  Similarity sim = getSimilarity();
  if (fieldName != cachedFieldName || sim != cachedSimilarity) { // not cached?
    Info info = getInfo(fieldName);
    int numTokens = info != null ? info.numTokens : 0;
    float n = sim.lengthNorm(fieldName, numTokens);
    byte norm = Similarity.encodeNorm(n);
    norms = new byte[] {norm};
    cachedNorms = norms;
    cachedFieldName = fieldName;
    cachedSimilarity = sim;
    if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName + ":" + n + ":" + norm + ":" + numTokens);
  }
  return norms;
}

The effect can be substantial when measured with the profiler, so it's worth it.

Wolfgang.
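[Editor's note] The patch above relies on reference (identity) comparison of the field name and Similarity to decide whether the cached single-entry norm array can be reused. A minimal self-contained sketch of that caching pattern follows; the class and the stand-in "expensive" computation are hypothetical, not the Lucene code:

```java
// Sketch of the single-entry cache pattern used in the patch: recompute only
// when the fieldName reference differs from the one seen on the last call.
// (Hypothetical class; the byte computed here merely stands in for the real
// lengthNorm/encodeNorm work.)
public class NormsCacheSketch {
    public int computations = 0;       // counts how often the "expensive" work runs
    private byte[] cachedNorms;
    private String cachedFieldName;

    public byte[] norms(String fieldName) {
        byte[] norms = cachedNorms;
        if (fieldName != cachedFieldName) { // reference comparison, as in the patch
            computations++;                 // stand-in for the expensive calculation
            norms = new byte[] { (byte) fieldName.length() };
            cachedNorms = norms;
            cachedFieldName = fieldName;
        }
        return norms;
    }

    public static void main(String[] args) {
        NormsCacheSketch reader = new NormsCacheSketch();
        String field = "content";
        for (int i = 0; i < 1000; i++) reader.norms(field); // same reference each call
        System.out.println(reader.computations); // prints 1
    }
}
```

The reference comparison is safe here because a caller that repeatedly queries the same field passes the same String object; a different field (or a changed Similarity, in the real patch) invalidates the cache.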
Re: [Performance] Streaming main memory indexing of single strings
Thanks! Wolfgang.

> I've committed this change after it successfully worked for me. Thanks! Erik
Re: [Performance] Streaming main memory indexing of single strings
On May 2, 2005, at 5:21 PM, Wolfgang Hoschek wrote:

Finally found and fixed the bug! The fix is simply to replace MemoryIndex.MemoryIndexReader skipTo() with the following:

public boolean skipTo(int target) {
  if (DEBUG) System.err.println(".skipTo: " + target);
  return next();
}

Apparently lucene-1.4.3 didn't use skipTo() in a way that triggered the bug, while SVN does.

I've committed this change after it successfully worked for me. Thanks!

Erik
Re: [Performance] Streaming main memory indexing of single strings
> The version I sent returns in O(1), if performance was your concern. Or did you mean something else?

> Since 0 is the only document number in the index, a return target == 0; might be nice for skipTo(). It doesn't really help performance, though, and the next() works just as well. Regards, Paul Elschot.

It's not just "return target == 0". Internally next() switches a hasNext flag to false, and that makes it a safer operation...

BTW, did you give the unit tests a shot? Or even better, run it against some of your own queries/test data? That might help to shake out other bugs that might potentially be lurking in remote corners...

Cheers, Wolfgang.
Re: [Performance] Streaming main memory indexing of single strings
On Monday 02 May 2005 23:38, Wolfgang Hoschek wrote:
> > Yes, the svn trunk uses skipTo more often than 1.4.3.
> >
> > However, your implementation of skipTo() needs some improvement.
> > See the javadoc of skipTo of class Scorer:
> >
> > http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int)
>
> What's wrong with the version I sent? Remember that there can be at most
> one document in a MemoryIndex. So the "target" parameter can safely be
> ignored, as far as I can see.

Correct, I did not realize that there is only a single doc in the index.

> > In case the underlying scorers provide skipTo() it's even better to
> > use that.
>
> The version I sent returns in O(1), if performance was your concern. Or
> did you mean something else?

Since 0 is the only document number in the index, a return target == 0; might be nice for skipTo(). It doesn't really help performance, though, and the next() works just as well.

Regards, Paul Elschot.
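[Editor's note] The two skipTo() variants debated here can be contrasted in a tiny self-contained sketch (hypothetical class, not the committed Lucene code): since document 0 is the only document a MemoryIndex can hold, delegating to next() both satisfies the contract and flips the internal exhaustion flag, which is Wolfgang's "safer operation" point:

```java
// Sketch of a document iterator over a single document 0, as in MemoryIndex.
// skipTo(target) can ignore target entirely and delegate to next(), because
// next() also marks the iterator exhausted so repeated calls return false.
public class SingleDocIterator {
    private boolean hasNext = true;

    // consumes the single document; any later next()/skipTo() returns false
    public boolean next() {
        if (!hasNext) return false;
        hasNext = false;
        return true;
    }

    // the committed fix: target is irrelevant when 0 is the only doc number
    public boolean skipTo(int target) {
        return next();
    }
}
```

A bare "return target == 0;" would answer the same first call correctly but, unlike next(), would keep returning true for target 0 on every subsequent call, which is why delegating to next() is the safer choice.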
Re: [Performance] Streaming main memory indexing of single strings
> Yes, the svn trunk uses skipTo more often than 1.4.3.
>
> However, your implementation of skipTo() needs some improvement.
> See the javadoc of skipTo of class Scorer:
>
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int)

What's wrong with the version I sent? Remember that there can be at most one document in a MemoryIndex. So the "target" parameter can safely be ignored, as far as I can see.

> In case the underlying scorers provide skipTo() it's even better to
> use that.

The version I sent returns in O(1), if performance was your concern. Or did you mean something else?

Wolfgang.
Re: [Performance] Streaming main memory indexing of single strings
Wolfgang,

On Monday 02 May 2005 23:21, Wolfgang Hoschek wrote:
> Finally found and fixed the bug!
> The fix is simply to replace MemoryIndex.MemoryIndexReader skipTo()
> with the following:
>
> public boolean skipTo(int target) {
>   if (DEBUG) System.err.println(".skipTo: " + target);
>   return next();
> }
>
> Apparently lucene-1.4.3 didn't use skipTo() in a way that triggered the
> bug, while SVN does.

Yes, the svn trunk uses skipTo more often than 1.4.3.

However, your implementation of skipTo() needs some improvement. See the javadoc of skipTo of class Scorer:

http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int)

In case the underlying scorers provide skipTo() it's even better to use that.

Regards, Paul Elschot
Re: [Performance] Streaming main memory indexing of single strings
Finally found and fixed the bug! The fix is simply to replace MemoryIndex.MemoryIndexReader skipTo() with the following:

public boolean skipTo(int target) {
  if (DEBUG) System.err.println(".skipTo: " + target);
  return next();
}

Apparently lucene-1.4.3 didn't use skipTo() in a way that triggered the bug, while SVN does. I now ran the tests over a much larger set of documents and all tests pass. Give it a shot :-)

Wolfgang.

On May 2, 2005, at 9:05 AM, Wolfgang Hoschek wrote:

I'm looking at it right now. The tests pass fine when you put lucene-1.4.3.jar instead of the current lucene onto the classpath which is what I've been doing so far. Something seems to have changed in the scoring calculation. No idea what that might be. I'll see if I can find out. Wolfgang.

The test case is failing (type "ant test" at the contrib/memory working directory) with this:

[junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
[junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
[junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
[junit] at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
[junit] at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)
Re: [Performance] Streaming main memory indexing of single strings
This is what I have as the scoring calculation, and it seems to do exactly what lucene-1.4.3 does because the tests pass.

public byte[] norms(String fieldName) {
  if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName);
  Info info = getInfo(fieldName);
  int numTokens = info != null ? info.numTokens : 0;
  byte norm = Similarity.encodeNorm(getSimilarity().lengthNorm(fieldName, numTokens));
  return new byte[] {norm};
}

public void norms(String fieldName, byte[] bytes, int offset) {
  if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName + "*");
  byte[] norms = norms(fieldName);
  System.arraycopy(norms, 0, bytes, offset, norms.length);
}

private Similarity getSimilarity() {
  return searcher.getSimilarity(); // this is the normal lucene IndexSearcher
}

Can anyone see what's wrong with it for lucene current SVN? Should my calculation now be done differently? If so, how? Thanks for any clues into the right direction.

Wolfgang.

On May 2, 2005, at 9:05 AM, Wolfgang Hoschek wrote:

I'm looking at it right now. The tests pass fine when you put lucene-1.4.3.jar instead of the current lucene onto the classpath which is what I've been doing so far. Something seems to have changed in the scoring calculation. No idea what that might be. I'll see if I can find out. Wolfgang.
The test case is failing (type "ant test" at the contrib/memory working directory) with this:

[junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
[junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
[junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
[junit] at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
[junit] at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)
Re: [Performance] Streaming main memory indexing of single strings
I'm looking at it right now. The tests pass fine when you put lucene-1.4.3.jar instead of the current lucene onto the classpath which is what I've been doing so far. Something seems to have changed in the scoring calculation. No idea what that might be. I'll see if I can find out. Wolfgang.

The test case is failing (type "ant test" at the contrib/memory working directory) with this:

[junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
[junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
[junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
[junit] at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
[junit] at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)
Re: [Performance] Streaming main memory indexing of single strings
On May 1, 2005, at 10:20 PM, Wolfgang Hoschek wrote:

I've uploaded code that now runs against the current SVN, plus junit test cases, plus some minor internal updates to the functionality itself. For details see http://issues.apache.org/bugzilla/show_bug.cgi?id=34585 Be prepared for the testcases to take some minutes to complete - don't hit CTRL-C :-) Erik, if nobody objects, can you please put this into a contrib area, e.g. module "memory" in org.apache.lucene.index.memory, or similar?

I have committed it into contrib/memory. I made a few minor tweaks such as 2005 for year in license header, putting package statement above license, and adjusting the paths in the test case to match our standard src/test and src/java structure.

The test case is failing (type "ant test" at the contrib/memory working directory) with this:

[junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
[junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
[junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
[junit] at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
[junit] at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)

Your conversion to a JUnit test case was not quite what I had in mind :) You simply wrapped your main() into a testMany method. But it is fine for now as it is easily converted into more granular testXXX methods that use the JUnit assert* methods. The paths to test files will likely need to be parameterized and passed in from Ant's task via system properties in order to run correctly regardless of working directory. These things are easily tweaked though and not worth holding back the initial commit.
Again, I'm impressed with your level of javadocs and thoroughness in the code. Good stuff!

Erik
Re: [Performance] Streaming main memory indexing of single strings
I've uploaded code that now runs against the current SVN, plus junit test cases, plus some minor internal updates to the functionality itself. For details see http://issues.apache.org/bugzilla/show_bug.cgi?id=34585 Be prepared for the testcases to take some minutes to complete - don't hit CTRL-C :-)

Erik, if nobody objects, can you please put this into a contrib area, e.g. module "memory" in org.apache.lucene.index.memory, or similar?

Thanks, Wolfgang.

On Apr 27, 2005, at 10:30 AM, Erik Hatcher wrote:

On Apr 27, 2005, at 12:22 PM, Doug Cutting wrote:

Erik Hatcher wrote: I'm not quite sure where to put MemoryIndex - maybe it deserves to stand on its own in a new contrib area?

That sounds good to me.

Ok... once Wolfgang gives me one last round of updates (JUnit tests instead of main() and upgrading it to work with trunk) I'll do that. I had put it in miscellaneous but will create its own sub-contrib area instead.

Or does it make sense to put this into misc (still in sandbox/misc)? Or where? Isn't the goal for sandbox/ to go away, replaced with contrib/?

Yes. In fact, I moved the last relevant piece (sandbox/contributions/miscellaneous) to contrib last night. I think both the parsers and XML-Indexing-Demo found in the sandbox are not worth preserving. Anyone feel that these pieces left in the sandbox should be preserved?

Erik
Re: [Performance] Streaming main memory indexing of single strings
OK. I'll send an update as soon as I get round to it... Wolfgang.

On Apr 27, 2005, at 12:22 PM, Doug Cutting wrote:

Erik Hatcher wrote: I'm not quite sure where to put MemoryIndex - maybe it deserves to stand on its own in a new contrib area?

That sounds good to me.

Ok... once Wolfgang gives me one last round of updates (JUnit tests instead of main() and upgrading it to work with trunk) I'll do that. I had put it in miscellaneous but will create its own sub-contrib area instead.

Or does it make sense to put this into misc (still in sandbox/misc)? Or where? Isn't the goal for sandbox/ to go away, replaced with contrib/?

Yes. In fact, I moved the last relevant piece (sandbox/contributions/miscellaneous) to contrib last night. I think both the parsers and XML-Indexing-Demo found in the sandbox are not worth preserving. Anyone feel that these pieces left in the sandbox should be preserved?

Erik
Re: [Performance] Streaming main memory indexing of single strings
On Apr 27, 2005, at 12:22 PM, Doug Cutting wrote:

Erik Hatcher wrote: I'm not quite sure where to put MemoryIndex - maybe it deserves to stand on its own in a new contrib area?

That sounds good to me.

Ok... once Wolfgang gives me one last round of updates (JUnit tests instead of main() and upgrading it to work with trunk) I'll do that. I had put it in miscellaneous but will create its own sub-contrib area instead.

Or does it make sense to put this into misc (still in sandbox/misc)? Or where? Isn't the goal for sandbox/ to go away, replaced with contrib/?

Yes. In fact, I moved the last relevant piece (sandbox/contributions/miscellaneous) to contrib last night. I think both the parsers and XML-Indexing-Demo found in the sandbox are not worth preserving. Anyone feel that these pieces left in the sandbox should be preserved?

Erik
Re: [Performance] Streaming main memory indexing of single strings
Whichever place you settle on is fine with me. [In case it might make a difference: Just note that MemoryIndex has a small auxiliary dependency on PatternAnalyzer in addField() because the Analyzer superclass doesn't have a tokenStream(String fieldName, String text) method. And PatternAnalyzer requires JDK 1.4 or higher.]

Wolfgang.

On Apr 27, 2005, at 9:22 AM, Doug Cutting wrote:

Erik Hatcher wrote: I'm not quite sure where to put MemoryIndex - maybe it deserves to stand on its own in a new contrib area?

That sounds good to me.

Or does it make sense to put this into misc (still in sandbox/misc)? Or where? Isn't the goal for sandbox/ to go away, replaced with contrib/?

Doug
Re: [Performance] Streaming main memory indexing of single strings
Erik Hatcher wrote: I'm not quite sure where to put MemoryIndex - maybe it deserves to stand on its own in a new contrib area?

That sounds good to me.

Or does it make sense to put this into misc (still in sandbox/misc)? Or where? Isn't the goal for sandbox/ to go away, replaced with contrib/?

Doug
Re: [Performance] Streaming main memory indexing of single strings
Wolfgang,

You have provided a superb set of patches! I'm in awe of the extensive documentation you've done. There is nothing further you need to do, but be patient while we incorporate it into the contrib area somewhere. Your PatternAnalyzer could fit into the contrib/analyzers area nicely. I'm not quite sure where to put MemoryIndex - maybe it deserves to stand on its own in a new contrib area? Or does it make sense to put this into misc (still in sandbox/misc)? Or where?

Erik

On Apr 26, 2005, at 9:47 PM, Wolfgang Hoschek wrote:

I've uploaded slightly improved versions of the fast MemoryIndex contribution to http://issues.apache.org/bugzilla/show_bug.cgi?id=34585 along with another contrib - PatternAnalyzer. For a quick overview without downloading code, there's javadoc for it all at http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

I'm happy to maintain these classes externally as part of the Nux project. But from the preliminary discussion on the list some time ago I gathered there'd be some wider interest, hence I prepared the contribs for the community. What would be the next steps for taking this further, if any?

Thanks, Wolfgang.

/**
 * Efficient Lucene analyzer/tokenizer that preferably operates on a String rather than a
 * {@link java.io.Reader}, that can flexibly separate on a regular expression {@link Pattern}
 * (with behaviour identical to {@link String#split(String)}),
 * and that combines the functionality of
 * {@link org.apache.lucene.analysis.LetterTokenizer},
 * {@link org.apache.lucene.analysis.LowerCaseTokenizer},
 * {@link org.apache.lucene.analysis.WhitespaceTokenizer},
 * {@link org.apache.lucene.analysis.StopFilter} into a single efficient
 * multi-purpose class.
 *
 * If you are unsure how exactly a regular expression should look like, consider
 * prototyping by simply trying various expressions on some test texts via
 * {@link String#split(String)}. Once you are satisfied, give that regex to
 * PatternAnalyzer. Also see the
 * <a href="http://java.sun.com/docs/books/tutorial/extra/regex/">Java Regular Expression Tutorial</a>.
 *
 * This class can be considerably faster than the "normal" Lucene tokenizers.
 * It can also serve as a building block in a compound Lucene
 * {@link org.apache.lucene.analysis.TokenFilter} chain. For example as in this
 * stemming example:
 *
 * PatternAnalyzer pat = ...
 * TokenStream tokenStream = new SnowballFilter(
 *     pat.tokenStream("content", "James is running round in the woods"),
 *     "English"));

On Apr 22, 2005, at 1:53 PM, Wolfgang Hoschek wrote:

I've now got the contrib code cleaned up, tested and documented into a decent state, ready for your review and comments. Consider this a formal contrib (Apache license is attached). The relevant files are attached to the following bug ID: http://issues.apache.org/bugzilla/show_bug.cgi?id=34585

For a quick overview without downloading code, there's some javadoc at http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

There are several small open issues listed in the javadoc and also inside the code. Thoughts? Comments?

I've also got small performance patches for various parts of Lucene core (not submitted yet). Taken together they lead to substantially improved performance for MemoryIndex, and most likely also for Lucene in general. Some of them are more involved than others. I'm now figuring out how much performance each of these contributes and how to propose potential integration - stay tuned for some follow-ups to this. The code as submitted would certainly benefit a lot from said patches, but they are not required for correct operation. It should work out of the box (currently only on 1.4.3 or lower).
Try running

  cd lucene-cvs
  java org.apache.lucene.index.memory.MemoryIndexTest

with or without custom arguments to see it in action. Before turning to a performance patch discussion I'd at this point rather be most interested in folks giving it a spin, comments on the API, or any other issues.

Cheers, Wolfgang.

On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote:

On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote:

On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote:

By the way, by now I have a version against 1.4.3 that is 10-100 times faster (i.e. 3 - 20 index+query steps/sec) than the simplistic RAMDirectory approach, depending on the nature of the input data and query. From some preliminary testing it returns exactly what RAMDirectory returns.

Awesome. Using the basic StringIndexReader I sent?

Yep, it's loosely based on the empty skeleton you sent. I've been fiddling with it a bit more to get other query types. I'll add it to the contrib area when it's a bit more robust. Perhaps we could merge up once I'm ready and put that into the contrib area?
Re: [Performance] Streaming main memory indexing of single strings
I've uploaded slightly improved versions of the fast MemoryIndex contribution to http://issues.apache.org/bugzilla/show_bug.cgi?id=34585 along with another contrib - PatternAnalyzer. For a quick overview without downloading code, there's javadoc for it all at http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

I'm happy to maintain these classes externally as part of the Nux project. But from the preliminary discussion on the list some time ago I gathered there'd be some wider interest, hence I prepared the contribs for the community. What would be the next steps for taking this further, if any?

Thanks, Wolfgang.

/**
 * Efficient Lucene analyzer/tokenizer that preferably operates on a String rather than a
 * {@link java.io.Reader}, that can flexibly separate on a regular expression {@link Pattern}
 * (with behaviour identical to {@link String#split(String)}),
 * and that combines the functionality of
 * {@link org.apache.lucene.analysis.LetterTokenizer},
 * {@link org.apache.lucene.analysis.LowerCaseTokenizer},
 * {@link org.apache.lucene.analysis.WhitespaceTokenizer},
 * {@link org.apache.lucene.analysis.StopFilter} into a single efficient
 * multi-purpose class.
 *
 * If you are unsure how exactly a regular expression should look like, consider
 * prototyping by simply trying various expressions on some test texts via
 * {@link String#split(String)}. Once you are satisfied, give that regex to
 * PatternAnalyzer. Also see the
 * <a href="http://java.sun.com/docs/books/tutorial/extra/regex/">Java Regular Expression Tutorial</a>.
 *
 * This class can be considerably faster than the "normal" Lucene tokenizers.
 * It can also serve as a building block in a compound Lucene
 * {@link org.apache.lucene.analysis.TokenFilter} chain. For example as in this
 * stemming example:
 *
 * PatternAnalyzer pat = ...
 * TokenStream tokenStream = new SnowballFilter(
 *     pat.tokenStream("content", "James is running round in the woods"),
 *     "English"));

On Apr 22, 2005, at 1:53 PM, Wolfgang Hoschek wrote:

I've now got the contrib code cleaned up, tested and documented into a decent state, ready for your review and comments. Consider this a formal contrib (Apache license is attached). The relevant files are attached to the following bug ID: http://issues.apache.org/bugzilla/show_bug.cgi?id=34585

For a quick overview without downloading code, there's some javadoc at http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

There are several small open issues listed in the javadoc and also inside the code. Thoughts? Comments?

I've also got small performance patches for various parts of Lucene core (not submitted yet). Taken together they lead to substantially improved performance for MemoryIndex, and most likely also for Lucene in general. Some of them are more involved than others. I'm now figuring out how much performance each of these contributes and how to propose potential integration - stay tuned for some follow-ups to this. The code as submitted would certainly benefit a lot from said patches, but they are not required for correct operation. It should work out of the box (currently only on 1.4.3 or lower).

Try running

  cd lucene-cvs
  java org.apache.lucene.index.memory.MemoryIndexTest

with or without custom arguments to see it in action. Before turning to a performance patch discussion I'd at this point rather be most interested in folks giving it a spin, comments on the API, or any other issues.

Cheers, Wolfgang.

On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote:

On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote:

On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote:

By the way, by now I have a version against 1.4.3 that is 10-100 times faster (i.e.
3 - 20 index+query steps/sec) than the simplistic RAMDirectory approach, depending on the nature of the input data and query. From some preliminary testing it returns exactly what RAMDirectory returns.

Awesome. Using the basic StringIndexReader I sent?

Yep, it's loosely based on the empty skeleton you sent. I've been fiddling with it a bit more to get other query types. I'll add it to the contrib area when it's a bit more robust. Perhaps we could merge up once I'm ready and put that into the contrib area? My version now supports tokenization with any analyzer and it supports any arbitrary Lucene query. I might make the API for adding terms a little more general, perhaps allowing arbitrary Document objects if that's what other folks really need...

As an aside, is there any work going on to potentially support prefix (and infix) wild card queries ala "*fish"?

WildcardQuery supports wildcard characters anywhere in the string. QueryParser itself restricts expressions that have leading wildcards from being accepted.
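[Editor's note] The PatternAnalyzer javadoc quoted above recommends prototyping the separator regex via String#split before handing it to the analyzer. A minimal self-contained sketch of that prototyping step (hypothetical class, not the PatternAnalyzer source) that also mimics the LowerCaseTokenizer/StopFilter behavior the javadoc mentions:

```java
// Prototype of split-based tokenization: separate on a regex exactly as
// String#split does, lower-case each token, and drop stop words.
// (Illustrative only; PatternAnalyzer itself produces a Lucene TokenStream.)
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SplitTokenizerSketch {
    public static List<String> tokenize(String text, String separatorRegex, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String tok : text.split(separatorRegex)) {
            String t = tok.toLowerCase(Locale.ROOT);
            if (!t.isEmpty() && !stopWords.contains(t)) tokens.add(t);
        }
        return tokens;
    }
}
```

For example, splitting "James is running round in the woods" on "\\s+" with stop words {is, in, the} yields [james, running, round, woods], which is exactly the behavior one would then expect from the analyzer configured with the same regex and stop set.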
Re: [Performance] Streaming main memory indexing of single strings
I've now got the contrib code cleaned up, tested and documented into a decent state, ready for your review and comments. Consider this a formal contrib (Apache license is attached). The relevant files are attached to the following bug ID: http://issues.apache.org/bugzilla/show_bug.cgi?id=34585

For a quick overview without downloading code, there's some javadoc at http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

There are several small open issues listed in the javadoc and also inside the code. Thoughts? Comments?

I've also got small performance patches for various parts of Lucene core (not submitted yet). Taken together they lead to substantially improved performance for MemoryIndex, and most likely also for Lucene in general. Some of them are more involved than others. I'm now figuring out how much performance each of these contributes and how to propose potential integration - stay tuned for some follow-ups to this. The code as submitted would certainly benefit a lot from said patches, but they are not required for correct operation. It should work out of the box (currently only on 1.4.3 or lower).

Try running

  cd lucene-cvs
  java org.apache.lucene.index.memory.MemoryIndexTest

with or without custom arguments to see it in action. Before turning to a performance patch discussion I'd at this point rather be most interested in folks giving it a spin, comments on the API, or any other issues.

Cheers, Wolfgang.

On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote:

On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote:

On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote:

By the way, by now I have a version against 1.4.3 that is 10-100 times faster (i.e. 3 - 20 index+query steps/sec) than the simplistic RAMDirectory approach, depending on the nature of the input data and query. From some preliminary testing it returns exactly what RAMDirectory returns.

Awesome. Using the basic StringIndexReader I sent?
Yep, it's loosely based on the empty skeleton you sent. I've been fiddling with it a bit more to get other query types. I'll add it to the contrib area when it's a bit more robust. Perhaps we could merge up once I'm ready and put that into the contrib area? My version now supports tokenization with any analyzer and it supports any arbitrary Lucene query. I might make the API for adding terms a little more general, perhaps allowing arbitrary Document objects if that's what other folks really need...

As an aside, is there any work going on to potentially support prefix (and infix) wild card queries ala "*fish"?

WildcardQuery supports wildcard characters anywhere in the string. QueryParser itself restricts expressions that have leading wildcards from being accepted. Any particular reason for this restriction? Is this simply a current parser limitation or something inherent?

QueryParser supports wildcard characters in the middle of strings no problem though. Are you seeing otherwise?

I meant an infix query such as "*fish*"

Wolfgang.

---
Wolfgang Hoschek                  | email: [EMAIL PROTECTED]
Distributed Systems Department    | phone: (415)-533-7610
Berkeley Laboratory               | http://dsd.lbl.gov/~hoschek/
---
Re: [Performance] Streaming main memory indexing of single strings
On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote: On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote: By the way, by now I have a version against 1.4.3 that is 10-100 times faster (i.e. 3 - 20 index+query steps/sec) than the simplistic RAMDirectory approach, depending on the nature of the input data and query. From some preliminary testing it returns exactly what RAMDirectory returns. Awesome. Using the basic StringIndexReader I sent? Yep, it's loosely based on the empty skeleton you sent. I've been fiddling with it a bit more to get other query types. I'll add it to the contrib area when it's a bit more robust. Perhaps we could merge up once I'm ready and put that into the contrib area? My version now supports tokenization with any analyzer and it supports any arbitrary Lucene query. I might make the API for adding terms a little more general, perhaps allowing arbitrary Document objects if that's what other folks really need... As an aside, is there any work going on to potentially support prefix (and infix) wildcard queries a la "*fish"? WildcardQuery supports wildcard characters anywhere in the string. QueryParser itself restricts expressions that have leading wildcards from being accepted. Any particular reason for this restriction? Is this simply a current parser limitation or something inherent? QueryParser supports wildcard characters in the middle of strings no problem, though. Are you seeing otherwise? I meant an infix query such as "*fish*". Wolfgang. --- Wolfgang Hoschek | email: [EMAIL PROTECTED] Distributed Systems Department | phone: (415)-533-7610 Berkeley Laboratory | http://dsd.lbl.gov/~hoschek/ --- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [Performance] Streaming main memory indexing of single strings
On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote: By the way, by now I have a version against 1.4.3 that is 10-100 times faster (i.e. 3 - 20 index+query steps/sec) than the simplistic RAMDirectory approach, depending on the nature of the input data and query. From some preliminary testing it returns exactly what RAMDirectory returns. Awesome. Using the basic StringIndexReader I sent? I've been fiddling with it a bit more to get other query types. I'll add it to the contrib area when it's a bit more robust. As an aside, is there any work going on to potentially support prefix (and infix) wildcard queries a la "*fish"? WildcardQuery supports wildcard characters anywhere in the string. QueryParser itself restricts expressions that have leading wildcards from being accepted. QueryParser supports wildcard characters in the middle of strings no problem, though. Are you seeing otherwise? Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
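The leading-wildcard restriction discussed above comes down to how terms are matched against the term dictionary: with a sorted dictionary, a trailing wildcard like "fish*" can seek straight to the prefix range, while "*fish" or "*fish*" forces a scan of every term. A self-contained sketch (not Lucene's actual implementation; the helper names here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class WildcardScan {
    // Translate a Lucene-style wildcard pattern (* = any run, ? = one char)
    // into a regex. Hypothetical helper, not Lucene's actual code.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') sb.append(".*");
            else if (c == '?') sb.append('.');
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    // With a leading wildcard there is no prefix to seek to in the sorted
    // term dictionary, so every term must be tested.
    static List<String> match(List<String> sortedTerms, String wildcard) {
        Pattern p = toRegex(wildcard);
        List<String> hits = new ArrayList<>();
        for (String term : sortedTerms) {
            if (p.matcher(term).matches()) hits.add(term);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("catfish", "fish", "fisher", "swordfish", "trout");
        System.out.println(match(terms, "fish*"));   // [fish, fisher]
        System.out.println(match(terms, "*fish"));   // [catfish, fish, swordfish]
        System.out.println(match(terms, "*fish*"));  // [catfish, fish, fisher, swordfish]
    }
}
```

So the restriction is a cost heuristic in QueryParser rather than something inherent: WildcardQuery itself can evaluate a leading wildcard, it just degenerates to a full dictionary scan.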
Re: [Performance] Streaming main memory indexing of single strings
Good point. By the way, by now I have a version against 1.4.3 that is 10-100 times faster (i.e. 3 - 20 index+query steps/sec) than the simplistic RAMDirectory approach, depending on the nature of the input data and query. From some preliminary testing it returns exactly what RAMDirectory returns. I'll do some cleanup and documentation and then post this to the list for review RSN. As an aside, is there any work going on to potentially support prefix (and infix) wildcard queries a la "*fish"? Wolfgang. On Apr 20, 2005, at 6:10 AM, Vanlerberghe, Luc wrote: One reason to choose the 'simplistic IndexReader' approach to this problem over regexes is that the result should be 'bug-compatible' with a standard search over all documents. Differences between the two systems would be difficult to explain to an end-user (let alone for the developer to debug and find the reason in the first place!) Luc -----Original Message----- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Saturday, April 16, 2005 2:09 AM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings On Apr 15, 2005, at 6:15 PM, Wolfgang Hoschek wrote: Cool! For my use case it would need to be able to handle arbitrary queries (previously parsed from a general Lucene query string). Something like: float match(String text, Query query) it's fine with me if it also works for float[] match(String[] texts, Query query) or float match(Document doc, Query query) but that isn't required by the use case. My implementation is nearly that. The score is available as hits.score(0). You would also need an analyzer, I presume, passed to your proposed match() method if you want the text broken into terms. My current implementation is passed a String[] where each item is considered a term for the document. match() would also need a field name to be fully accurate - since the analyzer needs a field name and terms used for searching need a field name. 
The Query may contain terms for any number of fields - how should that be handled? Should only a single field name be passed in and any terms requested for other fields be ignored? Or should this utility morph to assume any words in the text are in any field being asked of it? As for Doug's devil's advocate questions - I really don't know what I'd use it for personally (other than the "match this single string against a bunch of queries"), I just thought it was clever that it could be done. Clever regexes could come close, but it'd be a lot more effort than reusing good ol' QueryParser and this simplistic IndexReader, along with an Analyzer. Erik Wolfgang. I am intrigued by this and decided to mock up a quick and dirty example of such an IndexReader. After a little trial-and-error I got it working at least for TermQuery and WildcardQuery. I've pasted my code below as an example, but there is much room for improvement, especially in terms of performance and also in keeping track of term frequency, and also it would be nicer if it handled the analysis internally. I think something like this would make a handy addition to our contrib area at least. I'd be happy to receive improvements to this and then add it to a contrib subproject. Perhaps this would be a handy way to handle situations where users have queries saved in a system and need to be alerted whenever a new document arrives matching the saved queries? Erik -----Original Message----- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings This seems to be a promising avenue worth exploring. My gut feeling is that this could easily be 10-100 times faster. The drawback is that it requires a fair amount of understanding of intricate Lucene internals, pulling those pieces together and adapting them as required for the seemingly simple "float match(String text, Query query)". 
I might give it a shot but I'm not sure I'll be able to pull this off! Is there any similar code I could look at as a starting point? Wolfgang. On Apr 14, 2005, at 1:13 PM, Robert Engels wrote: I think you are not approaching this the correct way. Pseudo code: Subclass IndexReader. Get tokens from String 'document' using Lucene analyzers. Build simple hash-map based data structures using tokens for terms, and term positions. Reimplement termDocs() and termPositions() to use the structures from above. Run searches. Start again with next document. -----Original Message----- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 2:56 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings Otis, this might be a misunderstanding. - I'm not calling optimize().
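The hash-map structures in Robert's pseudo code can be sketched with plain Java collections. A minimal, hypothetical stand-in (whitespace/lowercase splitting instead of a real Lucene analyzer) for the term-to-positions map that termDocs()/termPositions() would then serve:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StringIndex {
    // term -> token positions of that term within the single "document" string.
    // Lowercased whitespace tokenization stands in for a real Lucene Analyzer.
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            positions.computeIfAbsent(token, t -> new ArrayList<>()).add(pos++);
        }
        return positions;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> idx = index("the quick fox and the lazy dog");
        System.out.println(idx.get("the"));         // [0, 4]
        System.out.println(idx.get("fox"));         // [2]
        System.out.println(idx.containsKey("cat")); // false
    }
}
```

With only one document per index, termDocs() always answers "document 0" and the position lists above carry everything a phrase or span query would need.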
RE: [Performance] Streaming main memory indexing of single strings
One reason to choose the 'simplistic IndexReader' approach to this problem over regex's is that the result should be 'bug-compatible' with a standard search over all documents. Differences between the two systems would be difficult to explain to an end-user (let alone for the developer to debug and find the reason in the first place!) Luc -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Saturday, April 16, 2005 2:09 AM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings On Apr 15, 2005, at 6:15 PM, Wolfgang Hoschek wrote: > Cool! For my use case it would need to be able to handle arbitrary > queries (previously parsed from a general lucene query string). > Something like: > > float match(String Text, Query query) > > it's fine with me if it also works for > > float[] match(String[] texts, Query query) or > float(Document doc, Query query) > > but that isn't required by the use case. My implementation is nearly that. The score is available as hits.score(0). You would also need an analyzer, I presume, passed to your proposed match() method if you want the text broken into terms. My current implementation is passed a String[] where each item is considered a term for the document. match() would also need a field name to be fully accurate - since the analyzer needs a field name and terms used for searching need a field name. The Query may contain terms for any number of fields - how should that be handled? Should only a single field name be passed in and any terms request for other fields be ignored? Or should this utility morph to assume any words in the text is in any field being asked of it? As for Doug's devil advocate questions - I really don't know what I'd use it for personally (other than the "match this single string against a bunch of queries"), I just thought it was clever that it could be done. 
Clever regex's could come close, but it'd be a lot more effort than reusing good ol' QueryParser and this simplistic IndexReader, along with an Analyzer. Erik > > Wolfgang. > >> I am intrigued by this and decided to mock a quick and dirty example >> of such an IndexReader. After a little trial-and-error I got it >> working at least for TermQuery and WildcardQuery. I've pasted my >> code below as an example, but there is much room for improvement, >> especially in terms of performance and also in keeping track of term >> frequency, and also it would be nicer if it handled the analysis >> internally. >> >> I think something like this would make a handy addition to our >> contrib area at least. I'd be happy to receive improvements to this >> and then add it to a contrib subproject. >> >> Perhaps this would be a handy way to handle situations where users >> have queries saved in a system and need to be alerted whenever a new >> document arrives matching the saved queries? >> >> Erik >> >> >> >>> >>> >>> -Original Message- >>> From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] >>> Sent: Thursday, April 14, 2005 4:04 PM >>> To: java-dev@lucene.apache.org >>> Subject: Re: [Performance] Streaming main memory indexing of single >>> strings >>> >>> >>> This seems to be a promising avenue worth exploring. My gutfeeling >>> is that this could easily be 10-100 times faster. >>> >>> The drawback is that it requires a fair amount of understanding of >>> intricate Lucene internals, pulling those pieces together and >>> adapting them as required for the seemingly simple "float >>> match(String text, Query query)". >>> >>> I might give it a shot but I'm not sure I'll be able to pull this >>> off! >>> Is there any similar code I could look at as a starting point? >>> >>> Wolfgang. >>> >>> On Apr 14, 2005, at 1:13 PM, Robert Engels wrote: >>> >>>> I think you are not approaching this the correct way. >>>> >>>> Pseudo code: >>>> >>>> Subclass IndexReader. 
>>>> >>>> Get tokens from String 'document' using Lucene analyzers. >>>> >>>> Build simple hash-map based data structures using tokens for terms, >>>> and term positions. >>>> >>>> reimplement termDocs() and termPositions() to use the structures >>>> from above. >>>> >>>> run searches. >>>> >>>> start again with ne
Re: [Performance] Streaming main memory indexing of single strings
On Apr 16, 2005, at 1:17 PM, Wolfgang Hoschek wrote: Note that "fish*~" is not a valid query expression :) Perhaps the Lucene QueryParser should throw an exception then. Currently 1.4.3 accepts the expression as is without grumbling... Several minor QueryParser weirdnesses like this have turned up recently. Sure enough, that is an odd one. It parses into a PrefixQuery for "fish*" and the ~ is dropped. I consider this a bug as this should really be a parse exception. I've just filed this as a bug: http://issues.apache.org/bugzilla/show_bug.cgi?id=34486 Thanks, Erik. If you're looking for an XML DB for managing and querying large persistent data volumes, Nux/Saxon will disappoint you. I want to store at least several hundred MB up to gigabytes and have this queryable with XQuery. With some luck it might comfortably fit into a (compressed) main memory cache. We previously used Tamino with XPath, but our XML is not well enough normalized to make this very feasible to query. eXist, last I toyed with it, only scaled to 50MB. Ok, so Nux/Saxon is out for our uses. Any recommendations though? I can't recommend any product in particular. There's a comprehensive list of impls at http://www.w3.org/XML/Query Could you briefly summarize the key usecase and an example query? If history is any indicator, then within 2-3 years the big relational DBMS vendors will have caught up with XQuery/XML extensions and eat the little special-purpose XML DB vendors for lunch - just like they did with OODBMS. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [Performance] Streaming main memory indexing of single strings
On Apr 16, 2005, at 1:17 PM, Wolfgang Hoschek wrote: Note that "fish*~" is not a valid query expression :) Perhaps the Lucene QueryParser should throw an exception then. Currently 1.4.3 accepts the expression as is without grumbling... Several minor QueryParser weirdnesses like this have turned up recently. Sure enough, that is an odd one. It parses into a PrefixQuery for "fish*" and the ~ is dropped. I consider this a bug as this should really be a parse exception. I've just filed this as a bug: http://issues.apache.org/bugzilla/show_bug.cgi?id=34486 If you're looking for an XML DB for managing and querying large persistent data volumes, Nux/Saxon will disappoint you. I want to store at least several hundred MB up to gigabytes and have this queryable with XQuery. We previously used Tamino with XPath, but our XML is not well enough normalized to make this very feasible to query. eXist, last I toyed with it, only scaled to 50MB. Ok, so Nux/Saxon is out for our uses. Any recommendations though? Could you avoid calling match() twice here? That's no problem for two reasons: 1) The XQuery optimizer rewrites the query into an optimized expression tree eliminating redundancies, etc. If for some reason this isn't feasible or legal then 2) There's a smart cache between the XQuery engine and the lucene invocation that returns results in O(1) for Lucene queries that have already been seen/processed before. It caches (queryString,result), plus parsed Lucene queries, plus the Lucene index data structure for any given string text (which currently is a simple RAMDirectory but could be whatever datastructure we come up with as part of the exercise - class StringIndex or some such). This works so well that I have to disable the cache to avoid getting astronomically good figures on artificial benchmarks. Cool. 
BTW, I have some small performance patches for FastCharStream and in various other places, but I'll hold off proposing those until our exercise is done and the real merits/drawbacks of those patches can be better assessed. Excellent... we're always interested in performance improvements! Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
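The (queryString, result) cache Wolfgang describes can be sketched with java.util.LinkedHashMap in access-order mode, which gives O(1) lookup plus least-recently-used eviction. The capacity and the Float value type are illustrative assumptions, not the actual Nux implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Memo cache mapping a query string to its previously computed score.
public class QueryResultCache extends LinkedHashMap<String, Float> {
    private final int capacity;

    public QueryResultCache(int capacity) {
        super(16, 0.75f, true); // access-order => eldest entry is least recently used
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Float> eldest) {
        return size() > capacity; // evict LRU entry once over capacity
    }

    public static void main(String[] args) {
        QueryResultCache cache = new QueryResultCache(2);
        cache.put("fish*", 0.8f);
        cache.put("author:james", 0.5f);
        cache.get("fish*");            // touch: "fish*" is now most recently used
        cache.put("*fish*", 0.9f);     // evicts "author:james"
        System.out.println(cache.containsKey("fish*"));        // true
        System.out.println(cache.containsKey("author:james")); // false
    }
}
```

The same pattern extends to caching parsed Query objects keyed by query string, as described above.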
Re: [Performance] Streaming main memory indexing of single strings
On Apr 16, 2005, at 2:58 AM, Erik Hatcher wrote: On Apr 15, 2005, at 9:50 PM, Wolfgang Hoschek wrote: So, all the text analyzed is in a given field... that means that anything in the Query not associated with that field has no bearing on whether the text matches or not, correct? Right, it has no bearing. A query wouldn't specify any fields, it just uses the implicit default field name. Cool. My questions regarding how to deal with field names are obviously more an implementation detail under the covers of the match() method than how you want to use it. In a general sense, though, it's necessary to deal with the default field name, queries that have non-default-field terms, and the analysis process. Right, I'd just like to first assess rough overall efficiency before tying up some loose ends. (: An XQuery that finds all books authored by James that have something to do with "fish", sorted by relevance :) declare namespace lucene = "java:nux.xom.xquery.XQueryUtil"; declare variable $query := "fish*~"; (: any arbitrary fuzzy lucene query goes here :) Note that "fish*~" is not a valid query expression :) Perhaps the Lucene QueryParser should throw an exception then. Currently 1.4.3 accepts the expression as is without grumbling... (I love how XQuery uses smiley emoticons for comments) BTW, I have a strong vested interest in seeing a fast and scalable XQuery engine in the open source world. I've toyed with eXist some - it was not stable or scalable enough for my needs. Lots of Wolfgangs in the XQuery world :) If you're looking for an XML DB for managing and querying large persistent data volumes, Nux/Saxon will disappoint you. If, on the other hand, you're looking for a very fast XQuery engine inserted into a processing pipeline working with many small to medium-sized XML documents (such as messages in a scalable message queue or network router) then you might be pleased. 
for $book in /books/book[author="James" and lucene:match(string(.), $query) > 0.0] let $score := lucene:match(string($book), $query) order by $score descending return ({$score}, $book) Could you avoid calling match() twice here? That's no problem for two reasons: 1) The XQuery optimizer rewrites the query into an optimized expression tree eliminating redundancies, etc. If for some reason this isn't feasible or legal then 2) There's a smart cache between the XQuery engine and the lucene invocation that returns results in O(1) for Lucene queries that have already been seen/processed before. It caches (queryString,result), plus parsed Lucene queries, plus the Lucene index data structure for any given string text (which currently is a simple RAMDirectory but could be whatever datastructure we come up with as part of the exercise - class StringIndex or some such). This works so well that I have to disable the cache to avoid getting astronomically good figures on artificial benchmarks. some skeleton: private static final String FIELD_NAME = "content"; // or whatever - it doesn't matter public Query parseQuery(String expression) throws ParseException { QueryParser parser = new QueryParser(FIELD_NAME, analyzer); return parser.parse(expression); } private Document createDocument(String content) { Document doc = new Document(); doc.add(Field.UnStored(FIELD_NAME, content)); return doc; } This skeleton code doesn't really apply to the custom IndexReader implementation. There is a method to return a document from IndexReader, which I did not implement yet in my sample - it'd be trivial though. I don't think you'd need to get a Lucene Document object back in your use case, but for completeness I will add that to my implementation. Right, it was just to outline that the value of FIELD_NAME doesn't really matter. There is still some missing trickery in my StringIndexReader - it does not currently handle phrase queries as an implementation of termPositions() is needed. 
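The missing termPositions() piece mentioned above is what phrase queries need: once per-term position lists exist, a phrase match reduces to checking for consecutive positions. A hedged, self-contained sketch (plain collections, not the actual StringIndexReader API):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PhraseMatch {
    // Given per-term position lists (the data a termPositions() implementation
    // would expose), a phrase matches if its terms occur at consecutive positions.
    static boolean matchesPhrase(Map<String, List<Integer>> positions, String... phrase) {
        List<Integer> starts = positions.get(phrase[0]);
        if (starts == null) return false;
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < phrase.length; i++) {
                List<Integer> p = positions.get(phrase[i]);
                if (p == null || !p.contains(start + i)) { ok = false; break; }
            }
            if (ok) return true; // all phrase terms found in order
        }
        return false;
    }

    public static void main(String[] args) {
        // positions for the text "the quick fox": the=0, quick=1, fox=2
        Map<String, List<Integer>> pos = new HashMap<>();
        pos.put("the", Arrays.asList(0));
        pos.put("quick", Arrays.asList(1));
        pos.put("fox", Arrays.asList(2));
        System.out.println(matchesPhrase(pos, "quick", "fox")); // true
        System.out.println(matchesPhrase(pos, "fox", "quick")); // false
    }
}
```

Slop (as in PhraseQuery's slop factor) would relax the `start + i` check to a window, but the exact-match case above is the core of it.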
Wolfgang - will you take what I've done the extra mile and implement what's left (frequency and term position)? I might not revisit this very soon. I'm not sure I'll be able to pull it off, but I'll see what I can do. If someone more competent would like to help out, let me know... Thanks for all the help anyway, Erik and co, it is greatly appreciated! If you can build an XQuery engine, you can hack in some basic Java data structures that keep track of word positions and frequency :) There's a learning curve ahead of me, not having worked before at that low a level with Lucene :-) Mark Harwood sent me some good but somewhat unfinished code he wrote previously for similar scenarios. I'll look into merging his pieces and your skeleton. By now I'm quite confident this can be done reasonably efficiently. BTW, I have some small performance patches for FastCharStream and in various other places, but I'll hold off proposing those until our exercise is done and the real merits/drawbacks of those patches can be better assessed.
Re: [Performance] Streaming main memory indexing of single strings
On Apr 15, 2005, at 9:50 PM, Wolfgang Hoschek wrote: So, all the text analyzed is in a given field... that means that anything in the Query not associated with that field has no bearing on whether the text matches or not, correct? Right, it has no bearing. A query wouldn't specify any fields, it just uses the implicit default field name. Cool. My questions regarding how to deal with field names is obviously more an implementation detail under the covers of the match() method than how you want to use it. In a general sense, though, its necessary to deal with default field name, queries that have non-default-field terms, and the analysis process. (: An XQuery that finds all books authored by James that have something to do with "fish", sorted by relevance :) declare namespace lucene = "java:nux.xom.xquery.XQueryUtil"; declare variable $query := "fish*~"; (: any arbitrary fuzzy lucene query goes here :) Note that "fish*~" is not a valid query expression :) (I love how XQuery uses smiley emoticons for comments) BTW, I have a strong vested interest in seeing a fast and scalable XQuery engine in the open source world. I've toyed with eXist some - it was not stable or scalable enough for my needs. Lot's of Wolfgang's in the XQuery world :) for $book in /books/book[author="James" and lucene:match(string(.), $query) > 0.0] let $score := lucene:match(string($book), $query) order by $score descending return ({$score}, $book) Could you avoid calling match() twice here? some skeleton: private static final String FIELD_NAME = "content"; // or whatever - it doesn't matter public Query parseQuery(String expression) throws ParseException { QueryParser parser = new QueryParser(FIELD_NAME, analyzer); return parser.parse(expression); } private Document createDocument(String content) { Document doc = new Document(); doc.add(Field.UnStored(FIELD_NAME, content)); return doc; } This skeleton code doesn't really apply to the custom IndexReader implementation. 
There is a method to return a document from IndexReader, which I did not implement yet in my sample - it'd be trivial though. I don't think you'd need to get a Lucene Document object back in your use case, but for completeness I will add that to my implementation. There is still some missing trickery in my StringIndexReader - it does not currently handle phrase queries as an implementation of termPositions() is needed. Wolfgang - will you take what I've done the extra mile and implement what's left (frequency and term position)? I might not revisit this very soon. I'm not sure I'll be able to pull it off, but I'll see what I can do. If someone more competent would like to help out, let me know... Thanks for all the help anyway, Erik and co, it is greatly appreciated! If you can build an XQuery engine, you can hack in some basic Java data structures that keep track of word positions and frequency :) I'll tinker with it some more for fun in the near future, but anyone else is welcome to flesh out the missing pieces. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [Performance] Streaming main memory indexing of single strings
le situations where users have queries saved in a system and need to be alerted whenever a new document arrives matching the saved queries? Erik -----Original Message----- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings This seems to be a promising avenue worth exploring. My gut feeling is that this could easily be 10-100 times faster. The drawback is that it requires a fair amount of understanding of intricate Lucene internals, pulling those pieces together and adapting them as required for the seemingly simple "float match(String text, Query query)". I might give it a shot but I'm not sure I'll be able to pull this off! Is there any similar code I could look at as a starting point? Wolfgang. On Apr 14, 2005, at 1:13 PM, Robert Engels wrote: I think you are not approaching this the correct way. Pseudo code: Subclass IndexReader. Get tokens from String 'document' using Lucene analyzers. Build simple hash-map based data structures using tokens for terms, and term positions. Reimplement termDocs() and termPositions() to use the structures from above. Run searches. Start again with next document. -----Original Message----- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 2:56 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings Otis, this might be a misunderstanding. - I'm not calling optimize(). That piece is commented out if you look again at the code. 
- The *streaming* use case requires that for each query I add one (and only one) document (aka string) to an empty index: repeat N times (where N is millions or billions): add a single string (aka document) to an empty index; query the index; drop the index (or delete its document) - with the following API being called N times: float match(String text, Query query) So there's no possibility of adding many documents and thereafter running the query. This in turn seems to mean that the IndexWriter can't be kept open - unless I manually delete each document after each query to repeatedly reuse the RAMDirectory, which I've also tried before without any significant performance gain - deletion seems to have substantial overhead in itself. Perhaps it would be better if there were a Directory.deleteAllDocuments() or similar. Did you have some other approach in mind? As I said, Lucene's design doesn't seem to fit this streaming use case pattern well. In *this* scenario one could easily do without any locking, and without byte-level organization in RAMDirectory and RAMFile, etc., because a single small string isn't a large persistent multi-document index. For some background, here's a small example for the kind of XQuery functionality Nux/Lucene integration enables: (: An XQuery that finds all books authored by James that have something to do with "fish", sorted by relevance :) declare namespace lucene = "java:nux.xom.xquery.XQueryUtil"; declare variable $query := "fish*~"; for $book in /books/book[author="James" and lucene:match(string(.), $query) > 0.0] let $score := lucene:match(string($book), $query) order by $score descending return ({$score}, $book) More interestingly, one can use this for classifying and routing XML messages based on rules (i.e. queries) inspecting their content... Any other clues about potential improvements would be greatly appreciated. Wolfgang. 
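The streaming pattern described above — build a throwaway in-memory index per incoming string, score the query against it, then discard — can be sketched without any Lucene machinery at all. The tf-style score below is only an illustrative stand-in for the real "float match(String text, Query query)", and the single-term query is a simplifying assumption:

```java
import java.util.HashMap;
import java.util.Map;

public class StreamingMatch {
    // For each incoming string: build a throwaway term-frequency map, score
    // the query term against it, then let it go -- no IndexWriter, no
    // Directory, no locking. Not Lucene's actual scoring, just a sketch.
    static float match(String text, String queryTerm) {
        Map<String, Integer> freqs = new HashMap<>();
        int numTokens = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            freqs.merge(token, 1, Integer::sum);
            numTokens++;
        }
        Integer tf = freqs.get(queryTerm.toLowerCase());
        // crude length-normalized term frequency in place of Similarity
        return tf == null ? 0.0f : (float) Math.sqrt(tf) / numTokens;
    }

    public static void main(String[] args) {
        // simulated stream of XML message bodies, matched one at a time
        String[] stream = { "james catches a fish", "fish fish fish", "no match here" };
        for (String msg : stream) {
            System.out.println(msg + " -> " + match(msg, "fish"));
        }
    }
}
```

The point of the sketch is the lifecycle, not the scoring: each iteration touches only transient hash maps, which is why a purpose-built structure can beat a RAMDirectory that pays for byte-level file emulation and locking on every cycle.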
On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote: It looks like you are calling that IndexWriter code in some loops, opening it and closing it in every iteration of the loop and also calling optimize. All of those things could be improved. Keep your IndexWriter open, don't close it, and optimize the index only once you are done adding documents to it. See the highlights and the snippets in the first hit: http://www.lucenebook.com/search?query=when+to+optimize Otis --- Wolfgang Hoschek <[EMAIL PROTECTED]> wrote: Hi, I'm wondering if anyone could let me know how to improve Lucene performance for "streaming main memory indexing of single strings". This would help to effectively integrate Lucene with the Nux XQuery engine. Below is a small microbenchmark simulating STREAMING XQuery fulltext search as typical for XML network routers, message queuing systems, P2P networks, etc. In this on-the-fly main memory indexing scenario, each individual string is immediately matched as soon as it becomes available without any persistence involved. This usage scenario and corresponding performance profile is quite different in comparison to fulltext search over persistent (read-mostly) indexes. The benchmark runs at some 3000 luce
Re: [Performance] Streaming main memory indexing of single strings
On Apr 15, 2005, at 8:18 PM, Wolfgang Hoschek wrote: The main issue is to enable handling arbitrary queries (anything derived from o.a.l.search.Query). Yes, there'd be an additional method Analyzer parameter (support any analyzer). The use case does not require field names. One could internally use "content" or anything else for the default field name, which is the one to implicitly be queried... To be general-purpose, though, passing in the default field name would be needed. All of Lucene's built-in analyzers ignore it, but that is not the case everywhere. For example, NutchAnalyzer does different things for different fields. So, all the text analyzed is in a given field... that means that anything in the Query not associated with that field has no bearing on whether the text matches or not, correct? There is still some missing trickery in my StringIndexReader - it does not currently handle phrase queries as an implementation of termPositions() is needed. Wolfgang - will you take what I've done the extra mile and implement what's left (frequency and term position)? I might not revisit this very soon. Erik Wolfgang. On Apr 15, 2005, at 5:08 PM, Erik Hatcher wrote: On Apr 15, 2005, at 6:15 PM, Wolfgang Hoschek wrote: Cool! For my use case it would need to be able to handle arbitrary queries (previously parsed from a general lucene query string). Something like: float match(String Text, Query query) it's fine with me if it also works for float[] match(String[] texts, Query query) or float(Document doc, Query query) but that isn't required by the use case. My implementation is nearly that. The score is available as hits.score(0). You would also need an analyzer, I presume, passed to your proposed match() method if you want the text broken into terms. My current implementation is passed a String[] where each item is considered a term for the document. 
match() would also need a field name to be fully accurate - since the analyzer needs a field name and terms used for searching need a field name. The Query may contain terms for any number of fields - how should that be handled? Should only a single field name be passed in and any terms request for other fields be ignored? Or should this utility morph to assume any words in the text is in any field being asked of it? As for Doug's devil advocate questions - I really don't know what I'd use it for personally (other than the "match this single string against a bunch of queries"), I just thought it was clever that it could be done. Clever regex's could come close, but it'd be a lot more effort than reusing good ol' QueryParser and this simplistic IndexReader, along with an Analyzer. Erik Wolfgang. I am intrigued by this and decided to mock a quick and dirty example of such an IndexReader. After a little trial-and-error I got it working at least for TermQuery and WildcardQuery. I've pasted my code below as an example, but there is much room for improvement, especially in terms of performance and also in keeping track of term frequency, and also it would be nicer if it handled the analysis internally. I think something like this would make a handy addition to our contrib area at least. I'd be happy to receive improvements to this and then add it to a contrib subproject. Perhaps this would be a handy way to handle situations where users have queries saved in a system and need to be alerted whenever a new document arrives matching the saved queries? Erik -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings This seems to be a promising avenue worth exploring. My gutfeeling is that this could easily be 10-100 times faster. 
The drawback is that it requires a fair amount of understanding of intricate Lucene internals, pulling those pieces together and adapting them as required for the seemingly simple "float match(String text, Query query)". I might give it a shot but I'm not sure I'll be able to pull this off! Is there any similar code I could look at as a starting point? Wolfgang. On Apr 14, 2005, at 1:13 PM, Robert Engels wrote: I think you are not approaching this the correct way. Pseudo code: Subclass IndexReader. Get tokens from String 'document' using Lucene analyzers. Build simple hash-map based data structures using tokens for terms, and term positions. reimplement termDocs() and termPositions() to use the structures from above. run searches. start again with next document. -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 2:56 PM To: java-dev@lucene.apache.org
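Robert's pseudocode above (tokenize the single "document" string, then build hash-map structures to back termDocs() and termPositions()) can be sketched in plain JDK code. This is a minimal illustration, not Lucene API: the whitespace split below stands in for a real Analyzer, and a real implementation would subclass IndexReader and serve termDocs()/termPositions() from this map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the per-document term/position map Robert describes. */
public class SingleDocTermMap {
    private final Map<String, List<Integer>> positions = new HashMap<>();

    public SingleDocTermMap(String text) {
        // Naive stand-in for a Lucene Analyzer: lowercase + whitespace split.
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            positions.computeIfAbsent(tokens[pos], k -> new ArrayList<>()).add(pos);
        }
    }

    /** Term frequency: what a termDocs() implementation would report. */
    public int freq(String term) {
        List<Integer> p = positions.get(term);
        return p == null ? 0 : p.size();
    }

    /** Term positions: what a termPositions() implementation would iterate over. */
    public List<Integer> positions(String term) {
        return positions.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        SingleDocTermMap doc =
            new SingleDocTermMap("the quick brown fox jumps over the lazy dog");
        System.out.println(doc.freq("the"));      // 2
        System.out.println(doc.positions("fox")); // [3]
    }
}
```

Building this map is O(number of tokens), and "start again with next document" just means constructing a fresh instance, which is the whole point of avoiding a persistent RAMDirectory.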
Re: [Performance] Streaming main memory indexing of single strings
The main issue is to enable handling arbitrary queries (anything derived from o.a.l.search.Query). Yes, there'd be an additional method Analyzer parameter (support any analyzer). The use case does not require field names. One could internally use "content" or anything else for the default field name, which is the one to implicitly be queried... Wolfgang. On Apr 15, 2005, at 5:08 PM, Erik Hatcher wrote: On Apr 15, 2005, at 6:15 PM, Wolfgang Hoschek wrote: Cool! For my use case it would need to be able to handle arbitrary queries (previously parsed from a general lucene query string). Something like: float match(String Text, Query query) it's fine with me if it also works for float[] match(String[] texts, Query query) or float(Document doc, Query query) but that isn't required by the use case. My implementation is nearly that. The score is available as hits.score(0). You would also need an analyzer, I presume, passed to your proposed match() method if you want the text broken into terms. My current implementation is passed a String[] where each item is considered a term for the document. match() would also need a field name to be fully accurate - since the analyzer needs a field name and terms used for searching need a field name. The Query may contain terms for any number of fields - how should that be handled? Should only a single field name be passed in and any terms request for other fields be ignored? Or should this utility morph to assume any words in the text is in any field being asked of it? As for Doug's devil advocate questions - I really don't know what I'd use it for personally (other than the "match this single string against a bunch of queries"), I just thought it was clever that it could be done. Clever regex's could come close, but it'd be a lot more effort than reusing good ol' QueryParser and this simplistic IndexReader, along with an Analyzer. Erik Wolfgang. 
I am intrigued by this and decided to mock a quick and dirty example of such an IndexReader. After a little trial-and-error I got it working at least for TermQuery and WildcardQuery. I've pasted my code below as an example, but there is much room for improvement, especially in terms of performance and also in keeping track of term frequency, and also it would be nicer if it handled the analysis internally. I think something like this would make a handy addition to our contrib area at least. I'd be happy to receive improvements to this and then add it to a contrib subproject. Perhaps this would be a handy way to handle situations where users have queries saved in a system and need to be alerted whenever a new document arrives matching the saved queries? Erik -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings This seems to be a promising avenue worth exploring. My gutfeeling is that this could easily be 10-100 times faster. The drawback is that it requires a fair amount of understanding of intricate Lucene internals, pulling those pieces together and adapting them as required for the seemingly simple "float match(String text, Query query)". I might give it a shot but I'm not sure I'll be able to pull this off! Is there any similar code I could look at as a starting point? Wolfgang. On Apr 14, 2005, at 1:13 PM, Robert Engels wrote: I think you are not approaching this the correct way. Pseudo code: Subclass IndexReader. Get tokens from String 'document' using Lucene analyzers. Build simple hash-map based data structures using tokens for terms, and term positions. reimplement termDocs() and termPositions() to use the structures from above. run searches. start again with next document. 
-Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 2:56 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings Otis, this might be a misunderstanding. - I'm not calling optimize(). That piece is commented out if you look again at the code. - The *streaming* use case requires that for each query I add one (and only one) document (aka string) to an empty index: repeat N times (where N is millions or billions): add a single string (aka document) to an empty index query the index drop index (or delete its document) with the following API being called N times: float match(String text, Query query) So there's no possibility of adding many documents and thereafter running the query. This in turn seems to mean that the IndexWriter can't be kept open - unless I manually delete each document after each query to repeatedly reuse the RAMDirectory.
Re: [Performance] Streaming main memory indexing of single strings
On Apr 15, 2005, at 6:15 PM, Wolfgang Hoschek wrote: Cool! For my use case it would need to be able to handle arbitrary queries (previously parsed from a general lucene query string). Something like: float match(String Text, Query query) it's fine with me if it also works for float[] match(String[] texts, Query query) or float(Document doc, Query query) but that isn't required by the use case. My implementation is nearly that. The score is available as hits.score(0). You would also need an analyzer, I presume, passed to your proposed match() method if you want the text broken into terms. My current implementation is passed a String[] where each item is considered a term for the document. match() would also need a field name to be fully accurate - since the analyzer needs a field name and terms used for searching need a field name. The Query may contain terms for any number of fields - how should that be handled? Should only a single field name be passed in and any terms request for other fields be ignored? Or should this utility morph to assume any words in the text is in any field being asked of it? As for Doug's devil advocate questions - I really don't know what I'd use it for personally (other than the "match this single string against a bunch of queries"), I just thought it was clever that it could be done. Clever regex's could come close, but it'd be a lot more effort than reusing good ol' QueryParser and this simplistic IndexReader, along with an Analyzer. Erik Wolfgang. I am intrigued by this and decided to mock a quick and dirty example of such an IndexReader. After a little trial-and-error I got it working at least for TermQuery and WildcardQuery. I've pasted my code below as an example, but there is much room for improvement, especially in terms of performance and also in keeping track of term frequency, and also it would be nicer if it handled the analysis internally. I think something like this would make a handy addition to our contrib area at least. 
I'd be happy to receive improvements to this and then add it to a contrib subproject. Perhaps this would be a handy way to handle situations where users have queries saved in a system and need to be alerted whenever a new document arrives matching the saved queries? Erik -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings This seems to be a promising avenue worth exploring. My gutfeeling is that this could easily be 10-100 times faster. The drawback is that it requires a fair amount of understanding of intricate Lucene internals, pulling those pieces together and adapting them as required for the seemingly simple "float match(String text, Query query)". I might give it a shot but I'm not sure I'll be able to pull this off! Is there any similar code I could look at as a starting point? Wolfgang. On Apr 14, 2005, at 1:13 PM, Robert Engels wrote: I think you are not approaching this the correct way. Pseudo code: Subclass IndexReader. Get tokens from String 'document' using Lucene analyzers. Build simple hash-map based data structures using tokens for terms, and term positions. reimplement termDocs() and termPositions() to use the structures from above. run searches. start again with next document. -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 2:56 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings Otis, this might be a misunderstanding. - I'm not calling optimize(). That piece is commented out you if look again at the code. 
- The *streaming* use case requires that for each query I add one (and only one) document (aka string) to an empty index: repeat N times (where N is millions or billions): add a single string (aka document) to an empty index query the index drop index (or delete its document) with the following API being called N times: float match(String text, Query query) So there's no possibility of adding many documents and thereafter running the query. This in turn seems to mean that the IndexWriter can't be kept open - unless I manually delete each document after each query to repeatedly reuse the RAMDirectory, which I've also tried before without any significant performance gain - deletion seems to have substantial overhead in itself. Perhaps it would be better if there were a Directory.deleteAllDocuments() or similar. Did you have some other approach in mind? As I said, Lucene's design doesn't seem to fit this streaming use case pattern well. In *this* scenario one could easily do without any locking, and without byte level organization in RAMDirectory and RAMFile, etc. because a single small string isn't a large persistent multi-document index.
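The streaming contract described above - "repeat N times: add a single string to an empty index, query it, drop it" - can be sketched as one interface plus the N-times driver loop. All names here are illustrative, not Lucene API; the Query argument is reduced to a String and the toy matcher is a substring check standing in for real scoring.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical contract for the streaming use case: one transient
 *  single-string "index" per call, queried once and then discarded. */
interface StringMatcher {
    /** Returns a relevance score > 0 if the text matches the query. */
    float match(String text, String query);
}

public class StreamingDriver {
    /** The N-times loop from the thread: per-call overhead dominates,
     *  since N can be millions or billions and no index survives. */
    static int countMatches(List<String> stream, String query, StringMatcher matcher) {
        int hits = 0;
        for (String text : stream) {
            if (matcher.match(text, query) > 0f) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Toy matcher: substring containment as a stand-in for real scoring.
        StringMatcher naive = (text, q) -> text.contains(q) ? 1f : 0f;
        int n = countMatches(Arrays.asList("big fish", "small dog", "fish soup"),
                             "fish", naive);
        System.out.println(n); // 2
    }
}
```

This shape makes explicit why per-iteration costs such as locking, deletion, or RAMDirectory bookkeeping matter so much: they are paid N times, once per string.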
Re: [Performance] Streaming main memory indexing of single strings
Thanks for pointing this out! The overhead wasn't substantial, it turns out. On closer inspection/profiling there's more substantial (hidden) baggage when subclassing IndexWriter. Would not subclassing IndexWriter have similarly unexpected negative consequences? Thanks, Wolfgang. On Apr 15, 2005, at 4:13 PM, Robert Engels wrote: You cannot do this as easily as it sounds. As I've pointed out on this list before, there are a multitude of places in the Query handling that assume an IndexReader is available. The Searcher interface doesn't really buy you much because the only available implementation is IndexSearcher, and that assumes an IndexReader implementation is available. To provide a pure implementation of Searcher you would need to reimplement all of Lucene. Trust me, stick with IndexReader subclassing (ideally, IndexReader would be an interface, and the core would be moved into AbstractIndexReader so projects like this would be much easier). Robert -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Friday, April 15, 2005 5:58 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings A primary reason for the tight API proposed below is that it allows for maximum efficiency (which is the point of the exercise in the first place): - One can extend Searcher rather than IndexReader: There's no need to subclass IndexReader and carry the baggage of the superclass doing locking and all sorts of unnecessary stuff with its internal RAMDirectory. - Even more extreme: Don't extend Searcher but implement the functionality directly using low-level APIs. This avoids unnecessary baggage for collecting hits, etc. Wolfgang. On Apr 15, 2005, at 3:15 PM, Wolfgang Hoschek wrote: Cool! For my use case it would need to be able to handle arbitrary queries (previously parsed from a general lucene query string). 
Something like: float match(String Text, Query query) it's fine with me if it also works for float[] match(String[] texts, Query query) or float(Document doc, Query query) but that isn't required by the use case. Wolfgang. I am intrigued by this and decided to mock a quick and dirty example of such an IndexReader. After a little trial-and-error I got it working at least for TermQuery and WildcardQuery. I've pasted my code below as an example, but there is much room for improvement, especially in terms of performance and also in keeping track of term frequency, and also it would be nicer if it handled the analysis internally. I think something like this would make a handy addition to our contrib area at least. I'd be happy to receive improvements to this and then add it to a contrib subproject. Perhaps this would be a handy way to handle situations where users have queries saved in a system and need to be alerted whenever a new document arrives matching the saved queries? Erik -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings This seems to be a promising avenue worth exploring. My gutfeeling is that this could easily be 10-100 times faster. The drawback is that it requires a fair amount of understanding of intricate Lucene internals, pulling those pieces together and adapting them as required for the seemingly simple "float match(String text, Query query)". I might give it a shot but I'm not sure I'll be able to pull this off! Is there any similar code I could look at as a starting point? Wolfgang. On Apr 14, 2005, at 1:13 PM, Robert Engels wrote: I think you are not approaching this the correct way. Pseudo code: Subclass IndexReader. Get tokens from String 'document' using Lucene analyzers. Build simple hash-map based data structures using tokens for terms, and term positions. 
reimplement termDocs() and termPositions() to use the structures from above. run searches. start again with next document. -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 2:56 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings Otis, this might be a misunderstanding. - I'm not calling optimize(). That piece is commented out if you look again at the code. - The *streaming* use case requires that for each query I add one (and only one) document (aka string) to an empty index: repeat N times (where N is millions or billions): add a single string (aka document) to an empty index query the index drop index (or delete its document) with the following API being called N times: float match(String text, Query query) So there's no possibility of adding many documents and thereafter running the query. This in turn seems to mean that the IndexWriter can't be kept open.
Re: [Performance] Streaming main memory indexing of single strings
On Apr 15, 2005, at 4:15 PM, Doug Cutting wrote: Wolfgang Hoschek wrote: The classic fuzzy fulltext search and similarity matching that Lucene is good for :-) So you need a score that can be compared to other matches? This will be based on nothing but term frequency, which a regex can compute. With a single document there'll be no IDFs, so you could simply sum sqrt() of term regex match counts, and divide by the sqrt of the length of the string. Is there a function f that can translate any lucene query (with all its syntax and fuzzy features) to a regex? E.g. how to translate StandardAnalyzer or stemming into a regex? If so, yes, but that seems unlikely, no? My particular interest is to use XQuery for *precisely* locating information subsets in networked XML messages, and then to use Lucene's fulltext functionality for *fuzzy* searches within such a precise subset. Messages are classified and routed/forwarded accordingly. See http://dsd.lbl.gov/nux/ for background. [BTW, XQuery already has regexes built-in]. Yes, I'm playing devil's advocate... Always a good thing to check assumptions :-) - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: [Performance] Streaming main memory indexing of single strings
I think one of the advantages may be the analyzers and processors that are already available for several document types. Using regex with these is nearly impossible. -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Friday, April 15, 2005 6:16 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings Wolfgang Hoschek wrote: > The classic fuzzy fulltext search and similarity matching that Lucene is > good for :-) So you need a score that can be compared to other matches? This will be based on nothing but term frequency, which a regex can compute. With a single document there'll be no IDFs, so you could simply sum sqrt() of term regex match counts, and divide by the sqrt of the length of the string. Yes, I'm playing devil's advocate... Doug
Re: [Performance] Streaming main memory indexing of single strings
Wolfgang Hoschek wrote: The classic fuzzy fulltext search and similarity matching that Lucene is good for :-) So you need a score that can be compared to other matches? This will be based on nothing but term frequency, which a regex can compute. With a single document there'll be no IDFs, so you could simply sum sqrt() of term regex match counts, and divide by the sqrt of the length of the string. Yes, I'm playing devil's advocate... Doug
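Doug's suggested single-document score - sum the sqrt() of each term-regex match count, then divide by the sqrt of the string's length - can be sketched in a few lines of plain Java. This is an illustration of his formula under one assumption of mine: "length of the string" is taken as length in whitespace tokens rather than characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of Doug's no-IDF, term-frequency-only scoring idea. */
public class RegexScore {
    static double score(String text, String... termRegexes) {
        double sum = 0;
        for (String re : termRegexes) {
            Matcher m = Pattern.compile(re).matcher(text);
            int count = 0;
            while (m.find()) {
                count++;                 // raw term frequency for this term
            }
            sum += Math.sqrt(count);     // sqrt damps repeated occurrences
        }
        int length = text.split("\\s+").length; // length norm (tokens, assumed)
        return sum / Math.sqrt(length);
    }

    public static void main(String[] args) {
        // Two terms against a six-token document: sqrt(2) + sqrt(1), over sqrt(6).
        System.out.println(score("the fish ate the other fish", "fish", "ate"));
    }
}
```

The thread's counterpoint stands, though: this reproduces only term-frequency scoring, not analyzers, stemming, or fuzzy operators, which is exactly why a regex cannot substitute for an arbitrary Lucene Query.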
RE: [Performance] Streaming main memory indexing of single strings
You cannot do this as easily as it sounds. As I've pointed out on this list before, there are a multitude of places in the Query handling that assume an IndexReader is available. The Searcher interface doesn't really buy you much because the only available implementations are IndexSearcher, and that assumes an IndexReader implementation is available. To provide a pure implementation of Searcher you would need to reimplement all of Lucene. Trust me, stick with IndexReader subclassing (ideally, IndexReader would be an interface, and the core would be moved into AbstractIndexReader so projects like this would be much easier). Robert -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Friday, April 15, 2005 5:58 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings A primary reason for the tight API proposed below is that it allows for maximum efficiency (which is the point of the exercise in the first place): - One can extend Searcher rather than IndexReader: There's no need to subclass IndexReader and carry the baggage of the superclass doing locking and all sort of unnecessary stuff with its internal RAMDirectory. - Even more extreme: Don't extend Searcher but implement the functionality directly using low-level APIs. This avoids unnecessary baggage for collecting hits, etc. Wolfgang. On Apr 15, 2005, at 3:15 PM, Wolfgang Hoschek wrote: > Cool! For my use case it would need to be able to handle arbitrary > queries (previously parsed from a general lucene query string). > Something like: > > float match(String Text, Query query) > > it's fine with me if it also works for > > float[] match(String[] texts, Query query) or > float(Document doc, Query query) > > but that isn't required by the use case. > > Wolfgang. > >> I am intrigued by this and decided to mock a quick and dirty example >> of such an IndexReader. 
After a little trial-and-error I got it >> working at least for TermQuery and WildcardQuery. I've pasted my >> code below as an example, but there is much room for improvement, >> especially in terms of performance and also in keeping track of term >> frequency, and also it would be nicer if it handled the analysis >> internally. >> >> I think something like this would make a handy addition to our >> contrib area at least. I'd be happy to receive improvements to this >> and then add it to a contrib subproject. >> >> Perhaps this would be a handy way to handle situations where users >> have queries saved in a system and need to be alerted whenever a new >> document arrives matching the saved queries? >> >> Erik >> >> >> >>> >>> >>> -----Original Message- >>> From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] >>> Sent: Thursday, April 14, 2005 4:04 PM >>> To: java-dev@lucene.apache.org >>> Subject: Re: [Performance] Streaming main memory indexing of single >>> strings >>> >>> >>> This seems to be a promising avenue worth exploring. My gutfeeling is >>> that this could easily be 10-100 times faster. >>> >>> The drawback is that it requires a fair amount of understanding of >>> intricate Lucene internals, pulling those pieces together and >>> adapting >>> them as required for the seemingly simple "float match(String text, >>> Query query)". >>> >>> I might give it a shot but I'm not sure I'll be able to pull this >>> off! >>> Is there any similar code I could look at as a starting point? >>> >>> Wolfgang. >>> >>> On Apr 14, 2005, at 1:13 PM, Robert Engels wrote: >>> >>>> I think you are not approaching this the correct way. >>>> >>>> Pseudo code: >>>> >>>> Subclass IndexReader. >>>> >>>> Get tokens from String 'document' using Lucene analyzers. >>>> >>>> Build simple hash-map based data structures using tokens for terms, >>>> and term >>>> positions. >>>> >>>> reimplement termDocs() and termPositions() to use the structures >>>> from >>>> above. >>>> >>>> run searches. 
>>>> >>>> start again with next document. >>>> >>>> >>>> >>>> -Original Message- >>>> From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] >>>> Sent: Thursday, April 14, 2005 2:56 PM >>>> To: java-dev@lucene.apache.org >>>>
Re: [Performance] Streaming main memory indexing of single strings
On Apr 15, 2005, at 4:00 PM, Doug Cutting wrote: Erik Hatcher wrote: I think something like this would make a handy addition to our contrib area at least. Perhaps. What use cases cannot be met by regular expression matching? Doug The classic fuzzy fulltext search and similarity matching that Lucene is good for :-) Wolfgang.
Re: [Performance] Streaming main memory indexing of single strings
Erik Hatcher wrote: I think something like this would make a handy addition to our contrib area at least. Perhaps. What use cases cannot be met by regular expression matching? Doug
Re: [Performance] Streaming main memory indexing of single strings
A primary reason for the tight API proposed below is that it allows for maximum efficiency (which is the point of the exercise in the first place): - One can extend Searcher rather than IndexReader: There's no need to subclass IndexReader and carry the baggage of the superclass doing locking and all sort of unnecessary stuff with its internal RAMDirectory. - Even more extreme: Don't extend Searcher but implement the functionality directly using low-level APIs. This avoids unnecessary baggage for collecting hits, etc. Wolfgang. On Apr 15, 2005, at 3:15 PM, Wolfgang Hoschek wrote: Cool! For my use case it would need to be able to handle arbitrary queries (previously parsed from a general lucene query string). Something like: float match(String Text, Query query) it's fine with me if it also works for float[] match(String[] texts, Query query) or float(Document doc, Query query) but that isn't required by the use case. Wolfgang. I am intrigued by this and decided to mock a quick and dirty example of such an IndexReader. After a little trial-and-error I got it working at least for TermQuery and WildcardQuery. I've pasted my code below as an example, but there is much room for improvement, especially in terms of performance and also in keeping track of term frequency, and also it would be nicer if it handled the analysis internally. I think something like this would make a handy addition to our contrib area at least. I'd be happy to receive improvements to this and then add it to a contrib subproject. Perhaps this would be a handy way to handle situations where users have queries saved in a system and need to be alerted whenever a new document arrives matching the saved queries? Erik -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings This seems to be a promising avenue worth exploring. 
My gutfeeling is that this could easily be 10-100 times faster. The drawback is that it requires a fair amount of understanding of intricate Lucene internals, pulling those pieces together and adapting them as required for the seemingly simple "float match(String text, Query query)". I might give it a shot but I'm not sure I'll be able to pull this off! Is there any similar code I could look at as a starting point? Wolfgang. On Apr 14, 2005, at 1:13 PM, Robert Engels wrote: I think you are not approaching this the correct way. Pseudo code: Subclass IndexReader. Get tokens from String 'document' using Lucene analyzers. Build simple hash-map based data structures using tokens for terms, and term positions. reimplement termDocs() and termPositions() to use the structures from above. run searches. start again with next document. -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 2:56 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings Otis, this might be a misunderstanding. - I'm not calling optimize(). That piece is commented out you if look again at the code. - The *streaming* use case requires that for each query I add one (and only one) document (aka string) to an empty index: repeat N times (where N is millions or billions): add a single string (aka document) to an empty index query the index drop index (or delete it's document) with the following API being called N times: float match(String text, Query query) So there's no possibility of adding many documents and thereafter running the query. This in turn seems to mean that the IndexWriter can't be kept open - unless I manually delete each document after each query to repeatedly reuse the RAMDirectory, which I've also tried before without any significant performance gain - deletion seems to have substantial overhead in itself. 
Perhaps it would be better if there were a Directory.deleteAllDocuments() or similar. Did you have some other approach in mind? As I said, Lucene's design doesn't seem to fit this streaming use case pattern well. In *this* scenario one could easily do without any locking, and without byte level organization in RAMDirectory and RAMFile, etc. because a single small string isn't a large persistent multi-document index. For some background, here's a small example for the kind of XQuery functionality Nux/Lucene integration enables: (: An XQuery that finds all books authored by James that have something to do with "fish", sorted by relevance :) declare namespace lucene = "java:nux.xom.xquery.XQueryUtil"; declare variable $query := "fish*~"; for $book in /books/book[author="James" and lucene:match(string(.), $query) > 0.0] let $score := lucene:match(string($book), $query) order by $score descending return ({$score}, $book)
Re: [Performance] Streaming main memory indexing of single strings
Cool! For my use case it would need to be able to handle arbitrary queries (previously parsed from a general lucene query string). Something like: float match(String Text, Query query) it's fine with me if it also works for float[] match(String[] texts, Query query) or float(Document doc, Query query) but that isn't required by the use case. Wolfgang. I am intrigued by this and decided to mock a quick and dirty example of such an IndexReader. After a little trial-and-error I got it working at least for TermQuery and WildcardQuery. I've pasted my code below as an example, but there is much room for improvement, especially in terms of performance and also in keeping track of term frequency, and also it would be nicer if it handled the analysis internally. I think something like this would make a handy addition to our contrib area at least. I'd be happy to receive improvements to this and then add it to a contrib subproject. Perhaps this would be a handy way to handle situations where users have queries saved in a system and need to be alerted whenever a new document arrives matching the saved queries? Erik -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings This seems to be a promising avenue worth exploring. My gutfeeling is that this could easily be 10-100 times faster. The drawback is that it requires a fair amount of understanding of intricate Lucene internals, pulling those pieces together and adapting them as required for the seemingly simple "float match(String text, Query query)". I might give it a shot but I'm not sure I'll be able to pull this off! Is there any similar code I could look at as a starting point? Wolfgang. On Apr 14, 2005, at 1:13 PM, Robert Engels wrote: I think you are not approaching this the correct way. Pseudo code: Subclass IndexReader. 
Get tokens from String 'document' using Lucene analyzers. Build simple hash-map based data structures using tokens for terms, and term positions. reimplement termDocs() and termPositions() to use the structures from above. run searches. start again with next document. -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 2:56 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings Otis, this might be a misunderstanding. - I'm not calling optimize(). That piece is commented out you if look again at the code. - The *streaming* use case requires that for each query I add one (and only one) document (aka string) to an empty index: repeat N times (where N is millions or billions): add a single string (aka document) to an empty index query the index drop index (or delete it's document) with the following API being called N times: float match(String text, Query query) So there's no possibility of adding many documents and thereafter running the query. This in turn seems to mean that the IndexWriter can't be kept open - unless I manually delete each document after each query to repeatedly reuse the RAMDirectory, which I've also tried before without any significant performance gain - deletion seems to have substantial overhead in itself. Perhaps it would be better if there were a Directory.deleteAllDocuments() or similar. Did you have some other approach in mind? As I said, Lucene's design doesn't seem to fit this streaming use case pattern well. In *this* scenario one could easily do without any locking, and without byte level organization in RAMDirectory and RAMFile, etc because a single small string isn't a large persistent multi-document index. 
For some background, here's a small example for the kind of XQuery functionality Nux/Lucene integration enables: (: An XQuery that finds all books authored by James that have something to do with "fish", sorted by relevance :) declare namespace lucene = "java:nux.xom.xquery.XQueryUtil"; declare variable $query := "fish*~"; for $book in /books/book[author="James" and lucene:match(string(.), $query) > 0.0] let $score := lucene:match(string($book), $query) order by $score descending return ({$score}, $book) More interestingly one can use this for classifying and routing XML messages based on rules (i.e. queries) inspecting their content... Any other clues about potential improvements would be greatly appreciated. Wolfgang. On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote: It looks like you are calling that IndexWriter code in some loops, opening it and closing it in every iteration of the loop and also calling optimize. All of those things could be improved. Keep your IndexWriter open, don't close it, and opt
Re: [Performance] Streaming main memory indexing of single strings
On Apr 14, 2005, at 5:11 PM, Robert Engels wrote: It is really not that involved. Just implement the abstract methods of IndexReader. And many can be no-op'ed because they will never be called in a "read only" situation. Methods related to normalization and such can also be no-op'ed because you are only dealing with a single document. I would think you will find this approach at least an order of magnitude faster, if not two.

I am intrigued by this and decided to mock up a quick-and-dirty example of such an IndexReader. After a little trial and error I got it working, at least for TermQuery and WildcardQuery. I've pasted my code below as an example, but there is much room for improvement, especially in performance and in keeping track of term frequency; it would also be nicer if it handled the analysis internally. I think something like this would make a handy addition to our contrib area at least. I'd be happy to receive improvements to this and then add it to a contrib subproject. Perhaps this would be a handy way to handle situations where users have queries saved in a system and need to be alerted whenever a new document arrives matching the saved queries?

Erik

-----Original Message-----
From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 14, 2005 4:04 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single strings

This seems to be a promising avenue worth exploring. My gut feeling is that this could easily be 10-100 times faster. The drawback is that it requires a fair amount of understanding of intricate Lucene internals, pulling those pieces together and adapting them as required for the seemingly simple "float match(String text, Query query)". I might give it a shot but I'm not sure I'll be able to pull this off! Is there any similar code I could look at as a starting point?

Wolfgang.

On Apr 14, 2005, at 1:13 PM, Robert Engels wrote: I think you are not approaching this the correct way.
Pseudo code:

Subclass IndexReader.

Get tokens from String 'document' using Lucene analyzers.

Build simple hash-map based data structures using tokens for terms, and term positions.

Reimplement termDocs() and termPositions() to use the structures from above.

Run searches.

Start again with next document.

-----Original Message-----
From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 14, 2005 2:56 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single strings

Otis, this might be a misunderstanding.

- I'm not calling optimize(). That piece is commented out if you look again at the code.

- The *streaming* use case requires that for each query I add one (and only one) document (aka string) to an empty index:

repeat N times (where N is millions or billions):
    add a single string (aka document) to an empty index
    query the index
    drop index (or delete its document)

with the following API being called N times: float match(String text, Query query)

So there's no possibility of adding many documents and thereafter running the query. This in turn seems to mean that the IndexWriter can't be kept open - unless I manually delete each document after each query to repeatedly reuse the RAMDirectory, which I've also tried before without any significant performance gain - deletion seems to have substantial overhead in itself. Perhaps it would be better if there were a Directory.deleteAllDocuments() or similar. Did you have some other approach in mind?

As I said, Lucene's design doesn't seem to fit this streaming use case pattern well. In *this* scenario one could easily do without any locking, and without byte level organization in RAMDirectory and RAMFile, etc., because a single small string isn't a large persistent multi-document index.

For some background, here's a small example of the kind of XQuery functionality Nux/Lucene integration enables:

(: An XQuery that finds all books authored by James that have something to do with "fish", sorted by relevance :)
declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
declare variable $query := "fish*~";

for $book in /books/book[author="James" and lucene:match(string(.), $query) > 0.0]
let $score := lucene:match(string($book), $query)
order by $score descending
return ({$score}, $book)

More interestingly, one can use this for classifying and routing XML messages based on rules (i.e. queries) inspecting their content...

Any other clues about potential improvements would be greatly appreciated.

Wolfgang.

On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote: It looks like you are calling that IndexWriter code in some loops, opening it and closing it in every iteration of the loop and also calling optimize. All of those things could be improved. Keep your IndexWriter open, don't close it, and optimize the index only once you are done adding documents to it.
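The first three steps of Robert's pseudo code (tokenize the string, then build hash-map based structures for terms and term positions) can be illustrated with a small self-contained sketch. The class and method names below are hypothetical, not from any posted code, and a trivial whitespace/lowercase tokenizer stands in for a real Lucene Analyzer:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy single-document "index": maps each term to the list of its token positions.
// A trivial whitespace/lowercase tokenizer stands in for a real Lucene Analyzer.
public final class SingleDocIndex {

    private final Map<String, List<Integer>> termPositions = new HashMap<String, List<Integer>>();
    private int numTokens = 0;

    public SingleDocIndex(String document) {
        for (String token : document.toLowerCase().split("\\s+")) {
            if (token.length() == 0) continue;
            List<Integer> positions = termPositions.get(token);
            if (positions == null) {
                positions = new ArrayList<Integer>();
                termPositions.put(token, positions);
            }
            positions.add(numTokens++); // record the position of this occurrence
        }
    }

    /** Term frequency within the single document (0 if absent). */
    public int freq(String term) {
        List<Integer> positions = termPositions.get(term);
        return positions == null ? 0 : positions.size();
    }

    /** Token positions at which the term occurs, in order. */
    public List<Integer> positions(String term) {
        List<Integer> positions = termPositions.get(term);
        return positions == null ? new ArrayList<Integer>() : positions;
    }

    public int numTokens() { return numTokens; }

    public static void main(String[] args) {
        SingleDocIndex idx = new SingleDocIndex("one fish two fish");
        // prints: 2 occurrences at [1, 3]
        System.out.println(idx.freq("fish") + " occurrences at " + idx.positions("fish"));
    }
}
```

Because the whole structure is rebuilt from scratch per string and thrown away afterwards, there is no locking, no byte-level Directory organization, and nothing persistent - which is exactly what the streaming use case wants.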
RE: [Performance] Streaming main memory indexing of single strings
It is really not that involved. Just implement the abstract methods of IndexReader. And many can be no-op'ed because they will never be called in a "read only" situation. Methods related to normalization and such can also be no-op'ed because you are only dealing with a single document. I would think you will find this approach at least an order of magnitude faster, if not two.

-----Original Message-----
From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 14, 2005 4:04 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single strings

This seems to be a promising avenue worth exploring. My gut feeling is that this could easily be 10-100 times faster. The drawback is that it requires a fair amount of understanding of intricate Lucene internals, pulling those pieces together and adapting them as required for the seemingly simple "float match(String text, Query query)". I might give it a shot but I'm not sure I'll be able to pull this off! Is there any similar code I could look at as a starting point?

Wolfgang.

On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:
> I think you are not approaching this the correct way.
>
> Pseudo code:
>
> Subclass IndexReader.
>
> Get tokens from String 'document' using Lucene analyzers.
>
> Build simple hash-map based data structures using tokens for terms,
> and term positions.
>
> reimplement termDocs() and termPositions() to use the structures from
> above.
>
> run searches.
>
> start again with next document.
>
> -----Original Message-----
> From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, April 14, 2005 2:56 PM
> To: java-dev@lucene.apache.org
> Subject: Re: [Performance] Streaming main memory indexing of single
> strings
>
> Otis, this might be a misunderstanding.
>
> - I'm not calling optimize(). That piece is commented out if you look
> again at the code.
> - The *streaming* use case requires that for each query I add one (and
> only one) document (aka string) to an empty index:
>
> repeat N times (where N is millions or billions):
>     add a single string (aka document) to an empty index
>     query the index
>     drop index (or delete its document)
>
> with the following API being called N times: float match(String text,
> Query query)
>
> So there's no possibility of adding many documents and thereafter
> running the query. This in turn seems to mean that the IndexWriter
> can't be kept open - unless I manually delete each document after each
> query to repeatedly reuse the RAMDirectory, which I've also tried
> before without any significant performance gain - deletion seems to
> have substantial overhead in itself. Perhaps it would be better if
> there were a Directory.deleteAllDocuments() or similar. Did you have
> some other approach in mind?
>
> As I said, Lucene's design doesn't seem to fit this streaming use case
> pattern well. In *this* scenario one could easily do without any
> locking, and without byte level organization in RAMDirectory and
> RAMFile, etc., because a single small string isn't a large persistent
> multi-document index.
>
> For some background, here's a small example of the kind of XQuery
> functionality Nux/Lucene integration enables:
>
> (: An XQuery that finds all books authored by James that have something
> to do with "fish", sorted by relevance :)
> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
> declare variable $query := "fish*~";
>
> for $book in /books/book[author="James" and lucene:match(string(.),
> $query) > 0.0]
> let $score := lucene:match(string($book), $query)
> order by $score descending
> return ({$score}, $book)
>
> More interestingly one can use this for classifying and routing XML
> messages based on rules (i.e. queries) inspecting their content...
>
> Any other clues about potential improvements would be greatly
> appreciated.
>
> Wolfgang.
>
> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>
>> It looks like you are calling that IndexWriter code in some loops,
>> opening it and closing it in every iteration of the loop and also
>> calling optimize. All of those things could be improved.
>> Keep your IndexWriter open, don't close it, and optimize the index
>> only once you are done adding documents to it.
>>
>> See the highlights and the snippets in the first hit:
>> http://www.lucenebook.com/search?query=when+to+optimize
>>
>> Otis
>>
>> --- Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
Re: [Performance] Streaming main memory indexing of single strings
This seems to be a promising avenue worth exploring. My gut feeling is that this could easily be 10-100 times faster. The drawback is that it requires a fair amount of understanding of intricate Lucene internals, pulling those pieces together and adapting them as required for the seemingly simple "float match(String text, Query query)". I might give it a shot but I'm not sure I'll be able to pull this off! Is there any similar code I could look at as a starting point?

Wolfgang.

On Apr 14, 2005, at 1:13 PM, Robert Engels wrote: I think you are not approaching this the correct way.

Pseudo code:

Subclass IndexReader.

Get tokens from String 'document' using Lucene analyzers.

Build simple hash-map based data structures using tokens for terms, and term positions.

Reimplement termDocs() and termPositions() to use the structures from above.

Run searches.

Start again with next document.

-----Original Message-----
From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 14, 2005 2:56 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single strings

Otis, this might be a misunderstanding.

- I'm not calling optimize(). That piece is commented out if you look again at the code.

- The *streaming* use case requires that for each query I add one (and only one) document (aka string) to an empty index:

repeat N times (where N is millions or billions):
    add a single string (aka document) to an empty index
    query the index
    drop index (or delete its document)

with the following API being called N times: float match(String text, Query query)

So there's no possibility of adding many documents and thereafter running the query. This in turn seems to mean that the IndexWriter can't be kept open - unless I manually delete each document after each query to repeatedly reuse the RAMDirectory, which I've also tried before without any significant performance gain - deletion seems to have substantial overhead in itself.
Perhaps it would be better if there were a Directory.deleteAllDocuments() or similar. Did you have some other approach in mind?

As I said, Lucene's design doesn't seem to fit this streaming use case pattern well. In *this* scenario one could easily do without any locking, and without byte level organization in RAMDirectory and RAMFile, etc., because a single small string isn't a large persistent multi-document index.

For some background, here's a small example of the kind of XQuery functionality Nux/Lucene integration enables:

(: An XQuery that finds all books authored by James that have something to do with "fish", sorted by relevance :)
declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
declare variable $query := "fish*~";

for $book in /books/book[author="James" and lucene:match(string(.), $query) > 0.0]
let $score := lucene:match(string($book), $query)
order by $score descending
return ({$score}, $book)

More interestingly, one can use this for classifying and routing XML messages based on rules (i.e. queries) inspecting their content...

Any other clues about potential improvements would be greatly appreciated.

Wolfgang.

On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote: It looks like you are calling that IndexWriter code in some loops, opening it and closing it in every iteration of the loop and also calling optimize. All of those things could be improved. Keep your IndexWriter open, don't close it, and optimize the index only once you are done adding documents to it.

See the highlights and the snippets in the first hit: http://www.lucenebook.com/search?query=when+to+optimize

Otis

--- Wolfgang Hoschek <[EMAIL PROTECTED]> wrote: Hi, I'm wondering if anyone could let me know how to improve Lucene performance for "streaming main memory indexing of single strings". This would help to effectively integrate Lucene with the Nux XQuery engine.
Below is a small microbenchmark simulating STREAMING XQuery fulltext search as typical for XML network routers, message queuing systems, P2P networks, etc. In this on-the-fly main memory indexing scenario, each individual string is immediately matched as soon as it becomes available, without any persistence involved. This usage scenario and its corresponding performance profile are quite different from fulltext search over persistent (read-mostly) indexes.

The benchmark runs at some 3000 Lucene queries/sec (lucene-1.4.3), which is unfortunate news considering the XQuery engine can easily walk hundreds of thousands of XML nodes per second. Ideally I'd like to run at some 10 queries/sec. Running this through the JDK 1.5 profiler, it seems that most time is spent in and below the following calls:

writer = new IndexWriter(dir, analyzer, true);
writer.addDocument(...);
writer.close();

I tried quite a few variants of the benchmark
RE: [Performance] Streaming main memory indexing of single strings
I think you are not approaching this the correct way.

Pseudo code:

Subclass IndexReader.

Get tokens from String 'document' using Lucene analyzers.

Build simple hash-map based data structures using tokens for terms, and term positions.

Reimplement termDocs() and termPositions() to use the structures from above.

Run searches.

Start again with next document.

-----Original Message-----
From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 14, 2005 2:56 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single strings

Otis, this might be a misunderstanding.

- I'm not calling optimize(). That piece is commented out if you look again at the code.

- The *streaming* use case requires that for each query I add one (and only one) document (aka string) to an empty index:

repeat N times (where N is millions or billions):
    add a single string (aka document) to an empty index
    query the index
    drop index (or delete its document)

with the following API being called N times: float match(String text, Query query)

So there's no possibility of adding many documents and thereafter running the query. This in turn seems to mean that the IndexWriter can't be kept open - unless I manually delete each document after each query to repeatedly reuse the RAMDirectory, which I've also tried before without any significant performance gain - deletion seems to have substantial overhead in itself. Perhaps it would be better if there were a Directory.deleteAllDocuments() or similar. Did you have some other approach in mind?

As I said, Lucene's design doesn't seem to fit this streaming use case pattern well. In *this* scenario one could easily do without any locking, and without byte level organization in RAMDirectory and RAMFile, etc., because a single small string isn't a large persistent multi-document index.
For some background, here's a small example of the kind of XQuery functionality Nux/Lucene integration enables:

(: An XQuery that finds all books authored by James that have something to do with "fish", sorted by relevance :)
declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
declare variable $query := "fish*~";

for $book in /books/book[author="James" and lucene:match(string(.), $query) > 0.0]
let $score := lucene:match(string($book), $query)
order by $score descending
return ({$score}, $book)

More interestingly, one can use this for classifying and routing XML messages based on rules (i.e. queries) inspecting their content...

Any other clues about potential improvements would be greatly appreciated.

Wolfgang.

On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
> It looks like you are calling that IndexWriter code in some loops,
> opening it and closing it in every iteration of the loop and also
> calling optimize. All of those things could be improved.
> Keep your IndexWriter open, don't close it, and optimize the index only
> once you are done adding documents to it.
>
> See the highlights and the snippets in the first hit:
> http://www.lucenebook.com/search?query=when+to+optimize
>
> Otis
>
> --- Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I'm wondering if anyone could let me know how to improve Lucene
>> performance for "streaming main memory indexing of single strings".
>> This would help to effectively integrate Lucene with the Nux XQuery
>> engine.
>>
>> Below is a small microbenchmark simulating STREAMING XQuery fulltext
>> search as typical for XML network routers, message queuing systems,
>> P2P networks, etc. In this on-the-fly main memory indexing scenario,
>> each individual string is immediately matched as soon as it becomes
>> available without any persistence involved.
>> This usage scenario and
>> corresponding performance profile is quite different in comparison to
>> fulltext search over persistent (read-mostly) indexes.
>>
>> The benchmark runs at some 3000 lucene queries/sec (lucene-1.4.3)
>> which is unfortunate news considering the XQuery engine can easily
>> walk hundreds of thousands of XML nodes per second. Ideally I'd like
>> to run at some 10 queries/sec. Running this through the JDK 1.5
>> profiler it seems that most time is spent in and below the following
>> calls:
>>
>> writer = new IndexWriter(dir, analyzer, true);
>> writer.addDocument(...);
>> writer.close();
>>
>> I tried quite a few variants of the benchmark with various options,
>> unfortunately with little or no effect.
>> Lucene just does not seem to be designed to do this sort of "transient
>> single string index" thing.
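Reimplementing termDocs()/termPositions() over such hash-map structures essentially means exposing a cursor per term. The following self-contained sketch is loosely modeled on the shape of Lucene's TermPositions contract (next()/doc()/freq()/nextPosition()); it is illustrative only, not actual Lucene code, and the class name is hypothetical:

```java
import java.util.Arrays;
import java.util.List;

// Cursor over a single document's occurrences of one term, loosely modeled on
// Lucene's TermPositions contract: next() advances to the (only) matching doc,
// then freq() and nextPosition() enumerate occurrences. Not actual Lucene code.
public final class SingleDocTermPositions {

    private final List<Integer> positions; // precomputed positions of the term
    private boolean consumed = false;      // only one document, so next() succeeds at most once
    private int posIndex = 0;

    public SingleDocTermPositions(List<Integer> positions) {
        this.positions = positions;
    }

    /** Advances to the next matching document; true at most once, and only if the term occurs. */
    public boolean next() {
        if (consumed || positions.isEmpty()) return false;
        consumed = true;
        return true;
    }

    public int doc() { return 0; }                 // the single document always has id 0
    public int freq() { return positions.size(); } // occurrences within the document

    /** Returns the next token position of the term; call freq() times after next(). */
    public int nextPosition() {
        return positions.get(posIndex++);
    }

    public static void main(String[] args) {
        // suppose "fish" occurs at token positions 2 and 5 in some analyzed document
        SingleDocTermPositions tp = new SingleDocTermPositions(Arrays.asList(2, 5));
        if (tp.next()) {
            for (int i = 0; i < tp.freq(); i++) {
                System.out.println("doc " + tp.doc() + " position " + tp.nextPosition());
            }
        }
    }
}
```

Because there is exactly one document, most of IndexReader's other methods (deletions, multi-document enumeration, directory access) can indeed be no-op'ed, which is what makes Robert's estimate of an order-of-magnitude speedup plausible.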
Re: [Performance] Streaming main memory indexing of single strings
Otis, this might be a misunderstanding.

- I'm not calling optimize(). That piece is commented out if you look again at the code.

- The *streaming* use case requires that for each query I add one (and only one) document (aka string) to an empty index:

repeat N times (where N is millions or billions):
    add a single string (aka document) to an empty index
    query the index
    drop index (or delete its document)

with the following API being called N times: float match(String text, Query query)

So there's no possibility of adding many documents and thereafter running the query. This in turn seems to mean that the IndexWriter can't be kept open - unless I manually delete each document after each query to repeatedly reuse the RAMDirectory, which I've also tried before without any significant performance gain - deletion seems to have substantial overhead in itself. Perhaps it would be better if there were a Directory.deleteAllDocuments() or similar. Did you have some other approach in mind?

As I said, Lucene's design doesn't seem to fit this streaming use case pattern well. In *this* scenario one could easily do without any locking, and without byte level organization in RAMDirectory and RAMFile, etc., because a single small string isn't a large persistent multi-document index.

For some background, here's a small example of the kind of XQuery functionality Nux/Lucene integration enables:

(: An XQuery that finds all books authored by James that have something to do with "fish", sorted by relevance :)
declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
declare variable $query := "fish*~";

for $book in /books/book[author="James" and lucene:match(string(.), $query) > 0.0]
let $score := lucene:match(string($book), $query)
order by $score descending
return ({$score}, $book)

More interestingly, one can use this for classifying and routing XML messages based on rules (i.e. queries) inspecting their content...
Any other clues about potential improvements would be greatly appreciated.

Wolfgang.

On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote: It looks like you are calling that IndexWriter code in some loops, opening it and closing it in every iteration of the loop and also calling optimize. All of those things could be improved. Keep your IndexWriter open, don't close it, and optimize the index only once you are done adding documents to it.

See the highlights and the snippets in the first hit: http://www.lucenebook.com/search?query=when+to+optimize

Otis

--- Wolfgang Hoschek <[EMAIL PROTECTED]> wrote: Hi, I'm wondering if anyone could let me know how to improve Lucene performance for "streaming main memory indexing of single strings". This would help to effectively integrate Lucene with the Nux XQuery engine.

Below is a small microbenchmark simulating STREAMING XQuery fulltext search as typical for XML network routers, message queuing systems, P2P networks, etc. In this on-the-fly main memory indexing scenario, each individual string is immediately matched as soon as it becomes available, without any persistence involved. This usage scenario and corresponding performance profile is quite different in comparison to fulltext search over persistent (read-mostly) indexes.

The benchmark runs at some 3000 Lucene queries/sec (lucene-1.4.3), which is unfortunate news considering the XQuery engine can easily walk hundreds of thousands of XML nodes per second. Ideally I'd like to run at some 10 queries/sec. Running this through the JDK 1.5 profiler it seems that most time is spent in and below the following calls:

writer = new IndexWriter(dir, analyzer, true);
writer.addDocument(...);
writer.close();

I tried quite a few variants of the benchmark with various options, unfortunately with little or no effect. Lucene just does not seem to be designed to do this sort of "transient single string index" thing.
All code paths related to opening, closing, reading, writing, querying and object creation seem to be designed for large persistent indexes. Any advice on what I'm missing or what could be done about it would be greatly appreciated.

Wolfgang.

P.S. the benchmark code is attached as a file below:

package nux.xom.pool;

import java.io.IOException;
//import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
//import org.apache.lucene.analysis.LowerCaseTokenizer;
//import org.apache.lucene.analysis.PorterStemFilter;
//import org.apache.lucene.analysis.SimpleAnalyzer;
//import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
//import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
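The streaming pattern described above - float match(String text, Query query) invoked once per incoming string - can be sketched as a driver loop. The StringMatcher interface and the toy substring scorer below are hypothetical stand-ins (the real version would parse and run a Lucene Query against a per-string in-memory index):

```java
// Driver for the streaming use case: N incoming strings, each matched once
// against a fixed query and then discarded. StringMatcher is a hypothetical
// stand-in for "float match(String text, Query query)"; the toy scorer just
// checks for a literal term in place of a real Lucene query.
public final class StreamingMatchDriver {

    interface StringMatcher {
        float match(String text, String query); // query as a String for simplicity
    }

    // toy relevance: 1.0 if the term occurs, else 0.0
    static float toyMatch(String text, String query) {
        return text.toLowerCase().contains(query.toLowerCase()) ? 1.0f : 0.0f;
    }

    public static void main(String[] args) {
        StringMatcher matcher = new StringMatcher() {
            public float match(String text, String query) {
                return toyMatch(text, query);
            }
        };

        String[] incoming = { "one fish two fish", "red bird", "a fish called Wanda" };
        int matches = 0;
        for (String message : incoming) {                         // in reality N is millions or billions
            if (matcher.match(message, "fish") > 0.0f) matches++; // e.g. route/classify the XML message
        }
        System.out.println(matches + " of " + incoming.length + " messages matched"); // 2 of 3
    }
}
```

The key property is on display even in this toy form: all per-string state lives inside one match() call, so nothing (writer, directory, lock) needs to survive from one iteration to the next.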
Re: [Performance] Streaming main memory indexing of single strings
It looks like you are calling that IndexWriter code in some loops, opening it and closing it in every iteration of the loop and also calling optimize. All of those things could be improved. Keep your IndexWriter open, don't close it, and optimize the index only once you are done adding documents to it.

See the highlights and the snippets in the first hit: http://www.lucenebook.com/search?query=when+to+optimize

Otis

--- Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm wondering if anyone could let me know how to improve Lucene
> performance for "streaming main memory indexing of single strings".
> This would help to effectively integrate Lucene with the Nux XQuery
> engine.
>
> Below is a small microbenchmark simulating STREAMING XQuery fulltext
> search as typical for XML network routers, message queuing systems,
> P2P networks, etc. In this on-the-fly main memory indexing scenario,
> each individual string is immediately matched as soon as it becomes
> available without any persistence involved. This usage scenario and
> corresponding performance profile is quite different in comparison to
> fulltext search over persistent (read-mostly) indexes.
>
> The benchmark runs at some 3000 lucene queries/sec (lucene-1.4.3)
> which is unfortunate news considering the XQuery engine can easily
> walk hundreds of thousands of XML nodes per second. Ideally I'd like
> to run at some 10 queries/sec. Running this through the JDK 1.5
> profiler it seems that most time is spent in and below the following
> calls:
>
> writer = new IndexWriter(dir, analyzer, true);
> writer.addDocument(...);
> writer.close();
>
> I tried quite a few variants of the benchmark with various options,
> unfortunately with little or no effect.
> Lucene just does not seem to be designed to do this sort of "transient
> single string index" thing. All code paths related to opening,
> closing, reading, writing, querying and object creation seem to be
> designed for large persistent indexes.
>
> Any advice on what I'm missing or what could be done about it would
> be greatly appreciated.
>
> Wolfgang.
>
> P.S. the benchmark code is attached as a file below:
>
> package nux.xom.pool;
>
> import java.io.IOException;
> //import java.io.Reader;
>
> import org.apache.lucene.analysis.Analyzer;
> //import org.apache.lucene.analysis.LowerCaseTokenizer;
> //import org.apache.lucene.analysis.PorterStemFilter;
> //import org.apache.lucene.analysis.SimpleAnalyzer;
> //import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> //import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.queryParser.ParseException;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.Searcher;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
>
> public final class LuceneMatcher { // TODO: make non-public
>
>     private final Analyzer analyzer;
>     //private final Directory dir = new RAMDirectory();
>
>     public LuceneMatcher() {
>         this(new StandardAnalyzer());
>         //this(new SimpleAnalyzer());
>         //this(new StopAnalyzer());
>         //this(new Analyzer() {
>         //    public final TokenStream tokenStream(String fieldName, Reader reader) {
>         //        return new PorterStemFilter(new LowerCaseTokenizer(reader));
>         //    }
>         //});
>     }
>
>     public LuceneMatcher(Analyzer analyzer) {
>         if (analyzer == null)
>             throw new IllegalArgumentException("analyzer must not be null");
>         this.analyzer = analyzer;
>     }
>
>     public Query parseQuery(String expression) throws ParseException {
>         QueryParser parser = new QueryParser("content", analyzer);
>         //parser.setPhraseSlop(0);
>         return parser.parse(expression);
>     }
>
>     /**
>      * Returns the relevance score by matching the given index against
>      * the given Lucene query expression. The index must not contain
>      * more than one Lucene "document" (aka string to be searched).
>      */
>     public float match(Directory index, Query query) {
>         Searcher searcher = null;
>         try {
>             searcher = new IndexSearcher(index);
>             Hits hits = searcher.search(query);
>             float score = hits.length() > 0 ? hits.score(0) : 0.0f;
>             return score;
>         } catch (IOException e) { //
[Performance] Streaming main memory indexing of single strings
Hi,

I'm wondering if anyone could let me know how to improve Lucene performance for "streaming main memory indexing of single strings". This would help to effectively integrate Lucene with the Nux XQuery engine.

Below is a small microbenchmark simulating STREAMING XQuery fulltext search as typical for XML network routers, message queuing systems, P2P networks, etc. In this on-the-fly main memory indexing scenario, each individual string is immediately matched as soon as it becomes available, without any persistence involved. This usage scenario and corresponding performance profile is quite different in comparison to fulltext search over persistent (read-mostly) indexes.

The benchmark runs at some 3000 Lucene queries/sec (lucene-1.4.3), which is unfortunate news considering the XQuery engine can easily walk hundreds of thousands of XML nodes per second. Ideally I'd like to run at some 10 queries/sec. Running this through the JDK 1.5 profiler, it seems that most time is spent in and below the following calls:

writer = new IndexWriter(dir, analyzer, true);
writer.addDocument(...);
writer.close();

I tried quite a few variants of the benchmark with various options, unfortunately with little or no effect. Lucene just does not seem to be designed to do this sort of "transient single string index" thing. All code paths related to opening, closing, reading, writing, querying and object creation seem to be designed for large persistent indexes.

Any advice on what I'm missing or what could be done about it would be greatly appreciated.

Wolfgang.

P.S.
the benchmark code is attached as a file below:

package nux.xom.pool;

import java.io.IOException;
//import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
//import org.apache.lucene.analysis.LowerCaseTokenizer;
//import org.apache.lucene.analysis.PorterStemFilter;
//import org.apache.lucene.analysis.SimpleAnalyzer;
//import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
//import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public final class LuceneMatcher { // TODO: make non-public

    private final Analyzer analyzer;
    //private final Directory dir = new RAMDirectory();

    public LuceneMatcher() {
        this(new StandardAnalyzer());
        //this(new SimpleAnalyzer());
        //this(new StopAnalyzer());
        //this(new Analyzer() {
        //    public final TokenStream tokenStream(String fieldName, Reader reader) {
        //        return new PorterStemFilter(new LowerCaseTokenizer(reader));
        //    }
        //});
    }

    public LuceneMatcher(Analyzer analyzer) {
        if (analyzer == null)
            throw new IllegalArgumentException("analyzer must not be null");
        this.analyzer = analyzer;
    }

    public Query parseQuery(String expression) throws ParseException {
        QueryParser parser = new QueryParser("content", analyzer);
        //parser.setPhraseSlop(0);
        return parser.parse(expression);
    }

    /**
     * Returns the relevance score by matching the given index against the given
     * Lucene query expression. The index must not contain more than one Lucene
     * "document" (aka string to be searched).
     */
    public float match(Directory index, Query query) {
        Searcher searcher = null;
        try {
            searcher = new IndexSearcher(index);
            Hits hits = searcher.search(query);
            float score = hits.length() > 0 ? hits.score(0) : 0.0f;
            return score;
        } catch (IOException e) { // should never happen (RAMDirectory)
            throw new RuntimeException(e);
        } finally {
            try {
                if (searcher != null) searcher.close();
            } catch (IOException e) { // should never happen (RAMDirectory)
                throw new RuntimeException(e);
            }
        }
    }

    // public float match(String text, Query query) {
    //     return match(createIndex(text), query);
    // }

    public Directory createIndex(String text) {
        Directory dir =
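For intuition, in a stripped-down single-document setting the score that match() returns boils down to something like term frequency damped by document length. The toy scorer below is a stand-in illustration of that general shape only, not Lucene's actual Similarity math, and the class name is hypothetical:

```java
// Toy single-document scorer: term frequency damped by document length,
// roughly in the spirit of tf * lengthNorm. This is a stand-in illustration,
// not Lucene's actual Similarity computation.
public final class ToyScorer {

    /** Scores one term against one whitespace-tokenized document. */
    static float score(String document, String term) {
        String[] tokens = document.toLowerCase().split("\\s+");
        int tf = 0;
        for (String token : tokens) {
            if (token.equals(term.toLowerCase())) tf++;
        }
        if (tf == 0) return 0.0f;
        // damp by 1/sqrt(length), analogous in shape to a length norm
        return (float) (Math.sqrt(tf) / Math.sqrt(tokens.length));
    }

    public static void main(String[] args) {
        System.out.println(score("one fish two fish", "fish"));  // higher than...
        System.out.println(score("one fish two three", "fish")); // ...a single occurrence
        System.out.println(score("no match here", "fish"));      // no occurrence scores 0.0
    }
}
```

This is the kind of computation that a dedicated single-document reader can do directly from its hash-map structures, skipping the IndexWriter/RAMDirectory machinery that the profiler showed dominating the benchmark above.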
[Performance] Streaming main memory indexing of single strings
Hi,

I'm wondering if anyone could let me know how to improve Lucene performance for "streaming main memory indexing of single strings". This would help to effectively integrate Lucene with the Nux XQuery engine.

Below is a small microbenchmark simulating STREAMING XQuery fulltext search as is typical for XML network routers, message queuing systems, P2P networks, etc. In this on-the-fly main memory indexing scenario, each individual string is matched immediately as soon as it becomes available, without any persistence involved. This usage scenario and the corresponding performance profile are quite different from fulltext search over persistent (read-mostly) indexes.

The benchmark runs at some 3000 Lucene queries/sec (lucene-1.4.3), which is unfortunate news considering the XQuery engine can easily walk hundreds of thousands of XML nodes per second. Ideally I'd like to run at some 10 queries/sec.

Running this through the JDK 1.5 profiler, it seems that most time is spent in and below the following calls:

    writer = new IndexWriter(dir, analyzer, true);
    writer.addDocument(...);
    writer.close();

I tried quite a few variants of the benchmark with various options, unfortunately with little or no effect. Lucene just does not seem to be designed for this sort of "transient single string index" thing. All code paths related to opening, closing, reading, writing, querying and object creation seem to be designed for large persistent indexes.

Any advice on what I'm missing or what could be done about it would be greatly appreciated.

Wolfgang.

P.S.
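For reference, the per-string hot path the profiler flags above boils down to something like the following sketch (Lucene 1.4-era API; the field name "content" and the StandardAnalyzer are assumptions matching the attached benchmark — a fresh RAMDirectory and IndexWriter are created, used for exactly one document, and thrown away):

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class TransientIndexSketch {

    /** Builds a throwaway in-memory index holding a single string. */
    public static Directory indexSingleString(String text) throws IOException {
        Directory dir = new RAMDirectory();  // purely in-memory, discarded after one match
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        // the single "document" (aka string to be searched); not stored, only indexed
        doc.add(new Field("content", text, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);  // profiler hotspot
        writer.close();           // profiler hotspot (flush + segment bookkeeping)
        return dir;
    }
}
```

All three calls in the hot loop pay fixed costs (segment files, merge policy, directory locking) that were designed to be amortized over large persistent indexes, which is why they dominate when the index holds exactly one tiny document.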
the benchmark code is attached as a file below:

package nux.xom.pool;

import java.io.IOException;
//import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
//import org.apache.lucene.analysis.LowerCaseTokenizer;
//import org.apache.lucene.analysis.PorterStemFilter;
//import org.apache.lucene.analysis.SimpleAnalyzer;
//import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
//import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public final class LuceneMatcher { // TODO: make non-public

    private final Analyzer analyzer;
//  private final Directory dir = new RAMDirectory();

    public LuceneMatcher() {
        this(new StandardAnalyzer());
//      this(new SimpleAnalyzer());
//      this(new StopAnalyzer());
//      this(new Analyzer() {
//          public final TokenStream tokenStream(String fieldName, Reader reader) {
//              return new PorterStemFilter(new LowerCaseTokenizer(reader));
//          }
//      });
    }

    public LuceneMatcher(Analyzer analyzer) {
        if (analyzer == null)
            throw new IllegalArgumentException("analyzer must not be null");
        this.analyzer = analyzer;
    }

    public Query parseQuery(String expression) throws ParseException {
        QueryParser parser = new QueryParser("content", analyzer);
//      parser.setPhraseSlop(0);
        return parser.parse(expression);
    }

    /**
     * Returns the relevance score by matching the given index against the given
     * Lucene query expression. The index must not contain more than one Lucene
     * "document" (aka string to be searched).
     */
    public float match(Directory index, Query query) {
        Searcher searcher = null;
        try {
            searcher = new IndexSearcher(index);
            Hits hits = searcher.search(query);
            float score = hits.length() > 0 ? hits.score(0) : 0.0f;
            return score;
        } catch (IOException e) { // should never happen (RAMDirectory)
            throw new RuntimeException(e);
        } finally {
            try {
                if (searcher != null) searcher.close();
            } catch (IOException e) { // should never happen (RAMDirectory)
                throw new RuntimeException(e);
            }
        }
    }

//  public float match(String text, Query query) {
//      return match(createIndex(text), query);
//  }

    public Directory createIndex(String text) {
        Directory dir =
        // [attachment truncated here in the archive]
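For completeness, a hypothetical driver showing how LuceneMatcher is meant to be used per string, following the commented-out match(String, Query) convenience method above (the query expression and the input strings are made up; each string gets its own transient index):

```java
import org.apache.lucene.search.Query;

public class LuceneMatcherDriver {
    public static void main(String[] args) throws Exception {
        LuceneMatcher matcher = new LuceneMatcher();
        // parse the query once, reuse it against every incoming string
        Query query = matcher.parseQuery("hello AND world");

        // stands in for the stream of strings produced by the XQuery engine
        String[] stream = { "hello world", "goodbye world" };
        for (String text : stream) {
            // one throwaway in-memory index per string
            float score = matcher.match(matcher.createIndex(text), query);
            System.out.println(text + " -> " + score);
        }
    }
}
```

Note that the query is parsed exactly once; all the per-string cost sits in createIndex() and match(), which is where the profiler hotspots reported above show up.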