[jira] Created: (LUCENE-1572) luceneweb
luceneweb -- Key: LUCENE-1572 URL: https://issues.apache.org/jira/browse/LUCENE-1572 Project: Lucene - Java Issue Type: Bug Components: Examples Affects Versions: 2.4 Environment: Windows XP Reporter: kysnail Priority: Minor Lucene version: lucene-2.4.0. Following the reference doc, luceneweb cannot be run correctly. But with version lucene-1.4.3 it works properly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
Mark Miller markrmil...@gmail.com wrote: bq. I personally don't understand why MRHC was invented in the first place. The evolution of MRHC is in the comments of LUCENE-1483 - a lot of comments to wade through though. MRHC was created because simply adding setNextReader to HC would break back compat, because collect(...) is called on the un-rebased doc. Ie we need a new class so we can tell that it will handle re-basing the doc itself. Mike
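Mike's re-basing point can be sketched in plain Java. This is an illustrative toy, not the actual Lucene 2.9 source: the class shapes and signatures below are assumptions mimicking the thread's description. A multi-segment search calls setNextReader with each segment's docBase; a legacy HitCollector expects absolute doc IDs, so the wrapper's setNextReader is not empty: it records docBase so collect() can re-base.

```java
// Toy sketch of the re-basing idea from LUCENE-1483.
// Class shapes are hypothetical, not the real Lucene API.
abstract class HitCollector {
    // Legacy contract: doc is an absolute (index-wide) doc ID.
    public abstract void collect(int doc, float score);
}

abstract class MultiReaderHitCollector {
    // Called once per segment; docBase is the segment's offset
    // in the absolute doc ID space.
    public abstract void setNextReader(int docBase);
    // New contract: doc is segment-relative ("un-rebased").
    public abstract void collect(int doc, float score);
}

// The wrapper records docBase so that collect() can re-base the
// segment-relative doc ID on behalf of the legacy collector.
class LegacyHitCollectorWrapper extends MultiReaderHitCollector {
    private final HitCollector legacy;
    private int docBase;
    LegacyHitCollectorWrapper(HitCollector legacy) { this.legacy = legacy; }
    @Override public void setNextReader(int docBase) { this.docBase = docBase; }
    @Override public void collect(int doc, float score) {
        legacy.collect(docBase + doc, score);  // re-base to absolute ID
    }
}
```

This is why a new class (rather than adding setNextReader to HC) was needed: the caller must know whether the collector expects re-based or un-rebased doc IDs.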
Re: Is TopDocCollector's collect() implementation correct?
Shai Erera ser...@gmail.com wrote: The difference is for the new code, it's an upcast, which catches any errors at compile time, not run time. The compiler determines that the class implements the required interface. I still don't understand how a compiler can detect at compilation time that a HitCollector instance that is used in method A, and is casted to a TopDocsOutput instance by calling method B (from A) is actually ok ... I may be missing some Java information here, but I simply don't see how that happens at compilation time instead of at runtime ... I may have lost the context here... but here's what I thought we were talking about. If we choose the interface option (adding a ProvidesTopDocsResults (say) interface), then you would create method renderResults(ProvidesTopDocResults). Then, any collector implementing that interface could be safely passed in, as the upcast is done at compile time not runtime. So in fact the internal Lucene code expects only MRHC from a certain point, and so even if I wrote a HC and passed it on Searcher, it's still converted to MRHC, with an empty setNextReader() method implementation. That's why I meant that HC is already deprecated, whether it's marked as deprecated or not. The setNextReader() impl is not empty; it does the re-basing of docID on behalf of the HC. What you say about deprecating HC to me is unnecessary. Simply pull setNextReader up with an empty implementation, get rid of all the instanceof, casting and wrapping code and you're fine. Nothing is broken. All works well and better (instanceof, casting and wrapping have their overheads). Isn't that right? I think we need to deprecate HC, in favor of MRHC (or if we can think of a better name... ResultCollector?). Regarding interfaces .. I don't think they're that bad. Perhaps a different viewing angle might help. Lucene processes a query and fires events. 
Today it fires an event every time a doc's score has been computed, and recently every time it starts to process a different reader. HitCollector is a listener implementation on the doc-score event, while MRHC is a listener on both. To me, interfaces play just nicely here. Assume that you have the following interfaces: - DocScoreEvent - ChangeReaderEvent - EndProcessingEvent (fired when the query has finished processing - perhaps to aid collectors to free resources) - any other events you foresee? The Lucene code receives a HitCollector which listens on all events. In the future, Lucene might throw other events, but still get a HitCollector. Those methods will check for instanceof, and you as a user will know that if you want to catch those events, you pass in a collector implementation which does. Those events cannot of course be main-stream events (like DocScoreEvent), but new ones, perhaps even experimental. Since HitCollector is a concrete class, we can always add new interfaces to it in the future with empty implementations? I agree interfaces clearly have useful properties, but their Achilles heel for Lucene in all but the most trivial needs is the non-back-compatibility when you want to add a method to the interface. There have been a number of java-dev discussions on this problem. So, I think something like this: * Deprecate HitCollector in favor of MultiReaderHitCollector (any better name here?) abstract class. If you want to make a fully custom HitCollector, you subclass this class. Let's change MRHC's collect to take only an [unbased] docID, and expose a setScorer(Scorer) method. Then if the collector needs the score it can call Scorer.score(). * Subclass that to create an abstract tracks top N results collector (TopDocsCollector? TopHitsCollector?) 
* Subclass TopDocsCollector to a final, fast top N sorted by score collector (exists already: TopScoreDocCollector) * Subclass TopDocsCollector to a final, fast top N sorted by field collector (exists already: TopFieldCollector) * Subclass TopDocsCollector to a you provide your own pqueue and we collect top N according to it collector (does not yet exist -- name?). This is the way forward for existing subclasses of TopDocCollector. Shai do you want to take a first cut at making a patch? Can you open an issue? Thanks. Mike
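The proposed base-class shape in the bullets above might look roughly like the following. All names and signatures here are taken from the thread's sketch or are stand-ins; the Scorer stub in particular is a placeholder, not the real Lucene class:

```java
// Hypothetical sketch of the proposal: collect() takes only the
// un-based doc ID; a collector that wants scores asks the Scorer it
// was handed via setScorer().
interface Scorer {
    float score();
}

abstract class ResultCollector {
    public abstract void setScorer(Scorer scorer);  // once per segment
    public abstract void setNextReader(int docBase);
    public abstract void collect(int doc);          // segment-relative doc ID
}

// A collector that ignores scores never touches the Scorer, so it
// pays nothing for scoring -- one motivation for splitting score
// delivery out of collect().
class CountingCollector extends ResultCollector {
    int count;
    @Override public void setScorer(Scorer scorer) { /* scores not needed */ }
    @Override public void setNextReader(int docBase) { }
    @Override public void collect(int doc) { count++; }
}
```

The design choice this illustrates: moving the score out of collect()'s signature lets score-agnostic collectors skip score computation entirely.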
[jira] Updated: (LUCENE-1572) luceneweb
[ https://issues.apache.org/jira/browse/LUCENE-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1572: --- Fix Version/s: 2.9
Re: Is TopDocCollector's collect() implementation correct?
On Mar 26, 2009, at 6:55 AM, Michael McCandless wrote: I think we need to deprecate HC, in favor of MRHC (or if we can think of a better name... ResultCollector?). I like your suggestion for the name.
Re: Is TopDocCollector's collect() implementation correct?
You're right ... I missed that. My fault :) On Thu, Mar 26, 2009 at 12:18 PM, Michael McCandless luc...@mikemccandless.com wrote: Mark Miller markrmil...@gmail.com wrote: bq. I personally don't understand why MRHC was invented in the first place. The evolution of MRHC is in the comments of LUCENE-1483 - a lot of comments to wade through though. MRHC was created because simply adding setNextReader to HC would break back compat, because collect(...) is called on the un-rebased doc. Ie we need a new class so we can tell that it will handle re-basing the doc itself. Mike
Re: Is TopDocCollector's collect() implementation correct?
I'd say it is a bad name. A raw hit is way far from being the result of a search. If you're already breaking back compat with the 3.0 release (by incrementing the Java version), maybe it's worth breaking it in some more places, just so ugly names like MRHC and special code paths that check for n-year-old interfaces won't haunt us for the next century. On Thu, Mar 26, 2009 at 14:15, DM Smith dmsmith...@gmail.com wrote: On Mar 26, 2009, at 6:55 AM, Michael McCandless wrote: I think we need to deprecate HC, in favor of MRHC (or if we can think of a better name... ResultCollector?). I like your suggestion for the name. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
[jira] Created: (LUCENE-1573) IndexWriter does not do the right thing when a Thread is interrupt()'d
IndexWriter does not do the right thing when a Thread is interrupt()'d -- Key: LUCENE-1573 URL: https://issues.apache.org/jira/browse/LUCENE-1573 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1, 2.4, 2.3.2, 2.3.1, 2.3, 2.2, 2.1, 2.0.0, 1.9 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Spinoff from here: http://www.nabble.com/Deadlock-with-concurrent-merges-and-IndexWriter--Lucene-2.4--to22714290.html When a Thread is interrupt()'d while inside Lucene, there is a risk currently that it will cause a spinloop and starve BG merges from completing. Instead, when possible, we should allow interruption. But unfortunately for back-compat, we will need to wrap the exception in an unchecked version. In 3.0 we can change that to simply throw InterruptedException.
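The "wrap the exception in an unchecked version" back-compat approach the issue describes is a standard Java pattern. A minimal sketch, assuming a hypothetical exception class name (the actual class introduced by the patch may differ): catch InterruptedException, restore the thread's interrupt status, and rethrow unchecked so existing method signatures stay source-compatible.

```java
// Illustrative pattern only; the class name is an assumption, not
// necessarily what LUCENE-1573's patch uses.
class ThreadInterruptedException extends RuntimeException {
    ThreadInterruptedException(InterruptedException cause) { super(cause); }
}

class SleepUtil {
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            // Restore the interrupt flag so callers further up the
            // stack can still observe the interruption...
            Thread.currentThread().interrupt();
            // ...then rethrow unchecked to avoid changing the
            // (checked-exception-free) method signature.
            throw new ThreadInterruptedException(ie);
        }
    }
}
```

In 3.0, with back compat off the table, the wrapper goes away and the method can simply declare `throws InterruptedException`.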
Re: Is TopDocCollector's collect() implementation correct?
Earwin Burrfoot ear...@gmail.com wrote: I'd say it is a bad name. Raw hit is way far from being result of a search. First off, from Lucene's standpoint, the docID *is* the result of the search. Your application will do further things (load titles, do highlighting, etc.) with that result. Second off, since ResultCollector is an abstract base class, it would be subclassed to concrete versions that do more interesting things (call Scorer.score(), etc.) so as to make up what your application considers the result. So I understand your objection, but I still feel ResultCollector is an OK name. Or do you have an alternative? Mike
Re: Is TopDocCollector's collect() implementation correct?
I may have lost the context here... but here's what I thought we were talking about. If we choose the interface option (adding a ProvidesTopDocsResults (say) interface), then you would create method renderResults(ProvidesTopDocResults). Then, any collector implementing that interface could be safely passed in, as the upcast is done at compile time not runtime. Consider this code snippet: HitCollector hc = condition ? new TopDocCollector() : new TopFieldDocCollector(); searcher.search(hc); The problem is that I need a base class for both collectors. If I use the interface ProvidesTopDocsResults, then I cannot pass it to searcher, w/o casting to HitCollector. If I use HitCollector, then I need to cast it before passing it into renderResults(). Only when both classes have the same base class which is also a HitCollector do I not need any casting. I.e., after I submit a patch that develops what we've agreed on, the code can look like this: TopResultsCollector trc = condition ? new TopScoreDocCollector() : new TopFieldCollector(); searcher.search(trc); renderResults(trc); Here I can pass 'trc' to both methods since it is both a HitCollector and a TopResultsCollector. That's what I was missing in your proposal. Shai do you want to take a first cut at making a patch? Can you open an issue? Thanks. I can certainly do that. I think the summary of the steps makes sense. I'll check if TopScoreDocCollector and TopFieldCollector can also extend the you provide your own pqueue and we collect top N according to it collector, passing a null PQ and extending topDocs(). I also would like to propose an additional method to topDocs(), topDocs(int start, int howMany), which will be more efficient to call in case of paging through results. The reason is that topDocs() pops everything from the PQ, then allocates a ScoreDoc[] array sized to the number of results and returns a TopDocs object. You can then choose just the ones you need. 
On the other hand, topDocs(start, howMany) would allocate exactly the size of array needed. E.g., in case someone pages through results 91-100, you allocate an array of size 10, instead of 100. It is not a huge improvement, but it does save some allocations, and it's a convenient method. BTW, I like the name ResultsCollector, as it's just like HitCollector, but does not commit too much to hits .. i.e., facets aren't hits ... I think? Shai
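Shai's no-casts argument can be shown with a self-contained toy. All class names below are stand-ins mimicking the proposal (they are not the real Lucene classes, and search() here is a fake driver with no Query): because TopResultsCollector extends HitCollector, one variable satisfies both search() and renderResults() with no casts, which is exactly the property the interface-only option lacks.

```java
// Toy model of the proposed hierarchy (names mimic the thread, not Lucene).
abstract class HitCollector {
    public abstract void collect(int doc, float score);
}

abstract class TopResultsCollector extends HitCollector {
    public abstract int totalHits();  // the "results" side of the contract
}

class TopScoreDocCollector extends TopResultsCollector {
    private int hits;
    @Override public void collect(int doc, float score) { hits++; }
    @Override public int totalHits() { return hits; }
}

class Demo {
    // search() only needs the HitCollector contract...
    static void search(HitCollector hc) {
        for (int doc = 0; doc < 5; doc++) hc.collect(doc, 1.0f);
    }
    // ...while renderResults() needs the TopResultsCollector contract.
    static int renderResults(TopResultsCollector trc) {
        return trc.totalHits();
    }
}
```

Declaring `TopResultsCollector trc = new TopScoreDocCollector();` and passing `trc` to both `Demo.search(trc)` and `Demo.renderResults(trc)` compiles without any cast or instanceof check.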
Re: Is TopDocCollector's collect() implementation correct?
BTW Mike, I noticed that TopFieldDocCollector extends TopScoreDocCollector. That is a problem if we want to make TSDC final. Now, TFDC is marked deprecated, so it will be removed in the future. I think an easy fix is just to have TFDC extend TopResultsCollector, right?
Re: Is TopDocCollector's collect() implementation correct?
BTW, I like the name ResultsCollector, as it's just like HitCollector, but does not commit too much to hits .. i.e., facets aren't hits ... I think? What this class consumes and what it produces is a totally different thing. HitCollector always collects 'hits', and then produces whatever the implementor needs. For example mine collects hits, then collapses 1..N sequential hits into a 'metahit', calculates facets, sorts, takes the top and loads some fields. And another one simply counts the hits without doing anything else. But oh my, I'm not implementing anything like a void collect(Facet f) method. It's common sense to name consumer interfaces after what they consume, not what their implementations might do. Or do you have an alternative? HitCollector is absolutely cool with me. Okay, maybe DocCollector, or DocIdCollector. Since that is exactly what 'all' of its implementations do.
Re: Is TopDocCollector's collect() implementation correct?
On Thu, Mar 26, 2009 at 08:44:57AM -0400, Michael McCandless wrote: do you have an alternative? Brainstorming * Harvester * Trawler * HitPicker * HitGrabber Marvin Humphrey
Re: Is TopDocCollector's collect() implementation correct?
On Thu, Mar 26, 2009 at 08:44:57AM -0400, Michael McCandless wrote: do you have an alternative? Brainstorming * Harvester * Trawler * HitPicker * HitGrabber Marvin Humphrey NitPicker - that absolutely made my day
Re: Is TopDocCollector's collect() implementation correct?
I still think that ResultsCollector does what you describe. It simply collects results, while the word 'result' is quite *open* and does not commit to anything ... How about dropping the word Collector, since it might not collect anything, and just save the highest score, or compute some facets .. What about something with *Listener, like: DocIdListener, SearchListener, MatchListener (it listens on search matches). On Thu, Mar 26, 2009 at 3:48 PM, Marvin Humphrey mar...@rectangular.com wrote: On Thu, Mar 26, 2009 at 08:44:57AM -0400, Michael McCandless wrote: do you have an alternative? Brainstorming * Harvester * Trawler * HitPicker * HitGrabber Marvin Humphrey
Re: Is TopDocCollector's collect() implementation correct?
Shai Erera ser...@gmail.com wrote: I may have lost the context here... but here's what I thought we were talking about. If we choose the interface option (adding a ProvidesTopDocsResults (say) interface), then you would create method renderResults(ProvidesTopDocResults). Then, any collector implementing that interface could be safely passed in, as the upcast is done at compile time not runtime. Consider this code snippet: HitCollector hc = condition? new TopDocCollector() : TopFieldDocCollector(); searcher.search(hc); The problem is that I need a base class for both collectors. If I use the interface ProvidesTopDocsResults, then I cannot pass it to searcher, w/o casting to HitCollector. If I use HitCollector, then I need to cast it before passing it into rederResults(). Only when both class have the same base class which is also a HitCollector, I don't need any casting. I.e., after I submit a patch that develops what we've agreed on, the code can look like this: TopResultsCollector trc = condition ? new TopScoreDocCollector() : new TopFieldCollector(); searcher.search(trc); renderResults(trc); Here I can pass 'trc' to both methods since it both a HitCollector and a TopResultsCollector. That's what I was missing in your proposal. OK I agree. And with our proposed changes (TopResultsCollector), you can do this. Shai do you want to take a first cut at making a patch? Can you open an issue? Thanks. I can certainly do that. I think the summary of the steps make sense. I'll check if TopScoreDocCollector and TopFieldCollector can also extend that you provide your own pqueue and we collect top N according to it collector, passing a null PQ and extending topDocs(). OK, thanks. I also would like to propose an additional method to topDocs(), topDocs(int start, int howMany) which will be more efficient to call in case of paging through results. 
The reason is that topDocs() pops everything from the PQ, then allocates a ScoreDoc[] array sized to the number of results and returns a TopDocs object. You can then choose just the ones you need. On the other hand, topDocs(start, howMany) would allocate exactly the size of array needed. E.g., in case someone pages through results 91-100, you allocate an array of size 10, instead of 100. It is not a huge improvement, but it does save some allocations, and it's a convenient method. Though... this is somewhat tricky to implement efficiently when using a pqueue: you pop the worst scoring hit first, then the next worst, etc., into an array (in reverse order). It would be conceivable to do a [separate] partial sort of the queue to more efficiently retrieve a top subset of N to save some time on the extraction. But my guess is extraction time is trivial; I don't think we need to optimize it. That being said, we could make the API like this, but under the hood simply do what we do today the first time it's called, leaving as a future optimization to speed it up. Alternatively we could make a ScoreDoc result(int n) to retrieve each result one by one... or maybe doc(int n) and score(int n) since some collectors won't score (but, then we'd need to handle FieldDoc, which is used to more generically return the sort field values for each result). BTW, I like the name ResultsCollector, as it's just like HitCollector, but does not commit too much to hits .. i.e., facets aren't hits ... I think? I like it too! Mike
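The pop-worst-first extraction Mike describes maps directly onto a paging method. Here is a hedged sketch using java.util.PriorityQueue (a min-heap, so poll() returns the worst score first) as a stand-in for Lucene's pqueue; the method name topScores and its signature are illustrative, not the proposed API. Entries ranked below the requested page are popped and discarded, and only a howMany-sized array is allocated and filled back-to-front:

```java
import java.util.PriorityQueue;

// Sketch of topDocs(start, howMany) on top of a min-heap of scores.
// PriorityQueue and plain double scores stand in for Lucene's pqueue
// of ScoreDocs; the real implementation would pop ScoreDoc objects.
class Paging {
    static double[] topScores(PriorityQueue<Double> pq, int start, int howMany) {
        int size = pq.size();
        // Clamp the page to what the queue actually holds past 'start'.
        howMany = Math.min(howMany, Math.max(0, size - start));
        double[] page = new double[howMany];
        // Entries ranked after the page come out first: pop and discard.
        int discard = size - start - howMany;
        for (int i = 0; i < discard; i++) pq.poll();
        // The next howMany pops are the page, worst-to-best: fill in reverse.
        for (int i = howMany - 1; i >= 0; i--) page[i] = pq.poll();
        return page;
    }
}
```

As Mike notes, this destroys the queue just like today's topDocs(); the win is only that the returned array is the page size (e.g. 10 for results 91-100) rather than the full result count.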
Re: Is TopDocCollector's collect() implementation correct?
Shai Erera ser...@gmail.com wrote: BTW Mike, I noticed that TopFieldDocCollector extends TopScoreDocCollector. Weird. Probably we could put that back to extending [deprecated] TopDocCollector? That is a problem if we want to make TSDC final. Now, TFDC is marked deprecated, so it will be removed in the future. Right. I think an easy fix is just to have TFDC extend TopResultsCollector, right? Or, back to the way it was pre-1483 (extend TopDocCollector). Mike
Re: Is TopDocCollector's collect() implementation correct?
In 1483 Doug had also suggested: * Hitable I suppose Collector shouldn't really be in the name, since the class may not actually collect the results (eg if it simply counts). Mike Marvin Humphrey mar...@rectangular.com wrote: On Thu, Mar 26, 2009 at 08:44:57AM -0400, Michael McCandless wrote: do you have an alternative? Brainstorming * Harvester * Trawler * HitPicker * HitGrabber Marvin Humphrey
Re: List Moderators
: Every now and again, someone emails me off list asking to be removed from the : list and I always forward them to Erik, b/c I know he is a moderator. : However, I was wondering who else is besides Erik, since, AIUI, there needs to : be at least 3 in ASF-land, right? : : So, if you're a list moderator for dev/user, please stand up. the docs for, say, committers have instructions for checking the moderators for any list, however the process seems to no longer work (probably because mail handling got moved onto a different box)... http://www.apache.org/dev/committers.html#mailing-list-moderators https://svn.apache.org/repos/private/committers/docs/resources.txt ...might be worth following up with INFRA to sanity check the list of moderators on all lucene lists, make sure we have three *active* moderators on each list. -Hoss
Re: List Moderators
Is it also worthwhile to check if a static signature can be added to mails with instructions, or a link to the Apache mail instructions? It will reduce a lot of repeat questions. On Thu, Mar 26, 2009 at 2:46 PM, Chris Hostetter hossman_luc...@fucit.org wrote: ...might be worth following up with INFRA to sanity check the list of moderators on all lucene lists, make sure we have three *active* moderators on each list. -Hoss
[jira] Updated: (LUCENE-1573) IndexWriter does not do the right thing when a Thread is interrupt()'d
[ https://issues.apache.org/jira/browse/LUCENE-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1573: --- Attachment: LUCENE-1573.patch Attached patch. All tests pass, including a new one that showed the deadlock. I also found and fixed a case where IndexWriter would hang during close (thinking a BG merge was still running when it wasn't) if the InterruptedException arrived at the right time. I'll commit in a day or two.
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12689624#action_12689624 ] Uwe Schindler commented on LUCENE-831: -- I will attach my comments regarding the problem with the TrieRangeFilter and sorting (stop collecting terms into the cache when lower precisions begin, or only collect terms using a specific range, like a range filter). So you could fill a FieldCache and specify a starting term and an ending term; all terms in between would be put into the cache, and those outside left out. In this way, it would be possible to just use TrieUtils.prefixCodeLong() to specify the upper and lower integer bound encoded in the highest precision. Complete overhaul of FieldCache API/Implementation -- Key: LUCENE-831 URL: https://issues.apache.org/jira/browse/LUCENE-831 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Fix For: 3.0 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch Motivation: 1) Complete overhaul of the API/implementation of FieldCache-type things... a) eliminate the global static map keyed on IndexReader (thus eliminating the synch block between completely independent IndexReaders) b) allow more customization of cache management (ie: use expiration/replacement strategies, disk-backed caches, etc) c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader) d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed. e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders). 
2) Provide backwards compatibility to support the existing FieldCache API with the new implementation, so there is no redundant caching as client code migrates to the new API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
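A rough sketch of the range-limited cache fill Uwe describes: when loading a field's terms into a cache, skip everything before the starting term and stop once past the ending term. All names here are hypothetical stand-ins for the FieldCache/TermEnum internals, assuming terms are enumerated in sorted order:

```java
// Illustrative sketch only: collect terms within [lower, upper] into the
// cache, leaving terms outside the bounds out. Not the actual patch.
import java.util.ArrayList;
import java.util.List;

public class RangeLimitedCacheFill {
    // termsInOrder stands in for a TermEnum over the field's sorted terms.
    static List<String> fillCache(List<String> termsInOrder, String lower, String upper) {
        List<String> cached = new ArrayList<>();
        for (String t : termsInOrder) {
            if (t.compareTo(lower) < 0) continue; // before the range: skip
            if (t.compareTo(upper) > 0) break;    // past the range: stop collecting
            cached.add(t);                        // in between: put into the cache
        }
        return cached;
    }

    public static void main(String[] args) {
        List<String> cached = fillCache(List.of("a", "b", "c", "d", "e"), "b", "d");
        if (!cached.equals(List.of("b", "c", "d"))) throw new AssertionError(cached.toString());
        System.out.println(cached);
    }
}
```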
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12689629#action_12689629 ] Tim Smith commented on LUCENE-831: -- One requirement I would like to request is the ability to attach an arbitrary object to each segment. This will allow people using Lucene to store any arbitrary per-segment caches and statistics that their application requires (fully free-form). Would like to see the following: * add SegmentReader.setCustomCacheManager(CacheManager m) // maybe add a string for a CacheManager id (to allow registration of multiple cache managers) * add SegmentReader.getCustomCacheManager() // to allow accessing the manager CacheManager should be a very light interface (just a close() method that is called when the SegmentReader is closed) Complete overhaul of FieldCache API/Implementation -- Key: LUCENE-831 URL: https://issues.apache.org/jira/browse/LUCENE-831 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Fix For: 3.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
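Tim's proposed per-segment CacheManager could look roughly like the following standalone sketch. SegmentReaderStub and the method names mirror the proposal but are purely illustrative, not Lucene's actual API:

```java
// Hypothetical sketch of the per-segment CacheManager idea: a very light
// interface (just close()), attached to a reader by id, and closed when
// the reader closes.
import java.util.HashMap;
import java.util.Map;

public class CacheManagerSketch {
    /** Very light interface: close() is invoked when the segment reader closes. */
    interface CacheManager {
        void close();
    }

    /** Stand-in for SegmentReader, holding managers keyed by id per the proposal. */
    static class SegmentReaderStub {
        private final Map<String, CacheManager> managers = new HashMap<>();

        void setCustomCacheManager(String id, CacheManager m) { managers.put(id, m); }
        CacheManager getCustomCacheManager(String id) { return managers.get(id); }

        void close() {
            // Let each attached manager free its per-segment caches/statistics.
            for (CacheManager m : managers.values()) m.close();
        }
    }

    public static void main(String[] args) {
        SegmentReaderStub reader = new SegmentReaderStub();
        final boolean[] closed = { false };
        reader.setCustomCacheManager("stats", () -> closed[0] = true);
        reader.close();
        if (!closed[0]) throw new AssertionError("manager not closed");
        System.out.println("closed=" + closed[0]);
    }
}
```

Keying managers by a String id is the "allow registration of multiple cache managers" variant Tim mentions.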
[jira] Updated: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis Alves updated LUCENE-1567: --- Attachment: lucene_trunk_FlexQueryParser_2009March26.patch Here is an updated version of the patch with minor fixes. This version does not delete the old Lucene query parser. build.xml default, javadocs, and test-core all run fine. New flexible query parser - Key: LUCENE-1567 URL: https://issues.apache.org/jira/browse/LUCENE-1567 Project: Lucene - Java Issue Type: New Feature Components: QueryParser Environment: N/A Reporter: Luis Alves Assignee: Michael Busch Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, lucene_trunk_FlexQueryParser_2009March26.patch From the New flexible query parser thread by Michael Busch: In my team at IBM we have used a different query parser than Lucene's in our products for quite a while. Recently we spent a significant amount of time refactoring the code and designing a very generic architecture, so that this query parser can be easily used for different products with varying query syntaxes. This work was originally driven by Andreas Neumann (who, however, left our team); most of the code was written by Luis Alves, who has been a bit active in Lucene in the past, and Adriano Campos, who joined our team at IBM half a year ago. Adriano is an Apache committer and PMC member on the Tuscany project and is getting familiar with Lucene now too. We think this code is much more flexible and extensible than the current Lucene query parser, and would therefore like to contribute it to Lucene. I'd like to give a very brief architecture overview here; Adriano and Luis can then answer more detailed questions, as they're much more familiar with the code than I am. The goal was to separate the syntax and semantics of a query. E.g. 'a AND b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query. We distinguish the semantics of the different query components, e.g. 
whether and how to tokenize/lemmatize/normalize the different terms or which Query objects to create for the terms. We wanted to be able to write a parser with a new syntax, while reusing the underlying semantics, as quickly as possible. In fact, Adriano is currently working on a 100% Lucene-syntax compatible implementation to make it easy for people who are using Lucene's query parser to switch. The query parser has three layers, and its core is what we call the QueryNodeTree. It is a tree that initially represents the syntax of the original query, e.g. for 'a AND b':

  AND
  / \
 A   B

The three layers are: 1. QueryParser 2. QueryNodeProcessor 3. QueryBuilder 1. The upper layer is the parsing layer, which simply transforms the query text string into a QueryNodeTree. Currently our implementations of this layer use javacc. 2. The query node processors do most of the work. This layer is in fact a configurable chain of processors. Each processor can walk the tree and modify nodes or even the tree's structure. That makes it possible to e.g. do query optimization before the query is executed or to tokenize terms. 3. The third layer is also a configurable chain of builders, which transforms the QueryNodeTree into Lucene Query objects. Furthermore, the query parser uses flexible configuration objects, which are based on AttributeSource/Attribute. It also uses message classes that allow attaching resource bundles. This makes it possible to translate messages, which is an important feature of a query parser. This design allows us to develop different query syntaxes very quickly. Adriano wrote the Lucene-compatible syntax in a matter of hours, and the underlying processors and builders in a few days. We now have a 100% compatible Lucene query parser, which means the syntax is identical and all query parser test cases pass on the new one too, using a wrapper. Recent posts show that there is demand for query syntax improvements, e.g. improved range query syntax or operator precedence. 
There are already different QP implementations in Lucene+contrib; however, I think we did not keep them all up to date and in sync. This is not too surprising, because usually when fixes and changes are made to the main query parser, people don't make the corresponding changes in the contrib parsers. (I'm guilty here too.) With this new architecture it will be much easier to maintain different query syntaxes, as the first layer contains relatively little code. All syntaxes would benefit from patches and improvements we make to the underlying layers, which will make supporting different syntaxes much more manageable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
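The three-layer flow described above (parser -> node processors -> builders) can be sketched in miniature. Every class and method here is an illustrative stand-in, not the contributed parser's actual API:

```java
// Minimal sketch of the three-layer design: layer 1 parses text into a
// tiny node tree, layer 2 processors rewrite nodes (here: lowercasing
// terms), layer 3 builders turn the tree into the target representation
// (Query objects in Lucene; a string here for illustration).
import java.util.Arrays;
import java.util.List;

public class ThreeLayerSketch {
    static class Node {
        String op;           // non-null for operator nodes, e.g. "AND"
        List<Node> children; // operands of an operator node
        String term;         // non-null for leaf/term nodes
        Node(String op, List<Node> children) { this.op = op; this.children = children; }
        Node(String term) { this.term = term; }
    }

    // Layer 1: syntax only - turn "a AND b" into AND(a, b).
    static Node parse(String query) {
        String[] parts = query.split(" AND ");
        return new Node("AND", Arrays.asList(new Node(parts[0]), new Node(parts[1])));
    }

    // Layer 2: a processor walks the tree and may rewrite nodes or structure.
    static Node process(Node n) {
        if (n.term != null) return new Node(n.term.toLowerCase());
        n.children.replaceAll(ThreeLayerSketch::process);
        return n;
    }

    // Layer 3: a builder transforms the processed tree into the final form.
    static String build(Node n) {
        if (n.term != null) return n.term;
        return "+" + build(n.children.get(0)) + " +" + build(n.children.get(1));
    }

    public static void main(String[] args) {
        String result = build(process(parse("Foo AND Bar")));
        if (!result.equals("+foo +bar")) throw new AssertionError(result);
        System.out.println(result);
    }
}
```

Because each layer is a separate, configurable chain, a new syntax only needs a new layer-1 parser while processors and builders are reused.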
Re: Is TopDocCollector's collect() implementation correct?
On Thu, Mar 26, 2009 at 04:01:26PM +0200, Shai Erera wrote: I still think that ResultsCollector does what you describe. It simply collects results, while the word 'result' is quite *open* and does not commit to anything ... Another advantage of ResultsCollector is that the name suggests good self-documenting subclass names and variable names. For instance, it's reasonably clear what a BitSetCollector or a TopDocsCollector might do. And when there's only one var around, the name collector is an obvious choice no matter what the class. This is all possible because there's no other use of Collector within Lucene. I just think ResultsCollector is less euphonic and zippy than HitCollector, so it's worth exploring alternatives. How about dropping the word Collector, since it might not collect anything, and just save the highest score, or compute some facets .. Sure, wiping the slate clean and re-examining HitCollector's actual purpose and usage to discover new names is a good approach. Similar thinking went into Hitable and Harvester. FWIW, I'd have to disagree that HitCollector doesn't collect anything. It may not collect hits per se, but it's definitely iterating over hits (in the sense of successful matches) and with only rare exceptions, it's collecting *something*. What about something with a *Listener like: DocIdListener, SearchListener, MatchListener (it listens on search matches). Considering how we attach HITCOLLECTORTHINGY onto the matching process is a novel take and clarifying to see. However, maybe it's just me, but *Listener evokes the JavaScript EventListener stuff, which seems radically different. Also, if I saw a listener variable in scoring loop code, or a TopDocsListener module in the JavaDocs, it wouldn't spring out to me that it would be doing what a HitCollector does right now. Cheers, Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12689680#action_12689680 ] Luis Alves commented on LUCENE-1567: Hi Grant and Michael, I faxed the CLA today. New flexible query parser - Key: LUCENE-1567 URL: https://issues.apache.org/jira/browse/LUCENE-1567 Project: Lucene - Java Issue Type: New Feature Components: QueryParser Environment: N/A Reporter: Luis Alves Assignee: Michael Busch Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, lucene_trunk_FlexQueryParser_2009March26.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands,
Re: Is TopDocCollector's collect() implementation correct?
On Thu, Mar 26, 2009 at 5:28 PM, Marvin Humphrey mar...@rectangular.com wrote: On Thu, Mar 26, 2009 at 04:01:26PM +0200, Shai Erera wrote: I still think that ResultsCollector does what you describe. It simply collects results, while the word 'result' is quite *open* and does not commit to anything ... Another advantage of ResultsCollector is that the name suggests good self-documenting subclass names and variable names. For instance, it's reasonably clear what a BitSetCollector or a TopDocsCollector might do. And when there's only one var around, the name collector is an obvious choice no matter what the class. This is all possible because there's no other use of Collector within Lucene. I think ResultsCollector (or maybe ResultCollector) is my favorite so far... But how about simply Collector? (I realize it's very generic... but we don't collect anything else in Lucene?). What about something with a *Listener like: DocIdListener, SearchListener, MatchListener (it listens on search matches). Considering how we attach HITCOLLECTORTHINGY onto the matching process is a novel take and clarifying to see. However, maybe it's just me, but *Listener evokes the JavaScript EventListener stuff, which seems radically different. Also, if I saw a listener variable in scoring loop code, or a TopDocsListener module in the JavaDocs, it wouldn't spring out to me that it would be doing what a HitCollector does right now. Yeah I'm not really a fan of Listener either. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
On Thu, Mar 26, 2009 at 06:03:07PM -0400, Michael McCandless wrote: But how about simply Collector? (I realize it's very generic... but we don't collect anything else in Lucene?). +1 Honorable mention to NitPicker, LOL. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
I think ResultsCollector (or maybe ResultCollector) is my favorite so far... But how about simply Collector? (I realize it's very generic... but we don't collect anything else in Lucene?). That's exactly what I'm using in my app - abstract class Collector extends HitCollector, that serves as a base for all my custom collectors :) So, yeah, I like this name. Yeah I'm not really a fan of Listener either. +1 -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
PooledSegmentReader, pools SegmentReader underlying byte arrays --- Key: LUCENE-1574 URL: https://issues.apache.org/jira/browse/LUCENE-1574 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 PooledSegmentReader pools the underlying byte arrays of deleted docs and norms for realtime search. It is designed for use with IndexReader.clone, which can create many copies of byte arrays that are all of the same length for a given segment. When pooled, these arrays can be reused, which could save memory. Do we want to benchmark the memory usage of PooledSegmentReader vs. plain GC? Many times GC is enough for these smaller objects. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
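The pooling idea might be sketched as a simple free-list of fixed-length byte arrays: since all copies for a given segment share one length, a released array can be handed back out instead of being left to the GC. This is an illustration of the concept only, not the actual PooledSegmentReader patch:

```java
// Illustrative sketch of pooling fixed-length byte arrays (e.g. the
// deleted-docs and norms copies made by IndexReader.clone).
import java.util.ArrayDeque;

public class ByteArrayPool {
    private final int length; // all arrays for a given segment share one length
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();

    ByteArrayPool(int length) { this.length = length; }

    /** Reuse a released array if one is available, else allocate. */
    synchronized byte[] acquire() {
        byte[] a = free.poll();
        return a != null ? a : new byte[length];
    }

    /** Return an array to the pool for reuse instead of leaving it to GC. */
    synchronized void release(byte[] a) {
        if (a.length == length) free.push(a);
    }

    public static void main(String[] args) {
        ByteArrayPool pool = new ByteArrayPool(8);
        byte[] first = pool.acquire();
        pool.release(first);
        byte[] second = pool.acquire(); // the released array comes back
        if (first != second) throw new AssertionError("expected reuse");
        System.out.println("reused=" + (first == second));
    }
}
```

Whether this beats plain GC for such small, short-lived arrays is exactly the benchmark question the issue raises.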
[jira] Updated: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis Alves updated LUCENE-1567: --- Attachment: (was: lucene_trunk_FlexQueryParser_2009March26.patch) New flexible query parser - Key: LUCENE-1567 URL: https://issues.apache.org/jira/browse/LUCENE-1567 Project: Lucene - Java Issue Type: New Feature Components: QueryParser Environment: N/A Reporter: Luis Alves Assignee: Michael Busch Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, lucene_trunk_FlexQueryParser_2009March26_v3.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail:
[jira] Updated: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis Alves updated LUCENE-1567: --- Attachment: lucene_trunk_FlexQueryParser_2009March26_v3.patch I cleaned up all the javadocs on this one. New flexible query parser - Key: LUCENE-1567 URL: https://issues.apache.org/jira/browse/LUCENE-1567 Project: Lucene - Java Issue Type: New Feature Components: QueryParser Environment: N/A Reporter: Luis Alves Assignee: Michael Busch Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, lucene_trunk_FlexQueryParser_2009March26_v3.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For
Hudson build is back to normal: Lucene-trunk #778
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/778/changes - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
I really liked HitCollector and hate to give up the name ... However, Collector is fine with me too, and it is at least more generic than HitCollector ... Hitable sounds too aggressive/violent to me :) BTW, I guess I should create some new searcher API which receives this Collector class (is Collector the chosen name?) and deprecate those that accept HitCollector? Those can also skip the instanceof check and the wrapping of HC in MRHC ... That also means that I should throw away that MRHC wrapper (which rebases doc Ids)? If HitCollector is deprecated, then there's no need to keep it. But perhaps we want it there in 2.9 for easier migration? Personally I think it's redundant, since in 3.0 people will need to change all their collectors anyway (since HitCollector will be removed, and every class which extends HitCollector will need to be modified). What do you think? Also, there's no need to deprecate MRHC, since it's only in the trunk - I can simply rename it, right? Ok, I'll go ahead and prepare a patch. We can discuss the name more; in the end it will just be a short refactor action in Eclipse, so that shouldn't hold us (or me) up. Shai On Fri, Mar 27, 2009 at 1:24 AM, Earwin Burrfoot ear...@gmail.com wrote: I think ResultsCollector (or maybe ResultCollector) is my favorite so far... But how about simply Collector? (I realize it's very generic... but we don't collect anything else in Lucene?). That's exactly what I'm using in my app - abstract class Collector extends HitCollector, that serves as a base for all my custom collectors :) So, yeah, I like this name. Yeah I'm not really a fan of Listener either. +1 -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
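The doc-id re-basing the thread keeps returning to (setNextReader() telling a per-segment collector each segment's doc base, and a wrapper doing the re-basing on behalf of an old-style collector that expects global ids) can be sketched as follows. The class names are simplified stand-ins for MRHC/HitCollector, not Lucene's exact classes:

```java
// Hedged sketch of per-segment collection with doc-id re-basing: the
// new-style collector receives segment-local ids plus a docBase; the
// wrapper adapts an old-style collector by adding docBase itself.
import java.util.ArrayList;
import java.util.List;

public class RebasingSketch {
    /** New-style collector: told the doc base of each segment it visits. */
    abstract static class Collector {
        abstract void setNextReader(int docBase);
        abstract void collect(int segmentLocalDoc);
    }

    /** Old-style collector that expects already-rebased (global) doc ids. */
    abstract static class HitCollector {
        abstract void collect(int globalDoc);
    }

    /** Wrapper whose setNextReader does the re-basing on the HC's behalf. */
    static Collector wrap(HitCollector hc) {
        return new Collector() {
            int docBase;
            void setNextReader(int docBase) { this.docBase = docBase; }
            void collect(int doc) { hc.collect(docBase + doc); }
        };
    }

    public static void main(String[] args) {
        List<Integer> seen = new ArrayList<>();
        Collector c = wrap(new HitCollector() {
            void collect(int globalDoc) { seen.add(globalDoc); }
        });
        // Two segments: the first holds 5 docs (base 0), the second starts at base 5.
        c.setNextReader(0); c.collect(3);
        c.setNextReader(5); c.collect(2);
        if (!seen.equals(List.of(3, 7))) throw new AssertionError(seen.toString());
        System.out.println(seen);
    }
}
```

This is why the wrapper's setNextReader() is not empty, as discussed earlier in the thread: it is the piece that rebases segment-local ids into global ones for legacy collectors.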