[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
[ https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717231#action_12717231 ]

Michael McCandless commented on LUCENE-1453:

Since we've deprecated all methods that are using FSDirectory.getDirectory under-the-hood, why do we even need to fix this? Ie why replace all these with the new FSDir.open, now, when we're just going to remove them in 3.0 anyway?

When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

Key: LUCENE-1453
URL: https://issues.apache.org/jira/browse/LUCENE-1453
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 2.4
Reporter: Mark Miller
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.4.1, 2.9
Attachments: Failing-testcase-LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch

Rough summary. Basically, FSDirectory tracks references to FSDirectory and when IndexReader.reopen shares a Directory with a created IndexReader and closeDirectory is true, FSDirectory's ref management will see two decrements for one increment. You can end up getting an AlreadyClosed exception on the Directory when the IndexReader is open. I have a test I'll put up. A solution seems fairly straightforward (at least in what needs to be accomplished).

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
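The "two decrements for one increment" problem in the rough summary can be modelled in a few lines. This is only a sketch under stated assumptions: RefCountedDir, Reader and DoubleCloseDemo are hypothetical stand-ins, not Lucene's FSDirectory or IndexReader, and the real code paths are more involved.

```java
// Hypothetical model of the ref-counting bug; none of these classes are Lucene's.
class DoubleCloseDemo {
    /** True if the reopened reader still sees an open directory after the old reader closes. */
    static boolean newReaderSeesOpenDirectory() {
        RefCountedDir dir = new RefCountedDir();
        dir.acquire();                      // one increment for the first reader
        Reader old = new Reader(dir, true);
        Reader fresh = old.reopen();        // shares dir, no second increment
        old.close();                        // first decrement: count drops to zero
        return fresh.directoryOpen();
    }

    public static void main(String[] args) {
        System.out.println("fresh reader sees open dir: " + newReaderSeesOpenDirectory());
    }
}

/** Directory stand-in that counts references. */
class RefCountedDir {
    private int refCount = 0;

    void acquire() { refCount++; }

    void close() {
        if (refCount <= 0) throw new IllegalStateException("already closed");
        refCount--;
    }

    boolean isOpen() { return refCount > 0; }
}

/** Reader stand-in that closes its directory when closeDirectory is true. */
class Reader {
    private final RefCountedDir dir;
    private final boolean closeDirectory;

    Reader(RefCountedDir dir, boolean closeDirectory) {
        this.dir = dir;
        this.closeDirectory = closeDirectory;
    }

    // Models the problematic reopen: the directory and the closeDirectory flag
    // are shared, but no extra reference is acquired for the new reader.
    Reader reopen() { return new Reader(dir, closeDirectory); }

    void close() { if (closeDirectory) dir.close(); }

    boolean directoryOpen() { return dir.isOpen(); }
}
```

In this model the reopened reader is left with a closed directory as soon as the old reader is closed, and closing the reopened reader afterwards would be the second decrement.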
[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
[ https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717242#action_12717242 ]

Uwe Schindler commented on LUCENE-1453:

Because the error sometimes also occurs with the refcounting directories, just less often (the refcounting helps to keep the directory open, even when it is closed one time too many). And our problem: we really want to remove this ugly closeDir stuff from IndexReaders; the code is sometimes unreadable and it's hard to find out what's going on.
[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
[ https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717244#action_12717244 ]

Michael McCandless commented on LUCENE-1453:

But the refcounting is also deprecated? And IndexReader will no longer track closeDir in 3.0, since that's only set to true in the deprecated methods?

bq. Because the error sometimes also occurs with the refcounting directories,

Oh, you mean there is an intermittent failure on the current trunk? (Ie, when using FSDir.getDirectory under the hood.)
Re: WebLuke - include Jetty in Lucene binary distribution?
Hey John,

I like WebLuke too, but am not sure what ever became of it. It seemed like it had a lot of traction (http://www.lucidimagination.com/search/document/3b06db2b12dffb70/webluke_include_jetty_in_lucene_binary_distribution ) but that the main objection was the size of the GWT stuff and a Web Server as part of the distribution. Not sure whether Mark has been maintaining it or not.

In other words, I'm +1 for WebLuke (and Luke, for that matter, although I know it has some GPL components) being a part of Lucene, even if, just maybe, it isn't part of the main distribution.

-Grant

On Jun 5, 2009, at 11:27 PM, John Wang wrote:

Hi guys: I am interested in what is the latest decision on webluke - I downloaded the zip, tried it and love it! Does it support all Luke's functionality? (especially the plugin support)

Thanks
-John

On Sun, Apr 27, 2008 at 7:09 AM, Uwe Schindler u...@thetaphi.de wrote:

Here is another Servlet 2.3 compatible container: http://panfmp.svn.sourceforge.net/viewvc/panfmp/tools/mini-webserver/trunk/

It does not support web.xml files (instead it uses a simple properties file), but it supports almost everything needed to get simple servlets running with path mappings etc. The support for web.xml was left out because of compatibility with very old Java versions without XML support and to keep it small. The JAR file is about 39 KB, plus servlet.jar version 2.3 without JSP classes (31 KB) and commons-logging. We currently use it for a CD-ROM based Lucene search engine. It's licensed under Apache 2.0 and is Java 1.3 compatible (no generics, StringBuffer). The SVN currently lacks documentation and startup shell scripts, but a working config file is supplied. The SVN contains a few more JAR files, but only webserver.jar, servlet-2.3.jar and commons-logging.jar are needed. One feature is that the static content servlet can serve files directly from ZIP files (e.g., http://localhost/file.zip/some/example.txt).

- Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-Original Message-
From: Nadav Har'El [mailto:n...@math.technion.ac.il]
Sent: Sunday, April 27, 2008 3:08 PM
To: java-dev@lucene.apache.org
Subject: Re: WebLuke - include Jetty in Lucene binary distribution?

On Sun, Dec 09, 2007, markharw00d wrote about WebLuke - include Jetty in Lucene binary distribution?:

The only open question is if we should bundle Jetty in the Lucene binary distribution as part of the build packaging. This could be used to launch both WebLuke and the existing luceneweb.war but adds about 6 or 7 meg to the overall zipped download size. Thoughts?

My thought is that 6-7 MB for a tiny HTTP server and/or servlet engine is way, way too much. I'm surprised that Jetty, originally intended to be simple and embeddable, reached that size (which is 10 times larger than Lucene's core, for example)!

For demo purposes, I wrote myself something similar, and its (uncompressed) .class size is:

14 K for the basic HTTP server
24 K for the servlet container (javax.servlet API support)

And there's also the Servlet API itself from Sun, at around 40 K (this is part of J2EE but not of J2SE, so you need to include this as well if you want to use the servlet API). And that's it.

I'm sure that similar tiny Web servers can also be found on the Web, but if there's interest, I can see about publishing mine.

--
Nadav Har'El              | Sunday, Apr 27 2008, 22 Nisan 5768
IBM Haifa Research Lab    | Why do we drive on a parkway and park on
http://nadav.harel.org.il | a driveway?

--
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: WebLuke - include Jetty in Lucene binary distribution?
Hi John/Grant.

I haven't done any more in developing WebLuke - although still use it regularly. As Grant suggests there was an unease (mine) about bloating the Lucene distribution size with GWT dependencies so it wasn't rolled into contrib. However I guess I'm comfortable if no one else is concerned about this.

The GWT skin is useful for remote working but I think Luke could/should be built with a front-end-independent back end leaving the door open for Swing or SWT front-ends for work with local indexes. The current Thinlet skin is the piece that has the unfortunate GPL dependency. GWT is Apache licensed and so would be OK.

I would probably need to upgrade WebLuke to the latest version of GWT prior to any contribution and would also like to de-GWT-ize the back end. I guess the main question is how to manage/build/package the contrib section given WebLuke could bring in Jetty and we already have 2 web-based contrib demos in there that could use this too.

Cheers
Mark

From: Grant Ingersoll gsing...@apache.org
To: java-dev@lucene.apache.org
Sent: Monday, 8 June, 2009 14:03:49
Subject: Re: WebLuke - include Jetty in Lucene binary distribution?
Some thoughts around the use of reader.isDeleted and hasDeletions
Hi

I recently read CHANGES to learn more about the readOnly parameter IndexReader now supports, and came across LUCENE-1329 with a comment that isDeleted was made not synchronized if readOnly=true (e.g. ReadOnlyIndexReader), which can affect search code, as it is usually the bottleneck for search operations.

I searched the code and was surprised to see that isDeleted and hasDeletions are not called from any search code. Instead, the code, such as SegmentTermDocs, uses the deletedDocs instance directly. So in fact isDeleted wasn't the bottleneck (unless this was part of the changes in 1329). Anyway, it doesn't matter, that's good!

However, I did find some indexing code which calls these two, like SegmentMerger when it merges fields and term vectors. I think that we can improve that code by writing some specialized code for merging: if the reader has no deletions, there's no point checking for each document whether there are deletions and/or whether the document was deleted. In fact, SegmentMerger checks for each document: (1) whether the reader has deletions and if the document was deleted, (2) if the reader has a matching reader and (3) if checkAbort is not null.

I have a couple of suggestions to simplify that code:

1. Create two specialized copyFieldsWithDeletions/copyFieldsNoDeletions to get rid of the unnecessary if (hasDeletions) check for each doc.
2. In these, check if the reader has a matching reader or not, and execute the appropriate code in each case.
3. Also, check up front if checkAbort != null.
3.1 (3) duplicates the code, so perhaps instead I can create a dummy checkAbort, which does nothing? That way we'll always call checkAbort.work(). But this adds a method call, so I'm not sure.

(The same optimizations apply to mergeVectors().)

In addition, I think something can be done w/ SegmentTermDocs. Either create a specialized STD based on whether it has deletions or not, or create a dummy BitVector which returns false for every check. That way we can eliminate the checks in each next(), skipTo(). A dummy BitVector will leave the file as-is, but will add a method call, so I think I lean towards the specialized STD. This can be as simple as making STD abstract, with a static create() method which creates the right instance.

I believe there are other places where we can make such optimizations. If the overall direction seems right to you, I can open an issue and start to work on a patch. Currently I've patched SegmentMerger, and besides the class getting bigger, nothing bad happened (meaning all tests pass), though it might be interesting to check how it affects performance.

What do you think?

Shai
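A rough illustration of the suggestions above, not the actual patch: the sketch hoists the deletions check out of the per-document loop, substitutes a do-nothing callback for a null checkAbort, and shows the abstract-class-plus-static-create() idea for a doc cursor. SpecializedLoops, Abort and DocCursor are hypothetical names, java.util.BitSet stands in for Lucene's BitVector, and "copying" a document is reduced to collecting its id.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class SpecializedLoops {
    /** Stand-in for a progress/abort callback; hypothetical, not Lucene's CheckAbort. */
    interface Abort { void work(int units); }

    /** Do-nothing variant, so the loops never have to test for null. */
    static final Abort NO_OP = units -> { };

    /** Chooses the loop once per reader instead of testing for deletions per document. */
    static List<Integer> copyDocs(int maxDoc, BitSet deleted, Abort abort) {
        Abort a = (abort != null) ? abort : NO_OP;
        boolean hasDeletions = deleted != null && !deleted.isEmpty();
        return hasDeletions ? copyWithDeletions(maxDoc, deleted, a)
                            : copyNoDeletions(maxDoc, a);
    }

    private static List<Integer> copyNoDeletions(int maxDoc, Abort abort) {
        List<Integer> out = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            out.add(doc);               // no deletion check at all
            abort.work(1);
        }
        return out;
    }

    private static List<Integer> copyWithDeletions(int maxDoc, BitSet deleted, Abort abort) {
        List<Integer> out = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            if (deleted.get(doc)) continue;   // skip deleted documents
            out.add(doc);
            abort.work(1);
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(1);
        System.out.println(copyDocs(4, null, null));
        System.out.println(copyDocs(4, deleted, null));
    }
}

/** Cursor over live documents; create() returns the variant that fits the reader. */
abstract class DocCursor {
    protected final int maxDoc;
    protected int doc = -1;

    DocCursor(int maxDoc) { this.maxDoc = maxDoc; }

    abstract boolean next();

    int doc() { return doc; }

    static DocCursor create(int n, BitSet deleted) {
        if (deleted == null || deleted.isEmpty()) {
            return new DocCursor(n) {
                @Override boolean next() { return ++doc < maxDoc; }
            };
        }
        return new DocCursor(n) {
            @Override boolean next() {
                do { doc++; } while (doc < maxDoc && deleted.get(doc));
                return doc < maxDoc;
            }
        };
    }
}
```

Whether the extra virtual call through Abort or DocCursor costs more than the removed branches is exactly the open performance question raised in the message; the sketch does not answer it.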
[jira] Updated: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
[ https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1453:

Attachment: LUCENE-1453.patch

This is the solution using the FilterIndexReader; all tests now pass (with the deprecated refcounting dirs as well as FSDir.open dirs, see next patch). The solution consists of two parts:

- All closeDirectory stuff is removed from DirectoryIndexReader (even the ugly FSDir cloning) and from ReadOnlyDirectoryIndexReader; the code is now simpler to understand. It is now in the state intended for 3.0, with no deprecated helper stuff anymore in these internal classes, so they can be used in 3.0 without modifications.
- As the DirectoryIndexReader is not closing the directory anymore, the deprecated IndexReader.open methods taking String/File would no longer work correctly (because they would fail to close the dir on closing). To fix this easily, a deprecated private class extending FilterIndexReader was added that wraps around the DirectoryIndexReader when File/String opens are used. This class keeps a refcounter that is incremented on reopen/clone and decremented on doClose(). The last doClose() closes the directory. In 3.0 this class can be removed easily together with all File/String open() methods. I could remove this class from IndexReader.java and put it in a separate package-private file, if you like.

I would like to have this in 2.9, to get rid of these ugly closeDirectory hacks! All tests pass (I retried TestIndexReaderReopen about a hundred times and no variant fails anymore). It also works when replacing the refcounting FSDir.getDirectory by FSDir.open() calls (see next patch).
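The wrapper described above can be pictured roughly as follows. This is a minimal sketch of the idea only, not the code in the attached patch: FakeDirectory and DirClosingReader are hypothetical stand-ins, and the real class wraps a DirectoryIndexReader via FilterIndexReader rather than holding a directory directly.

```java
import java.util.concurrent.atomic.AtomicInteger;

class SharedDirDemo {
    /** Returns {directory open after the first close, directory open after the last close}. */
    static boolean[] run() {
        FakeDirectory dir = new FakeDirectory();
        DirClosingReader first = new DirClosingReader(dir);
        DirClosingReader second = first.reopen();   // counter goes from 1 to 2
        first.close();                              // counter 1: directory stays open
        boolean openAfterFirst = dir.isOpen();
        second.close();                             // counter 0: directory is closed
        boolean openAfterLast = dir.isOpen();
        return new boolean[] { openAfterFirst, openAfterLast };
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println(r[0] + " " + r[1]);
    }
}

/** Directory stand-in with a plain open/closed flag (no ref counting of its own). */
class FakeDirectory {
    private boolean open = true;

    void close() { open = false; }

    boolean isOpen() { return open; }
}

/** Reader stand-in: all readers produced by reopen() share one counter. */
class DirClosingReader {
    private final FakeDirectory dir;
    private final AtomicInteger refs;
    private boolean closed;

    DirClosingReader(FakeDirectory dir) { this(dir, new AtomicInteger(1)); }

    private DirClosingReader(FakeDirectory dir, AtomicInteger refs) {
        this.dir = dir;
        this.refs = refs;
    }

    /** Increment on reopen, so every reader owns exactly one reference. */
    DirClosingReader reopen() {
        refs.incrementAndGet();
        return new DirClosingReader(dir, refs);
    }

    /** Decrement on close; only the last close touches the directory. */
    void close() {
        if (closed) return;             // closing the same reader twice is a no-op
        closed = true;
        if (refs.decrementAndGet() == 0) dir.close();
    }
}
```

Keeping the counter in the wrapper means the directory itself needs no ref counting, which fits the comment that the change also works with FSDir.open().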
[jira] Updated: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
[ https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1453:

Attachment: LUCENE-1453-with-FSDir-open.patch

This is a variant for testing the same with FSDir.open(). As you see, the reopening now also works correctly here and the underlying directory is not closed too often. This patch is for demonstration only.
[jira] Commented: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717486#action_12717486 ]

Michael Busch commented on LUCENE-1567:

Is it mostly internal stuff you need to change to compile with 1.4, or do a lot of public APIs also use generics?

New flexible query parser

Key: LUCENE-1567
URL: https://issues.apache.org/jira/browse/LUCENE-1567
Project: Lucene - Java
Issue Type: New Feature
Components: QueryParser
Environment: N/A
Reporter: Luis Alves
Assignee: Grant Ingersoll
Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, lucene_trunk_FlexQueryParser_2009March26_v3.patch, QueryParser_restructure_meetup_june2009_v2.pdf

From the "New flexible query parser" thread by Michael Busch:

in my team at IBM we have used a different query parser than Lucene's in our products for quite a while. Recently we spent a significant amount of time in refactoring the code and designing a very generic architecture, so that this query parser can be easily used for different products with varying query syntaxes.

This work was originally driven by Andreas Neumann (who, however, left our team); most of the code was written by Luis Alves, who has been a bit active in Lucene in the past, and Adriano Campos, who joined our team at IBM half a year ago. Adriano is an Apache committer and PMC member on the Tuscany project and is getting familiar with Lucene now too.

We think this code is much more flexible and extensible than the current Lucene query parser, and would therefore like to contribute it to Lucene. I'd like to give a very brief architecture overview here; Adriano and Luis can then answer more detailed questions as they're much more familiar with the code than I am.

The goal was to separate syntax and semantics of a query. E.g. 'a AND b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query. We distinguish the semantics of the different query components, e.g. whether and how to tokenize/lemmatize/normalize the different terms or which Query objects to create for the terms. We wanted to be able to write a parser with a new syntax, while reusing the underlying semantics, as quickly as possible. In fact, Adriano is currently working on a 100% Lucene-syntax compatible implementation to make it easy for people who are using Lucene's query parser to switch.

The query parser has three layers and its core is what we call the QueryNodeTree. It is a tree that initially represents the syntax of the original query, e.g. for 'a AND b':

    AND
   /   \
  A     B

The three layers are:

1. QueryParser
2. QueryNodeProcessor
3. QueryBuilder

1. The upper layer is the parsing layer which simply transforms the query text string into a QueryNodeTree. Currently our implementations of this layer use javacc.
2. The query node processors do most of the work. This layer is in fact a configurable chain of processors. Each processor can walk the tree and modify nodes or even the tree's structure. That makes it possible to e.g. do query optimization before the query is executed or to tokenize terms.
3. The third layer is also a configurable chain of builders, which transform the QueryNodeTree into Lucene Query objects.

Furthermore the query parser uses flexible configuration objects, which are based on AttributeSource/Attribute. It also uses message classes that allow resource bundles to be attached. This makes it possible to translate messages, which is an important feature of a query parser.

This design allows us to develop different query syntaxes very quickly. Adriano wrote the Lucene-compatible syntax in a matter of hours, and the underlying processors and builders in a few days. We now have a 100% compatible Lucene query parser, which means the syntax is identical and all query parser test cases pass on the new one too using a wrapper.

Recent posts show that there is demand for query syntax improvements, e.g. improved range query syntax or operator precedence. There are already different QP implementations in Lucene+contrib, however I think we did not keep them all up to date and in sync. This is not too surprising, because usually when fixes and changes are made to the main query parser, people don't make the corresponding changes in the contrib parsers. (I'm guilty here too.)

With this new architecture it will be much easier to maintain different query syntaxes, as the actual code for the first layer is not very much. All syntaxes would benefit from patches and improvements we make to the underlying layers, which will make supporting different syntaxes much more manageable.
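To make the three layers concrete, here is a toy sketch in which two syntaxes share one processor chain and one builder. It is an illustration under heavy assumptions, not the contributed code: FlexParserSketch, Node, parseInfix, parsePlus, lowercase and build are hypothetical names, the parsers only handle flat conjunctions, and the builder emits a string where the real third layer would create Lucene Query objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

class FlexParserSketch {
    /** Tree node: an operator with children, or a term leaf (op == null). */
    static final class Node {
        final String op;
        final String term;
        final List<Node> children = new ArrayList<>();

        Node(String op, String term) { this.op = op; this.term = term; }
    }

    // Layer 1, syntax A: 'a AND b' (flat conjunctions only).
    static Node parseInfix(String text) {
        Node and = new Node("AND", null);
        for (String t : text.trim().split("\\s+AND\\s+")) {
            and.children.add(new Node(null, t));
        }
        return and;
    }

    // Layer 1, syntax B: '+a +b' (every token must start with '+').
    static Node parsePlus(String text) {
        Node and = new Node("AND", null);
        for (String t : text.trim().split("\\s+")) {
            and.children.add(new Node(null, t.substring(1)));   // drop the leading '+'
        }
        return and;
    }

    // Layer 2: one processor that walks the tree and normalizes terms.
    static Node lowercase(Node n) {
        if (n.op == null) return new Node(null, n.term.toLowerCase(Locale.ROOT));
        Node copy = new Node(n.op, null);
        for (Node c : n.children) copy.children.add(lowercase(c));
        return copy;
    }

    // Layer 3: builder; a string stands in for the resulting query object.
    static String build(Node n) {
        if (n.op == null) return n.term;
        StringBuilder sb = new StringBuilder();
        for (Node c : n.children) {
            if (sb.length() > 0) sb.append(' ');
            sb.append('+').append(build(c));
        }
        return sb.toString();
    }

    /** Runs parser, processor chain, then builder. */
    static String run(String text, boolean plusSyntax) {
        Node tree = plusSyntax ? parsePlus(text) : parseInfix(text);
        List<UnaryOperator<Node>> chain = new ArrayList<>();
        chain.add(FlexParserSketch::lowercase);     // configurable chain of processors
        for (UnaryOperator<Node> p : chain) tree = p.apply(tree);
        return build(tree);
    }

    public static void main(String[] args) {
        System.out.println(run("A AND b", false));
        System.out.println(run("+a +B", true));
    }
}
```

Only the two parse methods differ between the syntaxes; the processor and the builder are reused unchanged, which is the maintenance benefit the description argues for.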
[jira] Commented: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717492#action_12717492 ]

Adriano Crestani commented on LUCENE-1567:

It's mostly internal stuff. The only API that uses generics is QueryNode, which returns List<QueryNode> and receives it as a param. I actually don't think it's a big deal :p
Re: [jira] Commented: (LUCENE-1567) New flexible query parser
I actually think we should give the parser to contrib in 2.9 using JDK 1.5 syntax and move it to main in 3.0, still using JDK 1.5 syntax.

I don't think it's a small change; it will affect the interfaces and all implementations of QueryNode objects.

I would see nothing wrong with having a JDK 1.4 version if we were 100% compatible with the old queryparser, but since that is not the case, I don't think it is worth it. (The wrapper we built does not support the case where users extend the old queryparser class and override methods to add new functionality.)

Adriano Crestani (JIRA) wrote:

[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717492#action_12717492 ]

Adriano Crestani commented on LUCENE-1567:

It's mostly internal stuff; the only API that uses generics is QueryNode, which returns List<QueryNode> and receives it as a param. I actually don't think it's a big deal :p

New flexible query parser

    Key: LUCENE-1567
    URL: https://issues.apache.org/jira/browse/LUCENE-1567
    Project: Lucene - Java
    Issue Type: New Feature
    Components: QueryParser
    Environment: N/A
    Reporter: Luis Alves
    Assignee: Grant Ingersoll
    Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, lucene_trunk_FlexQueryParser_2009March26_v3.patch, QueryParser_restructure_meetup_june2009_v2.pdf

From the "New flexible query parser" thread by Michael Busch:

In my team at IBM we have used a different query parser than Lucene's in our products for quite a while. Recently we spent a significant amount of time refactoring the code and designing a very generic architecture, so that this query parser can be easily used for different products with varying query syntaxes.

This work was originally driven by Andreas Neumann (who, however, left our team); most of the code was written by Luis Alves, who has been a bit active in Lucene in the past, and Adriano Campos, who joined our team at IBM half a year ago. Adriano is an Apache committer and PMC member on the Tuscany project and is getting familiar with Lucene now too.

We think this code is much more flexible and extensible than the current Lucene query parser, and would therefore like to contribute it to Lucene. I'd like to give a very brief architecture overview here; Adriano and Luis can then answer more detailed questions, as they're much more familiar with the code than I am.

The goal was to separate the syntax and semantics of a query. E.g. 'a AND b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query. We distinguish the semantics of the different query components, e.g. whether and how to tokenize/lemmatize/normalize the different terms or which Query objects to create for the terms. We wanted to be able to write a parser with a new syntax, while reusing the underlying semantics, as quickly as possible. In fact, Adriano is currently working on a 100% Lucene-syntax compatible implementation to make it easy for people who are using Lucene's query parser to switch.

The query parser has three layers and its core is what we call the QueryNodeTree. It is a tree that initially represents the syntax of the original query, e.g. for 'a AND b':

      AND
     /   \
    A     B

The three layers are:

1. QueryParser
2. QueryNodeProcessor
3. QueryBuilder

1. The upper layer is the parsing layer, which simply transforms the query text string into a QueryNodeTree. Currently our implementations of this layer use javacc.

2. The query node processors do most of the work. This layer is in fact a configurable chain of processors. Each processor can walk the tree and modify nodes or even the tree's structure. That makes it possible to e.g. do query optimization before the query is executed or to tokenize terms.

3. The third layer is also a configurable chain of builders, which transform the QueryNodeTree into Lucene Query objects.

Furthermore, the query parser uses flexible configuration objects, which are based on AttributeSource/Attribute. It also uses message classes that allow resource bundles to be attached. This makes it possible to translate messages, which is an important feature of a query parser.

This design allows us to develop different query syntaxes very quickly. Adriano wrote the Lucene-compatible syntax in a matter of hours, and the underlying processors and builders in a few days. We now have a 100% compatible Lucene query parser, which means the syntax is identical and all query parser test cases pass on the new one too using a wrapper.

Recent posts show that there is demand for query syntax improvements, e.g. improved range query syntax or operator precedence. There are already different QP implementations in Lucene+contrib, however I think we did not keep them all up to date and in sync. This is not too surprising, because usually when fixes and changes are made to the main query
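
For readers following the architecture overview quoted above, here is a rough sketch of how three such layers might fit together. It is not the code from the attached patches: every class, interface and method name below (ThreeLayerSketch, Node, SyntaxParser, Processor, Builder, and so on) is a hypothetical stand-in, and the "builder" produces a plain String rather than a Lucene Query so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch only: none of these names or signatures come from the
// LUCENE-1567 patches. It illustrates "syntax in, shared semantics, query out".
class ThreeLayerSketch {

    /** One node of the query node tree: a label plus ordered children. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    interface SyntaxParser { Node parse(String text); }  // layer 1: text -> tree
    interface Processor { Node process(Node root); }     // layer 2: tree -> tree
    interface Builder { String build(Node root); }       // layer 3: tree -> query

    /** Toy parser for the infix syntax 'a AND b'. */
    static Node parseInfix(String text) {
        Node and = new Node("AND");
        for (String term : text.trim().split("\\s+AND\\s+")) {
            and.children.add(new Node(term));
        }
        return and;
    }

    /** Toy parser for the prefix syntax '+a +b'; yields the same tree shape. */
    static Node parsePrefix(String text) {
        Node and = new Node("AND");
        for (String term : text.trim().split("\\s+")) {
            and.children.add(new Node(term.startsWith("+") ? term.substring(1) : term));
        }
        return and;
    }

    /** Example processor: normalizes leaf terms to lower case. */
    static Node lowercase(Node node) {
        if (node.children.isEmpty()) {
            return new Node(node.label.toLowerCase(Locale.ROOT));
        }
        Node copy = new Node(node.label);
        for (Node child : node.children) {
            copy.children.add(lowercase(child));
        }
        return copy;
    }

    /** Example builder: renders each child of the root as a required clause. */
    static String buildRequired(Node root) {
        StringBuilder sb = new StringBuilder();
        for (Node child : root.children) {
            if (sb.length() > 0) sb.append(' ');
            sb.append('+').append(child.label);
        }
        return sb.toString();
    }

    static final List<Processor> DEFAULT_CHAIN =
            Arrays.<Processor>asList(ThreeLayerSketch::lowercase);

    /** Runs the three layers in order: parse, process (chain), build. */
    static String run(SyntaxParser parser, List<Processor> chain, Builder builder, String text) {
        Node tree = parser.parse(text);
        for (Processor p : chain) {
            tree = p.process(tree);
        }
        return builder.build(tree);
    }

    public static void main(String[] args) {
        // Two different syntaxes, one shared processor chain and builder.
        System.out.println(run(ThreeLayerSketch::parseInfix, DEFAULT_CHAIN,
                ThreeLayerSketch::buildRequired, "A AND B"));
        System.out.println(run(ThreeLayerSketch::parsePrefix, DEFAULT_CHAIN,
                ThreeLayerSketch::buildRequired, "+A +B"));
    }
}
```

Both calls in main go through the same processor chain and builder; only the first layer differs, which is the point of keeping syntax apart from semantics.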
[jira] Commented: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717535#action_12717535 ]

Luis Alves commented on LUCENE-1567:

I actually think we should give the parser to contrib in 2.9 using JDK 1.5 syntax and move it to main in 3.0, still using JDK 1.5 syntax.

I don't think it's a small change, and this change will affect the interfaces and future versions of the parser (to be 1.4 compatible).

I would see nothing wrong with having a JDK 1.4 version if we were 100% compatible with the old queryparser, but since that is not the case, I don't think it is worth it. (The wrapper we built does not support the case where users extend the old queryparser class and override methods to add new functionality.)

If everyone else thinks making the queryparser interfaces 1.4 compatible is a must, I will be OK with it, but only if we actually move the new queryparser to main in 2.9 and break compatibility with the old Lucene QueryParser class for users that are extending this class.

The new queryparser supports 100% of the syntax and passes 100% of the Lucene JUnit tests, but it does not support users that extended the QueryParser class and overrode some methods.
[jira] Commented: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717543#action_12717543 ]

Adriano Crestani commented on LUCENE-1567:

I went through the new QP and listed what exactly needs to be changed:

The QueryNode class has three methods that use generics: set(List<QueryNode>), add(List<QueryNode>) and List<QueryNode> getChildren(). All the generics would be removed. I don't see any back-compatibility problem if we add generics in the future: we could hardcode the type checking if we release with 1.4, and any user implementation of this class will need to do the same and follow the documentation.

ModifierQueryNode has an enum called Modifier with values MOD_NOT, MOD_NONE and MOD_REQ. An enum can be almost completely reproduced on 1.4 using:

    ...
    public static final class Modifier implements Serializable {
        public static final Modifier MOD_NOT = new Modifier();
        public static final Modifier MOD_NONE = new Modifier();
        public static final Modifier MOD_REQ = new Modifier();

        private Modifier() {
            // empty constructor
        }

        // we might add some Enum methods, like name(), etc...
    }
    ...

The only back-compatibility problem I see when we change Modifier to an enum again is if, on the 1.4 version, the user checks Modifier.class.isEnum()... does anybody see any other back-compatibility issue?

The last thing that will need to be changed is in QueryBuilder and LuceneQueryBuilder. QueryBuilder.build() returns an Object, and when LuceneQueryBuilder implements it, it specializes the return type to Query. If we change to 1.4, which has no covariant return types, it will have to return Object instead. In this case I don't see any back-compatibility issue either.

Regarding the new QP framework, I don't see any problem with back compatibility, because Lucene will only move to Java 1.5 in version 3.0, where back compatibility may be broken. But...

"I would see nothing wrong with having a jdk 1.4 version if we were 100% compatible with the old queryparser, but since that is not the case, I don't think it is worth it. (the wrapper we built does not support the case where users extend the old queryparser class and overwrite methods to add new functionality)"

I agree with Luis: if we only release the new QP framework in 2.9, we will definitely break the back-compatibility of the old QP. So, why not release the old and the new QP together in 2.9?

Suggestions? :)

Best Regards,
Adriano Crestani Campos
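
A note on the typesafe-enum idea in the comment above: a Serializable class with a private constructor does not by itself guarantee that a deserialized value is the same object as the constant, so == comparisons can fail after a round trip. A common remedy is a readResolve() method. The sketch below is only an illustration written in Java 1.4-compatible syntax; the name and ordinal fields and the ModifierDemo helper are assumptions added for the example and are not taken from the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;

// Hypothetical demo harness, not part of the proposed query parser.
class ModifierDemo {

    /** Serializes and deserializes an object in memory. */
    static Object roundTrip(Object value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(value);
            out.close();
            ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            return in.readObject();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Object copy = roundTrip(Modifier.MOD_REQ);
        // Identity survives the round trip thanks to readResolve().
        System.out.println(copy == Modifier.MOD_REQ); // prints "true"
    }
}

// Typesafe enum pattern in Java 1.4 syntax (no enum keyword, no generics).
final class Modifier implements Serializable {
    private static final long serialVersionUID = 1L;

    // Must be declared before the constants so it is initialized first.
    private static int nextOrdinal = 0;

    public static final Modifier MOD_NOT = new Modifier("MOD_NOT");
    public static final Modifier MOD_NONE = new Modifier("MOD_NONE");
    public static final Modifier MOD_REQ = new Modifier("MOD_REQ");

    private static final Modifier[] VALUES = { MOD_NOT, MOD_NONE, MOD_REQ };

    private final transient String name; // not serialized; only ordinal is
    private final int ordinal;

    private Modifier(String name) {
        this.name = name;
        this.ordinal = nextOrdinal++;
    }

    /** Mirrors Enum.name() so a later switch to a real enum stays source compatible. */
    public String name() {
        return name;
    }

    public String toString() {
        return name;
    }

    /** Replaces the freshly deserialized instance with the canonical constant. */
    private Object readResolve() throws ObjectStreamException {
        return VALUES[ordinal];
    }
}
```

Without readResolve(), the round trip would yield a fourth Modifier instance that equals none of the three constants by identity.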
[jira] Commented: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717560#action_12717560 ]

Luis Alves commented on LUCENE-1567:

A couple more changes will be needed. We also have to change

    List<QueryNode> getChildren();

and

    public Map<CharSequence, Object> getTags();

We also have to change QueryNodeImpl: we will have to patch all QueryNode implementation classes and perform forced casts, and users implementing QueryNode will also have to do that.

It's about 30 changes, not that big a change, I agree. But if we release both parsers I see no need to change it.

"I agree with Luis, if we only release the new QP framework 2.9, we will definitely brake the back-compatiblity of the old QP, so, why not release the old and the new QP together on 2.9?"

Some extras: if we choose to release both parsers, we should deprecate the old one, allowing people to migrate to the new one with release 2.9, and drop the old queryparser classes in 3.0. (We can keep the wrappers in 2.9, throwing exceptions in all methods to remind people to move to the new framework. We can probably also keep the wrapper in 3.0, if we think it is still necessary.)
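
To make the "forced casts" mentioned in the comment above concrete, here is a small sketch of what a generics-free, Java 1.4-style node API tends to look like for both the implementer and the caller. The names Node14, Builder14, TextBuilder14 and RawTypesDemo are hypothetical and chosen for this illustration only; they are not the classes from the patch. The code deliberately uses raw types, so a modern compiler will accept it but report unchecked/rawtypes warnings.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of a 1.4-style API; not code from LUCENE-1567.
class RawTypesDemo {

    /** Counts the nodes of a tree; every child access needs a cast. */
    static int countNodes(Node14 node) {
        int count = 1;
        List children = node.getChildren();          // raw List, as in Java 1.4
        for (Iterator it = children.iterator(); it.hasNext();) {
            Node14 child = (Node14) it.next();       // forced cast
            count += countNodes(child);
        }
        return count;
    }

    public static void main(String[] args) {
        Node14 and = new Node14("AND");
        and.add(new Node14("a"));
        and.add(new Node14("b"));
        System.out.println(countNodes(and));

        // Without covariant return types, build() can only promise Object,
        // so the caller has to cast the result as well.
        String text = (String) new TextBuilder14().build(and);
        System.out.println(text);
    }
}

class Node14 {
    private final String label;
    private final List children = new ArrayList();   // element type only documented

    Node14(String label) {
        this.label = label;
    }

    void add(Node14 child) {
        children.add(child);
    }

    /** Returns a List of Node14; callers must cast each element. */
    List getChildren() {
        return children;
    }

    String getLabel() {
        return label;
    }
}

interface Builder14 {
    Object build(Node14 root);
}

class TextBuilder14 implements Builder14 {
    // In Java 1.4 an implementation cannot narrow this return type to String.
    public Object build(Node14 root) {
        StringBuffer sb = new StringBuffer(root.getLabel());
        List children = root.getChildren();
        if (!children.isEmpty()) {
            sb.append('(');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append((String) build((Node14) children.get(i)));
            }
            sb.append(')');
        }
        return sb.toString();
    }
}
```

With Java 5 generics and covariant returns, getChildren() could return List<Node14> and TextBuilder14.build() could declare String, and all of the casts above would disappear.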