Re: Solr 1.5 or 2.0?
On Thu, Nov 19, 2009 at 2:53 AM, Yonik Seeley yo...@lucidimagination.com wrote:
> What should the next version of Solr be? Options:
> - have a Solr 1.5 with a lucene 2.9.x
> - have a Solr 1.5 with a lucene 3.x, with weaker back compat given all of the removed lucene deprecations from 2.9-3.0
> - have a Solr 2.0 with a lucene 3.x

My first feeling is that Solr 2.0 with Lucene 3.x would be a clean cut. What is your back compat policy for major version jumps?

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2075:
----------------------------------
Attachment: LUCENE-2075.patch

Updated patch: adds the missing @Overrides we added in 3.0, and also makes the private PQ implement Iterable, so the markAndSweep code is now syntactic sugar :-)

> Share the Term -> TermInfo cache across threads
> -----------------------------------------------
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Priority: Minor
> Fix For: 3.1
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch
>
> Right now each thread creates its own (thread-private) SimpleLRUCache, holding up to 1024 terms. This is rather wasteful, since if there are a high number of threads that come through Lucene, you're multiplying the RAM usage. You're also cutting way back on the likelihood of a cache hit (except the known multiple times we look up a term within a query, which uses one thread). In NRT search we open new SegmentReaders (on tiny segments) often, which each thread must then spend CPU/RAM creating and populating.
> Now that we are on Java 1.5 we can use java.util.concurrent.*, e.g. ConcurrentHashMap. One simple approach could be a double-barrel LRU cache, using 2 maps (primary, secondary). You check the cache by first checking primary; if that's a miss, you check secondary and if you get a hit you promote it to primary. Once primary is full you clear secondary and swap them.
> Or... any other suggested approach?

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
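The double-barrel approach described in the comment above can be sketched with two ConcurrentHashMaps. This is a minimal illustration of the idea, not the attached patch; the class and method names are hypothetical, and real size accounting under contention is ignored:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the double-barrel LRU: check primary, fall back to secondary
// and promote hits; once primary fills, it becomes the new secondary and
// the old secondary's entries are discarded.
class DoubleBarrelCache<K, V> {
  private final int maxSize;
  private volatile Map<K, V> primary = new ConcurrentHashMap<K, V>();
  private volatile Map<K, V> secondary = new ConcurrentHashMap<K, V>();

  DoubleBarrelCache(int maxSize) { this.maxSize = maxSize; }

  V get(K key) {
    V v = primary.get(key);
    if (v != null) return v;
    v = secondary.get(key);
    if (v != null) put(key, v); // hit in secondary: promote to primary
    return v;
  }

  void put(K key, V value) {
    if (primary.size() >= maxSize) {
      secondary = primary;                     // "clear secondary and swap"
      primary = new ConcurrentHashMap<K, V>();
    }
    primary.put(key, value);
  }
}
```

Note a promotion re-inserts through put(), so a promotion can itself trigger the swap; a production version would also want cheaper size tracking than ConcurrentHashMap.size().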
RE: [jira] Commented: (LUCENE-1799) Unicode compression
Hi Robert,

On 11/18/2009 at 7:16 PM, Robert Muir wrote:
> Looking at the collation support, we could maybe improve IndexableBinaryStringTools by using char[]/byte[] with offset and length. The existing ByteBuffer/CharBuffer methods could stay, they are consistent with the Charset api and are not wrong imo, but instead defer to the new char[]/byte[] ones... the current buffer-based ones require the buffer to have a backing array anyway or will throw an exception.

+1

I used *Buffers because I thought it simplified method prototypes, no other reason.

Steve
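Robert's point about backing arrays is easy to see with the stdlib alone. This sketch (no Lucene classes; the helper name is hypothetical) shows that wrapping a char[] only records an (array, offset, length) triple, which the proposed array-based methods would take directly, and why a direct buffer would fail:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

class BufferWrapDemo {
  /** Array index that a wrapped buffer's current position maps to. */
  static int backingIndex(CharBuffer buf) {
    return buf.arrayOffset() + buf.position();
  }

  public static void main(String[] args) {
    char[] chars = "collation".toCharArray();
    // wrap() just records (array, offset, length); the char[],int,int
    // methods discussed above pass the same three values directly.
    CharBuffer wrapped = CharBuffer.wrap(chars, 2, 5);
    System.out.println(wrapped.hasArray());    // true: backed by 'chars'
    System.out.println(backingIndex(wrapped)); // 2
    // A direct buffer has no accessible backing array, which is why the
    // existing buffer-based code throws for such buffers:
    System.out.println(ByteBuffer.allocateDirect(16).hasArray()); // false
  }
}
```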
RE: Solr 1.5 or 2.0?
We also had some (maybe helpful) opinions :-)

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Thursday, November 19, 2009 3:31 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Solr 1.5 or 2.0?
>
> Oops... of course I meant to post this in solr-dev.
> -Yonik
> http://www.lucidimagination.com
>
> On Wed, Nov 18, 2009 at 8:53 PM, Yonik Seeley yo...@lucidimagination.com wrote:
> > What should the next version of Solr be? Options:
> > - have a Solr 1.5 with a lucene 2.9.x
> > - have a Solr 1.5 with a lucene 3.x, with weaker back compat given all of the removed lucene deprecations from 2.9-3.0
> > - have a Solr 2.0 with a lucene 3.x
Re: Solr 1.5 or 2.0?
Option 3 looks best. But do we plan to remove anything we have not already marked as deprecated?

On Thu, Nov 19, 2009 at 8:10 PM, Uwe Schindler u...@thetaphi.de wrote:
> We also had some (maybe helpful) opinions :-)

--
Noble Paul | Principal Engineer | AOL | http://aol.com
Re: [jira] Commented: (LUCENE-1799) Unicode compression
Steven, do you still have a test setup to measure collation key generation performance with Lucene?

On Thu, Nov 19, 2009 at 9:38 AM, Steven A Rowe sar...@syr.edu wrote:
> +1
>
> I used *Buffers because I thought it simplified method prototypes, no other reason.
>
> Steve

--
Robert Muir
rcm...@gmail.com
[jira] Commented: (LUCENE-1799) Unicode compression
[ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780129#action_12780129 ]

DM Smith commented on LUCENE-1799:
----------------------------------
The sample code is probably what is on this page: http://unicode.org/notes/tn6/#Sample_Code

From what I gather reading the whole page: if we port the sample code and the test case and then demonstrate that all tests pass, then we will be granted a license. There's contact info at the bottom of the page for getting the license. Maybe contact them for clarification?

As the code is fairly small, I don't think it would be too hard to port. The trick is that the sample code appears to deal in 32-bit arrays and we'd probably want a byte[].

> Unicode compression
> -------------------
> Key: LUCENE-1799
> URL: https://issues.apache.org/jira/browse/LUCENE-1799
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Store
> Affects Versions: 2.4.1
> Reporter: DM Smith
> Priority: Minor
>
> In LUCENE-1793, there is the off-topic suggestion to provide compression of Unicode data. The motivation was a custom encoding in a Russian analyzer. The original supposition was that it provided a more compact index. This led to the comment that a different or compressed encoding would be a generally useful feature.
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM with an implementation in ICU. If Lucene provided its own implementation, a freely available, royalty-free license would need to be obtained. SCSU is another Unicode compression algorithm that could be used.
> An advantage of these methods is that they work on the whole of Unicode. If that is not needed, an encoding such as iso8859-1 (or whatever covers the input) could be used.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
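To make the issue's last point concrete: SCSU and BOCU-1 are not in the JDK, but the size difference motivating them shows up with plain charsets. A stdlib-only comparison, assuming the input happens to fit a single-byte legacy charset (accented Latin here, standing in for the Russian case):

```java
import java.nio.charset.StandardCharsets;

class EncodingSizeDemo {
  public static void main(String[] args) {
    // "café résumé": the three accented letters cost 2 bytes each in
    // UTF-8, but 1 byte each in a single-byte charset covering the input.
    String s = "caf\u00e9 r\u00e9sum\u00e9";
    int utf8 = s.getBytes(StandardCharsets.UTF_8).length;        // 14
    int latin1 = s.getBytes(StandardCharsets.ISO_8859_1).length; // 11
    System.out.println(utf8 + " bytes vs " + latin1 + " bytes");
  }
}
```

SCSU/BOCU-1 aim for roughly the single-byte size while still covering all of Unicode, which is the advantage the issue description notes.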
RE: [jira] Commented: (LUCENE-1799) Unicode compression
Hi Robert,

Ack, actually two days ago I updated my Lucene trunk checkout and removed that code, thinking its utility had evaporated! But maybe IntelliJ will save my bacon in its local history cache. (Praise IntelliJ!) I'll check tonight when I get home.

Steve

On 11/19/2009 at 10:16 AM, Robert Muir wrote:
> Steven, do you still have a test setup to measure collation key generation performance with Lucene?
Re: [jira] Commented: (LUCENE-1799) Unicode compression
Doh! Well, if you have it, that will be very handy for verification. I'll create a separate issue for this shortly; maybe you can review the patch.

Thanks,
Robert

On Thu, Nov 19, 2009 at 1:06 PM, Steven A Rowe sar...@syr.edu wrote:
> Ack, actually two days ago I updated my Lucene trunk checkout and removed that code, thinking its utility had evaporated! But maybe IntelliJ will save my bacon in its local history cache. (Praise IntelliJ!) I'll check tonight when I get home.

--
Robert Muir
rcm...@gmail.com
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780193#action_12780193 ]

David Kaelbling commented on LUCENE-2039:
-----------------------------------------
I apologize if I haven't read the comments carefully enough, but in LUCENE-2039_field_ext.patch why is ExtendableQueryParser final? That means (for example) that ComplexPhraseQueryParser cannot subclass it. In the earlier LUCENE-2039.patch the complex phrase parser picked up the changes for free.

> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
> Key: LUCENE-2039
> URL: https://issues.apache.org/jira/browse/LUCENE-2039
> Project: Lucene - Java
> Issue Type: Improvement
> Components: QueryParser
> Reporter: Simon Willnauer
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 3.1
> Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch
>
> Since the early days the standard query parser was limited to the queries living in core; adding other queries or extending the parser in any way always forced people to change the grammar file and regenerate. Even if you change the grammar, you have to be extremely careful how you modify the parser so that other parts of the standard parser are not affected by customisation changes. Eventually you had to live with all the limitations the current parser has, like tokenizing on whitespace before a tokenizer / analyzer has the chance to look at the tokens.
> I was thinking about how to overcome the limitation and add regex support to the query parser without introducing any dependency on core. I added a new special character that basically prevents the parser from interpreting any of the characters enclosed in the new special characters. I chose the forward slash '/' as the delimiter, so that everything between two forward slashes is basically escaped and ignored by the parser. All chars embedded within forward slashes are treated as one token even if they contain other special chars like * []?{} or whitespace.
> This token is subsequently passed to a pluggable parser extension which builds a query from the embedded string. I do not interpret the embedded string in any way but leave all the subsequent work to the parser extension. Such an extension could be another full-featured query parser itself or simply a ctor call for a regex query. The interface remains quite simple but makes the parser extensible in an easy way compared to modifying the JavaCC sources.
> The downside of this patch is clearly that I introduce a new special char into the syntax, but I guess that would not be that much of a deal as it is reflected in the escape method. It would truly be nice to have more than one extension and have this even more flexible, so treat this patch as a kickoff.
> Another way of solving the problem with RegexQuery would be to move the JDK version of regex into the core and simply have another method like:
> {code}
> protected Query newRegexQuery(Term t) { ... }
> {code}
> which I would like better as it would be more consistent with the idea of the query parser being a very strict and defined parser.
> I will upload a patch in a second which implements the extension-based approach. I guess I will add a second patch with regex in core soon too.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
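The pluggable-extension idea in the description above can be illustrated with a stdlib-only toy. All names here are hypothetical (the real patch hooks into the JavaCC grammar), and this toy only recognizes /.../ runs not containing a slash, but it shows the division of labor: the parser hands the embedded string over uninterpreted, and the extension builds whatever query it likes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy sketch: everything between two forward slashes is kept as one token,
// even if it contains whitespace or special chars, and is passed verbatim
// to a pluggable extension; everything else is tokenized on whitespace.
class SlashTokenDemo {
  interface ParserExtension {
    String parse(String embedded); // e.g. build a regex query from it
  }

  static List<String> parse(String query, ParserExtension ext) {
    List<String> out = new ArrayList<String>();
    // Alternation order matters: try the /.../ form before a plain term.
    Matcher m = Pattern.compile("/([^/]*)/|(\\S+)").matcher(query);
    while (m.find()) {
      if (m.group(1) != null) {
        out.add(ext.parse(m.group(1))); // embedded chars left untouched
      } else {
        out.add(m.group(2));            // ordinary term
      }
    }
    return out;
  }
}
```

For example, parsing "foo /a b*?/ bar" yields the term "foo", the extension's result for the raw string "a b*?" (whitespace and wildcards preserved), and the term "bar".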
[jira] Created: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation
remove Byte/CharBuffer wrapping for collation key generation
------------------------------------------------------------
Key: LUCENE-2084
URL: https://issues.apache.org/jira/browse/LUCENE-2084
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
Fix For: 3.1
Attachments: LUCENE-2084.patch

We can remove the overhead of ByteBuffer and CharBuffer wrapping in CollationKeyFilter and ICUCollationKeyFilter. This patch moves the logic in IndexableBinaryStringTools into char[],int,int and byte[],int,int based methods, with the previous Byte/CharBuffer methods delegating to these. Previously, the Byte/CharBuffer methods required a backing array anyway.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
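For context, a stdlib-only sketch of the collation keys involved (shown with java.text.Collator; the ICU filter uses ICU's equivalent API). The key's byte[] is what IndexableBinaryStringTools encodes as indexable chars, and this issue trims the Buffer wrapping around that step:

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Collation keys, not raw strings, give locale- and strength-aware term
// ordering; the byte[] below is the payload the encoding step receives.
class CollationKeyDemo {
  public static void main(String[] args) {
    Collator c = Collator.getInstance(Locale.US);
    c.setStrength(Collator.PRIMARY); // ignore case and accent differences
    CollationKey k1 = c.getCollationKey("r\u00e9sum\u00e9"); // "résumé"
    CollationKey k2 = c.getCollationKey("RESUME");
    System.out.println(k1.compareTo(k2)); // 0: identical primary weights
    System.out.println(k1.toByteArray().length > 0); // true
  }
}
```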
[jira] Updated: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation
[ https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2084:
--------------------------------
Attachment: LUCENE-2084.patch

> remove Byte/CharBuffer wrapping for collation key generation
> ------------------------------------------------------------
[jira] Updated: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation
[ https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2084:
--------------------------------
Priority: Minor (was: Major)

> remove Byte/CharBuffer wrapping for collation key generation
> ------------------------------------------------------------
[jira] Issue Comment Edited: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780193#action_12780193 ]

David Kaelbling edited comment on LUCENE-2039 at 11/19/09 8:22 PM:
-------------------------------------------------------------------
I apologize if I haven't read the comments carefully enough, but in LUCENE-2039_field_ext.patch why is ExtendableQueryParser final? That means (for example) that ComplexPhraseQueryParser cannot subclass it. In the earlier LUCENE-2039.patch the complex phrase parser picked up the changes for free.

And would RegexParserExtension maybe be easier to use if it set the RegexCapabilities on the new RegexQuery it is returning?

was (Author: dkaelbl...@blackducksoftware.com):
I apologize if I haven't read the comments carefully enough, but in LUCENE-2039_field_ext.patch why is ExtendableQueryParser final? That means (for example) that ComplexPhraseQueryParser cannot subclass it. In the earlier LUCENE-2039.patch the complex phrase parser picked up the changes for free.

> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780254#action_12780254 ]

Robert Muir commented on LUCENE-2039:
-------------------------------------
Hi, in my opinion RegexParserExtension should not be tied to RegexQuery/RegexCapabilities. This is only one possible implementation of regex support, and it has some scalability problems.

> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780258#action_12780258 ]

Simon Willnauer commented on LUCENE-2039:
-----------------------------------------
bq. That means (for example) that ComplexPhraseQueryParser cannot subclass it

This patch was not meant to include ComplexPhraseQueryParser; it is rather a proposal for the concept of field overloading. But you are right, the parser should not be final at all; especially if you want to override a get*Query method it should be extendable.

bq. Hi, in my opinion RegexParserExtension should not be tied to RegexQuery/RegexCapabilities. This is only one possible implementation of regex support and has some scalability problems.

Also true, but again this is just a POC to show what it would look like. Comments on the concept would be more useful by now. I wrote that up during a train ride and aimed to get some comments. I have already worked on it and will upload a new patch soon which includes RegexCapabilities + tests. Thanks again for the pointer about the final class.

> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
[jira] Updated: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2039:
------------------------------------
Attachment: LUCENE-2039_field_ext.patch

Updated the patch:
- removed final modifier from ExtendableQueryParser
- added RegexCapabilities ctor to RegexParserExtension

I still need to work on the Extensions JavaDoc, and I'm not too happy with the name. Comments on the concept are very welcome.

> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
[jira] Created: (LUCENE-2085) Update PayloadSpanUtil
Update PayloadSpanUtil -- Key: LUCENE-2085 URL: https://issues.apache.org/jira/browse/LUCENE-2085 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.9.1 Reporter: Mark Miller Assignee: Mark Miller -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780316#action_12780316 ] Mark Miller commented on LUCENE-2039: - It looks like the patch puts this in core? Any compelling reason? Offhand I'd think it would go in the misc contrib with the other queryparsers that extend the core queryparser. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Solr 1.5 or 2.0?
I would love to set goals that are ~3 months out so that we don't have another 1 year release cycle. For a 2.0 release where we could have more back-compatibility flexibility, i would love to see some work that may be too ambitious... In particular, the config spaghetti needs some attention. I don't see the need to increment solr to 2.0 for the lucene 3.0 change -- of course that needs to be noted, but incrementing the major number in solr only makes sense if we are going to change *solr* significantly. The lucene 2.x - 3.0 upgrade path seems independent of that to me. I would even argue that with solr 1.4 we have already required many lucene 3.0 changes -- All my custom lucene stuff had to be reworked to work with solr 1.4 (tokenizers, multi-reader filters). In general, I wonder where the solr back-compatibility contract applies (and to what degree). For solr, I would rank the importance as: #1 - the URL API syntax. Client query parameters should change as little as possible #2 - configuration #3 - java APIs With that in mind, i think 'solr 1.5 with lucene 3.x' makes the most sense. Unless we see making serious changes to solr that would warrant a major release bump. Lucene has an explicit back-compatibility contract: http://wiki.apache.org/lucene-java/BackwardsCompatibility I don't know if solr has one... if we make one, I would like it to focus on the URL syntax+configuration ryan On Nov 18, 2009, at 5:53 PM, Yonik Seeley wrote: What should the next version of Solr be? Options: - have a Solr 1.5 with a lucene 2.9.x - have a Solr 1.5 with a lucene 3.x, with weaker back compat given all of the removed lucene deprecations from 2.9-3.0 - have a Solr 2.0 with a lucene 3.x -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Solr 1.5 or 2.0?
Ryan McKinley wrote: I would love to set goals that are ~3 months out so that we don't have another 1 year release cycle. For a 2.0 release where we could have more back-compatibility flexibility, i would love to see some work that may be too ambitious... In particular, the config spaghetti needs some attention. I don't see the need to increment solr to 2.0 for the lucene 3.0 change -- of course that needs to be noted, but incrementing the major number in solr only makes sense if we are going to change *solr* significantly. Lucene major numbers don't work that way, and I don't think Solr needs to work that way by default. I think major numbers are better for indicating backwards compat issues than major features with the way these projects work. Which is why Yonik mentions 1.5 with weaker back compat - it's not just the fact that we are going to Lucene 3.x - it's that Solr still relies on some of the APIs that won't be around in 3.x - they are not all trivial to remove or to remove while preserving back compat. The lucene 2.x - 3.0 upgrade path seems independent of that to me. I would even argue that with solr 1.4 we have already required many lucene 3.0 changes -- All my custom lucene stuff had to be reworked to work with solr 1.4 (tokenizers, multi-reader filters). Many - but certainly not all. In general, I wonder where the solr back-compatibility contract applies (and to what degree). For solr, I would rank the importance as: #1 - the URL API syntax. Client query parameters should change as little as possible #2 - configuration #3 - java APIs Someone else would likely rank it differently - not everyone using Solr even uses HTTP with it. Someone heavily involved in custom plugins might care more about that than config. As a dev, I just plainly rank them all as important and treat them on a case by case basis. With that in mind, i think 'solr 1.5 with lucene 3.x' makes the most sense.
Unless we see making serious changes to solr that would warrant a major release bump. What is a serious change that would warrant a bump in your opinion? Lucene has an explicit back-compatibility contract: http://wiki.apache.org/lucene-java/BackwardsCompatibility I don't know if solr has one... if we make one, I would like it to focus on the URL syntax+configuration It's not nice to give people plugins and then not worry about back compat for them :) ryan On Nov 18, 2009, at 5:53 PM, Yonik Seeley wrote: What should the next version of Solr be? Options: - have a Solr 1.5 with a lucene 2.9.x - have a Solr 1.5 with a lucene 3.x, with weaker back compat given all of the removed lucene deprecations from 2.9-3.0 - have a Solr 2.0 with a lucene 3.x -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
CustomScoreQuery Explanation
Hi there - I'm helping out with the Lucene.Net port of 2.9, and when rooting around in CustomScoreQuery.CustomWeight, I noticed what appears to be an unnecessary call to doExplain in the explain method. Current method in trunk: public Explanation explain(IndexReader reader, int doc) throws IOException { Explanation explain = doExplain(reader, doc); return explain == null ? new Explanation(0.0f, "no matching docs") : doExplain(reader, doc); } Is there a reason it shouldn't be: public Explanation explain(IndexReader reader, int doc) throws IOException { Explanation explain = doExplain(reader, doc); return explain == null ? new Explanation(0.0f, "no matching docs") : explain; } I might be overlooking something, but it appears to be two calls to doExplain when only one would suffice. Michael Michael Garski Sr. Search Architect 310.969.7435 (office) 310.251.6355 (mobile) www.myspace.com/michaelgarski
Re: CustomScoreQuery Explanation
I don't see any reason why doExplain should be called twice. Can you create an issue in jira please? Simon On Nov 20, 2009 1:30 AM, Michael Garski mgar...@myspace-inc.com wrote: Hi there – I'm helping out with the Lucene.Net port of 2.9, and when rooting around in CustomScoreQuery.CustomWeight, I noticed what appears to be an unnecessary call to doExplain in the explain method. Current method in trunk: public Explanation explain(IndexReader reader, int doc) throws IOException { Explanation explain = doExplain(reader, doc); return explain == null ? new Explanation(0.0f, "no matching docs") : doExplain(reader, doc); } Is there a reason it shouldn't be: public Explanation explain(IndexReader reader, int doc) throws IOException { Explanation explain = doExplain(reader, doc); return explain == null ? new Explanation(0.0f, "no matching docs") : explain; } I might be overlooking something, but it appears to be two calls to doExplain when only one would suffice. Michael Michael Garski Sr. Search Architect 310.969.7435 (office) 310.251.6355 (mobile) www.myspace.com/michaelgarski
RE: CustomScoreQuery Explanation
Will do, along with a patch. Michael From: Simon Willnauer [mailto:simon.willna...@googlemail.com] Sent: Thursday, November 19, 2009 4:47 PM To: java-dev@lucene.apache.org Subject: Re: CustomScoreQuery Explanation I don't see any reason why doExplain should be called twice. Can you create an issue in jira please? Simon On Nov 20, 2009 1:30 AM, Michael Garski mgar...@myspace-inc.com wrote: Hi there – I'm helping out with the Lucene.Net port of 2.9, and when rooting around in CustomScoreQuery.CustomWeight, I noticed what appears to be an unnecessary call to doExplain in the explain method. Current method in trunk: public Explanation explain(IndexReader reader, int doc) throws IOException { Explanation explain = doExplain(reader, doc); return explain == null ? new Explanation(0.0f, "no matching docs") : doExplain(reader, doc); } Is there a reason it shouldn't be: public Explanation explain(IndexReader reader, int doc) throws IOException { Explanation explain = doExplain(reader, doc); return explain == null ? new Explanation(0.0f, "no matching docs") : explain; } I might be overlooking something, but it appears to be two calls to doExplain when only one would suffice. Michael Michael Garski Sr. Search Architect 310.969.7435 (office) 310.251.6355 (mobile) www.myspace.com/michaelgarski
Re: Solr 1.5 or 2.0?
On Nov 19, 2009, at 3:34 PM, Mark Miller wrote: Ryan McKinley wrote: I would love to set goals that are ~3 months out so that we don't have another 1 year release cycle. For a 2.0 release where we could have more back-compatibility flexibility, i would love to see some work that may be too ambitious... In particular, the config spaghetti needs some attention. I don't see the need to increment solr to 2.0 for the lucene 3.0 change -- of course that needs to be noted, but incrementing the major number in solr only makes sense if we are going to change *solr* significantly. Lucene major numbers don't work that way, and I don't think Solr needs to work that way by default. I think major numbers are better for indicating backwards compat issues than major features with the way these projects work. Which is why Yonik mentions 1.5 with weaker back compat - it's not just the fact that we are going to Lucene 3.x - it's that Solr still relies on some of the APIs that won't be around in 3.x - they are not all trivial to remove or to remove while preserving back compat. I confess I don't know the details of the changes that have not yet been integrated in solr -- the only lucene changes I am familiar with are those required for solr 1.4. The lucene 2.x - 3.0 upgrade path seems independent of that to me. I would even argue that with solr 1.4 we have already required many lucene 3.0 changes -- All my custom lucene stuff had to be reworked to work with solr 1.4 (tokenizers, multi-reader filters). Many - but certainly not all. Just my luck... I'm batting 1000 :) But that means my code can upgrade to 3.0 without an issue now! In general, I wonder where the solr back-compatibility contract applies (and to what degree). For solr, I would rank the importance as: #1 - the URL API syntax. Client query parameters should change as little as possible #2 - configuration #3 - java APIs Someone else would likely rank it differently - not everyone using Solr even uses HTTP with it.
Someone heavily involved in custom plugins might care more about that than config. As a dev, I just plainly rank them all as important and treat them on a case by case basis. I think it is fair to suggest that people will have the most stable/consistent/seamless upgrade path if you stick to the HTTP API (and by extension most of the solrj API). I am not suggesting that the java APIs are not important and that back-compatibility is not important. Solr has some APIs with a clear purpose, place, and intended use -- we need to take these very seriously. We also have lots of APIs that are half-baked and loosey-goosey. If a developer is working on the edges, i think it is fair to expect more hiccups in the upgrade path. With that in mind, i think 'solr 1.5 with lucene 3.x' makes the most sense. Unless we see making serious changes to solr that would warrant a major release bump. What is a serious change that would warrant a bump in your opinion? for example: - config overhaul. detangle the XML from the components. perhaps using spring. - major URL request changes. perhaps we change things to be more RESTful -- perhaps let jersey take care of the URL/request building https://jersey.dev.java.net/ - perhaps OSGi support/control/configuration Lucene has an explicit back-compatibility contract: http://wiki.apache.org/lucene-java/BackwardsCompatibility I don't know if solr has one... if we make one, I would like it to focus on the URL syntax+configuration It's not nice to give people plugins and then not worry about back compat for them :) i want to be nice. I just think that a different back compatibility contract applies for solr than for lucene. It seems reasonable to consider the HTTP API, configs, and java API independently. From my perspective, saying solr 1.5 uses lucene 3.0 implies everything a plugin developer using lucene APIs needs to know about the changes.
To be clear, I am not against bumping to solr 2.0 -- I just have high aspirations (yet little time) for what a 2.0 bump could mean for solr. ryan - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: CustomScoreQuery Explanation
No worries - I think it's a bit overkill for the change - I can just pop it in real quick. Michael Garski wrote: Will do, along with a patch. Michael From: Simon Willnauer [mailto:simon.willna...@googlemail.com] Sent: Thursday, November 19, 2009 4:47 PM To: java-dev@lucene.apache.org Subject: Re: CustomScoreQuery Explanation I don't see any reason why doExplain should be called twice. Can you create an issue in jira please? Simon On Nov 20, 2009 1:30 AM, Michael Garski mgar...@myspace-inc.com mailto:mgar...@myspace-inc.com wrote: Hi there – I'm helping out with the Lucene.Net port of 2.9, and when rooting around in CustomScoreQuery.CustomWeight, I noticed what appears to be an unnecessary call to doExplain in the explain method. Current method in trunk: public Explanation explain(IndexReader reader, int doc) throws IOException { Explanation explain = doExplain(reader, doc); return explain == null ? new Explanation(0.0f, "no matching docs") : doExplain(reader, doc); } Is there a reason it shouldn't be: public Explanation explain(IndexReader reader, int doc) throws IOException { Explanation explain = doExplain(reader, doc); return explain == null ? new Explanation(0.0f, "no matching docs") : explain; } I might be overlooking something, but it appears to be two calls to doExplain when only one would suffice. Michael Michael Garski Sr. Search Architect 310.969.7435 (office) 310.251.6355 (mobile) www.myspace.com/michaelgarski http://www.myspace.com/michaelgarski -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2037) Allow Junit4 tests in our environment.
[ https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780368#action_12780368 ] Erick Erickson commented on LUCENE-2037: Well, last night I changed LocalizedTestCase to do the @RunWith and @Parameterized thing and it works just fine with a minimal change to subclasses, mainly adding @Test and a c'tor with a Locale parameter. Total, it adds probably a minute to the test run. About the cross product of versions and locales. The @Parameterized thingy returns a list of Object[], where the elements of the list are matched against a c'tor. So if each object[] in your list has, say, an (int, float, int), then as long as you have a matching c'tor with a signature that takes an (int, float, int) you're good to go. So to handle the mXn case you mentioned, if your @Parameters method returned a list of object[], one object[] for each Locale, Version pair, you'd get all your Locales run against all your versions. Whether we *want* this to happen or not is another question. It's a worthwhile question whether we really *need* to run all the possible locales or if there's a subset of locales that would serve. It's kind of ironic that I have a patch waiting to be applied that cuts down on the time it takes to run the unit tests and another patch that adds to the time it takes. Two steps forward, one step back and a jink sideways just for fun. Best Erick Allow Junit4 tests in our environment. -- Key: LUCENE-2037 URL: https://issues.apache.org/jira/browse/LUCENE-2037 Project: Lucene - Java Issue Type: Improvement Components: Other Affects Versions: 3.1 Environment: Development Reporter: Erick Erickson Assignee: Erick Erickson Priority: Minor Fix For: 3.1 Attachments: junit-4.7.jar, LUCENE-2037.patch Original Estimate: 8h Remaining Estimate: 8h Now that we're dropping Java 1.4 compatibility for 3.0, we can incorporate Junit4 in testing. 
Junit3 and junit4 tests can coexist, so no tests should have to be rewritten. We should start this for the 3.1 release so we can get a clean 3.0 out smoothly. It's probably worthwhile to convert a small set of tests as an exemplar. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
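Erick's m x n point above can be sketched with JUnit 4 directly. Everything here is hypothetical -- the class name, the String stand-in for Lucene's real Version handling, and the sample test body are illustrations, not the patch: each Object[] returned by the @Parameters method is matched against the two-argument constructor, so every (Locale, version) pair becomes one run of every @Test method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

// Hypothetical sketch of the Locale x Version cross product described above.
@RunWith(Parameterized.class)
public class LocalizedVersionTest {
    private final Locale locale;
    private final String luceneVersion; // stand-in for the real Version enum

    // JUnit matches each Object[] from data() against this constructor.
    public LocalizedVersionTest(Locale locale, String luceneVersion) {
        this.locale = locale;
        this.luceneVersion = luceneVersion;
    }

    @Parameters
    public static List<Object[]> data() {
        List<Object[]> pairs = new ArrayList<Object[]>();
        for (Locale l : Locale.getAvailableLocales()) {
            for (String v : new String[] {"2.9", "3.0", "3.1"}) {
                pairs.add(new Object[] {l, v}); // one run per (Locale, version) pair
            }
        }
        return pairs;
    }

    @Test
    public void numberFormattingRoundTrips() {
        // Whatever locale-sensitive behavior the real test exercises goes here.
        String s = String.format(locale, "%d", 1024);
        org.junit.Assert.assertNotNull(s);
    }
}
```

As Erick notes, whether we *want* the full cross product is a separate question: with ~150 available locales and 3 versions, every @Test method runs ~450 times.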
[jira] Commented: (LUCENE-2037) Allow Junit4 tests in our environment.
[ https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780388#action_12780388 ] Robert Muir commented on LUCENE-2037: - {quote} It's a worthwhile question whether we really need to run all the possible locales or if there's a subset of locales that would serve. {quote} I won't rant too much on this, except to say that before this LocalizedTestCase, various parts failed under say, only Korean, or only Thai locale; it was always a corner case. I think it's important that someone from say Korea can download lucene source code and run 'ant test'. How else are they supposed to contribute if this does not work? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780430#action_12780430 ] Mark Miller commented on LUCENE-1606: - So Robert - what do you think about paring down the automaton lib, and shoving all this in core? I want it, I want, I want it :) Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: 3.1 Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable). Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms. Some use cases I envision: 1. lexicography/etc on large text corpora 2. looking for things such as urls where the prefix is not constant (http:// or ftp://) The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter enumerates terms in a special way, by using the underlying state machine. Here is my short description from the comments: The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do: 1. Look at the portion that is OK (did not enter a reject state in the DFA) 2. Generate the next possible String and seek to that. the Query simply wraps the filter with ConstantScoreQuery. 
I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
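The two-step enumeration in the issue description above can be illustrated with a toy sketch. This is not the brics-based patch code: a TreeSet stands in for the sorted term dictionary, and java.util.regex's hitEnd() stands in for asking the DFA whether a prefix has entered a reject state (after a failed match, hitEnd() is true exactly when more input might still have matched).

```java
import java.util.Arrays;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of DFA-guided term enumeration: instead of a binary
// accept/reject over every term, find the longest prefix the automaton has
// not rejected, then compute the next possible string and seek directly to
// it, skipping the entire rejected subtree of the term dictionary.
public class DfaSeekSketch {
    // A viable prefix is one the "DFA" has not rejected yet.
    static boolean viablePrefix(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.matches() || m.hitEnd();
    }

    static TreeSet<String> matchingTerms(TreeSet<String> terms, Pattern p) {
        TreeSet<String> hits = new TreeSet<String>();
        String t = terms.isEmpty() ? null : terms.first();
        while (t != null) {
            if (p.matcher(t).matches()) {
                hits.add(t);             // step 1: accepted, collect and advance
                t = terms.higher(t);
            } else if (viablePrefix(p, t)) {
                t = terms.higher(t);     // could still lead somewhere, advance
            } else {
                // step 2: find the longest viable prefix, bump the character
                // after it, and seek -- everything below the dead prefix skips
                int i = t.length() - 1;
                while (i > 0 && !viablePrefix(p, t.substring(0, i))) i--;
                t = terms.ceiling(t.substring(0, i) + (char) (t.charAt(i) + 1));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<String>(
                Arrays.asList("aa", "ab", "abq", "b", "zz"));
        System.out.println(matchingTerms(terms, Pattern.compile("ab[a-z]*")));
        // prints [ab, abq]; "aa" triggers a seek to "ab", "b" a seek past "zz"
    }
}
```

The real patch gets the "next possible string" from the brics automaton's state machine rather than by probing prefixes, but the seek-instead-of-scan structure is the same.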
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780439#action_12780439 ] Robert Muir commented on LUCENE-1606: - By the way Mark, in case you are interested, the TermEnum here still has problems with 'kleene star' as I have mentioned many times. So wildcard of ?abacadaba is fast, wildcard of *abacadaba is still slow in the same manner; regex of .abacadaba is fast, regex of .*abacadaba is still slow. But there are algorithms to reverse an entire DFA, so you could use ReverseStringFilter and support wildcards AND regexps with leading *. I didn't implement this here yet, though. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
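The reversed-term trick Robert mentions can be illustrated with a toy (ReverseStringFilter is a real Lucene contrib filter; everything else here is a hypothetical stand-in): reversing terms at index time turns a leading-* suffix query into a prefix enumeration over one contiguous slice of the sorted term dictionary.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Toy illustration: a query with a leading * cannot exploit the sorted term
// dictionary, but if terms are indexed reversed, "*abacadaba" becomes the
// prefix query "abadacaba*" over the reversed terms, which enumerates only
// one contiguous range instead of scanning every term.
public class ReversedTermSketch {
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Enumerate reversed terms that start with reverse(suffix).
    static SortedSet<String> suffixMatches(TreeSet<String> reversedTerms, String suffix) {
        String prefix = reverse(suffix);
        // All strings starting with `prefix` sort between prefix (inclusive)
        // and prefix + '\uffff' (exclusive).
        return reversedTerms.subSet(prefix, prefix + '\uffff');
    }

    public static void main(String[] args) {
        TreeSet<String> reversed = new TreeSet<String>();
        for (String t : new String[] {"abacadaba", "superabacadaba", "abacus", "zebra"}) {
            reversed.add(reverse(t)); // what ReverseStringFilter would do at index time
        }
        // *abacadaba -> every term ending in "abacadaba"
        for (String r : suffixMatches(reversed, "abacadaba")) {
            System.out.println(reverse(r)); // un-reverse for display
        }
    }
}
```

As Robert clarifies in the follow-up comment, you still have to index the terms in reversed order for this to work; the payoff is that the same reversed field then speeds up both leading-* wildcards and leading-.* regexes.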
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780440#action_12780440 ] Mark Miller edited comment on LUCENE-1606 at 11/20/09 5:10 AM: --- {quote} By the way Mark, in case you are interested, the TermEnum here still has problems with 'kleene star' as I have mentioned many times. So wildcard of ?abacadaba is fast, wildcard of *abacadaba is still slow in the same manner, regex of .abacadaba is fast, wildcard of .*abacadaba is still slow. {quote} No problem in my mind - nothing the current WildcardQuery doesn't face. Any reason we wouldn't want to replace the current WCQ with this? {quote} but there are algorithms to reverse an entire dfa, so you could use ReverseStringFilter and support wildcards AND regexps with leading * I didnt implement this here though yet. {quote} Now that sounds interesting - not sure I fully understand you though - are you saying we can do a prefix match, but without having to index terms reversed in the index? That would be very cool. was (Author: markrmil...@gmail.com): {code} By the way Mark, in case you are interested, the TermEnum here still has problems with 'kleene star' as I have mentioned many times. So wildcard of ?abacadaba is fast, wildcard of *abacadaba is still slow in the same manner, regex of .abacadaba is fast, wildcard of .*abacadaba is still slow. {code} No problem in my mind - nothing the current WildcardQuery doesn't face. Any reason we wouldn't want to the current WCQ that with this? {quote} but there are algorithms to reverse an entire dfa, so you could use ReverseStringFilter and support wildcards AND regexps with leading * I didnt implement this here though yet. {quote} Now that sounds interesting - now sure I fully understand you though - are you saying we can do a prefix match, but without having to index terms reversed in the index? That would be very cool.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780440#action_12780440 ] Mark Miller commented on LUCENE-1606: - {code} By the way Mark, in case you are interested, the TermEnum here still has problems with 'kleene star' as I have mentioned many times. So wildcard of ?abacadaba is fast, wildcard of *abacadaba is still slow in the same manner, regex of .abacadaba is fast, wildcard of .*abacadaba is still slow. {code} No problem in my mind - nothing the current WildcardQuery doesn't face. Any reason we wouldn't want to the current WCQ that with this? {quote} but there are algorithms to reverse an entire dfa, so you could use ReverseStringFilter and support wildcards AND regexps with leading * I didnt implement this here though yet. {quote} Now that sounds interesting - now sure I fully understand you though - are you saying we can do a prefix match, but without having to index terms reversed in the index? That would be very cool. Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: 3.1 Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable). Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms. Some use cases I envision: 1. 
lexicography/etc on large text corpora 2. looking for things such as urls where the prefix is not constant (http:// or ftp://) The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter enumerates terms in a special way, by using the underlying state machine. Here is my short description from the comments: The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do: 1. Look at the portion that is OK (did not enter a reject state in the DFA) 2. Generate the next possible String and seek to that. the Query simply wraps the filter with ConstantScoreQuery. I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
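The accept/seek enumeration described in the issue can be sketched with a toy hand-built DFA (a simplified illustration of the idea only; the patch itself drives a BRICS automaton against Lucene's TermEnum, and the class and method names below are made up for this sketch):

```java
import java.util.Arrays;
import java.util.List;

public class DfaTermScan {
    // Toy DFA for the regex "ab*a": state 0 = start, 2 = accept, -1 = reject.
    // TRANS[state][symbol] with symbols a=0, b=1; any other char rejects.
    static final int[][] TRANS = { { 1, -1 }, { 2, 1 }, { -1, -1 } };
    static final boolean[] ACCEPT = { false, false, true };

    static int step(int state, char c) {
        int sym = c - 'a';
        return (sym == 0 || sym == 1) ? TRANS[state][sym] : -1;
    }

    // True if the whole term is accepted by the DFA.
    static boolean accepts(String term) {
        int state = 0;
        for (int i = 0; i < term.length() && state != -1; i++)
            state = step(state, term.charAt(i));
        return state != -1 && ACCEPT[state];
    }

    // Length of the longest prefix that has not yet entered the reject state.
    // A real TermEnum would use this to construct the "next possible" term
    // and seek the term dictionary there, instead of visiting every term.
    static int okPrefix(String term) {
        int state = 0;
        for (int i = 0; i < term.length(); i++) {
            state = step(state, term.charAt(i));
            if (state == -1) return i;
        }
        return term.length();
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("aa", "aba", "abba", "abc", "ba");
        for (String t : terms)
            System.out.println(t + " -> " + (accepts(t)
                ? "match" : "reject after prefix of length " + okPrefix(t)));
    }
}
```

The key point is `okPrefix`: because the DFA reports *where* a candidate term went wrong, the enumeration can skip the whole block of terms sharing that doomed prefix rather than testing each one, which is what makes this scale without a constant prefix.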
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780441#action_12780441 ] Robert Muir commented on LUCENE-1606: - bq. No problem in my mind - nothing the current WildcardQuery doesn't face. Any reason we wouldn't want to replace the current WCQ with this? I don't think there is any issue. By implementing WildcardQuery with the DFA, a leading ? is no longer a problem; I mean, depending on your term dictionary, if you do something stupid like ???abacadaba it probably won't be that fast. I spent a lot of time with the worst-case regexes and wildcards to ensure performance is at least as good as the other alternatives. There is only one exception: the leading * wildcard is a bit slower with a DFA than if you ran it with the actual WildcardQuery (less than 5% in my tests). Because of this, the patch currently rewrites this very special case to a standard WildcardQuery. bq. Now that sounds interesting - not sure I fully understand you though - are you saying we can do a prefix match, but without having to index terms reversed in the index? That would be very cool. No, what I am saying is that you still have to index the terms in reversed order for the leading * or .* case, except then this reversing buys you faster wildcard AND regex queries :)
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780441#action_12780441 ] Robert Muir edited comment on LUCENE-1606 at 11/20/09 5:16 AM: --- bq. No problem in my mind - nothing the current WildcardQuery doesn't face. Any reason we wouldn't want to replace the current WCQ with this? I don't think there is any issue. By implementing WildcardQuery with the DFA, a leading ? is no longer a problem; I mean, depending on your term dictionary, if you do something stupid like ???abacadaba it probably won't be that fast. I spent a lot of time with the worst-case regexes and wildcards to ensure performance is at least as good as the other alternatives. There is only one exception: the leading * wildcard is a bit slower with a DFA than if you ran it with the actual WildcardQuery (less than 5% in my tests). Because of this, the patch currently rewrites this very special case to a standard WildcardQuery. bq. Now that sounds interesting - not sure I fully understand you though - are you saying we can do a prefix match, but without having to index terms reversed in the index? That would be very cool. No, what I am saying is that you still have to index the terms in reversed order for the leading * or .* case, except then this reversing buys you faster wildcard AND regex queries :)
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780445#action_12780445 ] Mark Miller commented on LUCENE-1606: - Okay - I still don't think it's an issue - leading wildcards are already an issue, and 5% is worth the other speedups I think - though you've taken care of that anyway - so it sounds like gold to me. I didn't expect this to solve leading-wildcard issues, so no loss to me. bq. No, what I am saying is that you still have to index the terms in reversed order for the leading * or .* case, except then this reversing buys you faster wildcard AND regex queries bummer :) Does it make sense to implement it here though? Isn't ReverseStringFilter enough if a user wants to go this route? Solr's support for this is fairly good, but I don't think it needs to be as 'built in' for Lucene?
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780447#action_12780447 ] Robert Muir commented on LUCENE-1606: - bq. Does it make sense to implement it here though? I do not think so. I tested another solution where users wanted leading * wildcards on a 100M+ term dictionary. I found out that what was acceptable was for * to actually match .{0,3} (between 0 and 3 of anything), and rewrote it to an equivalent regex like this. This performed very well, because it can still avoid comparing many terms.
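Robert's bounded-gap tradeoff is easy to demonstrate with java.util.regex (the helper, pattern, and terms below are made-up examples purely for illustration; the patch does the equivalent rewrite at the automaton level):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class BoundedLeadingWildcard {
    // Rewrite a leading-* wildcard like "*abacadaba" so the * matches at
    // most 3 characters, i.e. the regex ".{0,3}abacadaba". A DFA built from
    // this pattern only has to explore short prefixes, so the term
    // enumeration can still skip large ranges of the term dictionary.
    static Pattern rewrite(String wildcard) {
        if (!wildcard.startsWith("*"))
            throw new IllegalArgumentException("expected a leading *");
        return Pattern.compile(".{0,3}" + Pattern.quote(wildcard.substring(1)));
    }

    public static void main(String[] args) {
        Pattern p = rewrite("*cat");
        for (String term : Arrays.asList("cat", "abcat", "wildcat", "muskrat"))
            System.out.println(term + " -> " + p.matcher(term).matches());
    }
}
```

Note the tradeoff the rewrite makes: `wildcat` no longer matches `*cat`, because its gap ("wild") is longer than 3 characters - acceptable for the users Robert describes, in exchange for avoiding a full term scan.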
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780447#action_12780447 ] Robert Muir edited comment on LUCENE-1606 at 11/20/09 5:31 AM: --- bq. Does it make sense to implement it here though? I do not think so. I tested another solution where users wanted leading * wildcards on a 100M+ term dictionary. I found out that what was acceptable (clarification: to these specific users/system) was for * to actually match .{0,3} (between 0 and 3 of anything), and rewrote it to an equivalent regex like this. This performed very well, because it can still avoid comparing many terms.
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780452#action_12780452 ] Mark Miller commented on LUCENE-1606: - That is a cool tradeoff to be able to make.
Re: Solr 1.5 or 2.0?
On Fri, Nov 20, 2009 at 6:30 AM, Ryan McKinley ryan...@gmail.com wrote: On Nov 19, 2009, at 3:34 PM, Mark Miller wrote: Ryan McKinley wrote: I would love to set goals that are ~3 months out so that we don't have another 1-year release cycle. For a 2.0 release where we could have more back-compatibility flexibility, I would love to see some work that may be too ambitious... In particular, the config spaghetti needs some attention. I don't see the need to increment solr to 2.0 for the lucene 3.0 change -- of course that needs to be noted, but incrementing the major number in solr only makes sense if we are going to change *solr* significantly. Lucene major numbers don't work that way, and I don't think Solr needs to work that way by default. I think major numbers are better for indicating backwards-compat issues than major features, with the way these projects work. Which is why Yonik mentions 1.5 with weaker back compat - it's not just the fact that we are going to Lucene 3.x - it's that Solr still relies on some of the APIs that won't be around in 3.x - they are not all trivial to remove, or to remove while preserving back compat. I confess I don't know the details of the changes that have not yet been integrated in solr -- the only lucene changes I am familiar with are what was required for solr 1.4. The lucene 2.x - 3.0 upgrade path seems independent of that to me. I would even argue that with solr 1.4 we have already required many lucene 3.0 changes -- all my custom lucene stuff had to be reworked to work with solr 1.4 (tokenizers, multi-reader filters). Many - but certainly not all. Just my luck... I'm batting 1000 :) But that means my code can upgrade to 3.0 without an issue now! In general, I wonder where the solr back-compatibility contract applies (and to what degree). For solr, I would rank the importance as: #1 - the URL API syntax.
Client query parameters should change as little as possible. #2 - configuration. #3 - java APIs. Someone else would likely rank it differently - not everyone using Solr even uses HTTP with it. Someone heavily involved in custom plugins might care more about that than config. As a dev, I just plainly rank them all as important and treat them on a case-by-case basis. I think it is fair to suggest that people will have the most stable/consistent/seamless upgrade path if you stick to the HTTP API (and by extension most of the solrj API). I am not suggesting that the java APIs are not important or that back-compatibility is not important. Solr has some APIs with a clear purpose, place, and intended use -- we need to take these very seriously. We also have lots of APIs that are half-baked and loosey-goosey. If a developer is working on the edges, I think it is fair to expect more hiccups in the upgrade path. With that in mind, I think 'solr 1.5 with lucene 3.x' makes the most sense. Unless we are making serious changes to Solr that would warrant a major release bump, solr 1.5 with lucene 3.x is a good option. Solr 2.0 can have non-back-compat changes for Solr itself, e.g. removing the single-core option, changing configuration, REST API changes, etc. What is a serious change that would warrant a bump in your opinion? For example: - config overhaul: detangle the XML from the components, perhaps using spring. This is already done. No components read config from xml anymore (SOLR-1198). - major URL request changes: perhaps we change things to be more RESTful -- perhaps let jersey take care of the URL/request building https://jersey.dev.java.net/ - perhaps OSGi support/control/configuration. Lucene has an explicit back-compatibility contract: http://wiki.apache.org/lucene-java/BackwardsCompatibility I don't know if solr has one...
if we make one, I would like it to focus on the URL syntax + configuration. It's not nice to give people plugins and then not worry about back compat for them :) I want to be nice. I just think that a different back-compatibility contract applies for Solr than for Lucene. It seems reasonable to consider the HTTP API, configs, and java API independently. From my perspective, saying solr 1.5 uses lucene 3.0 implies everything a plugin developer using lucene APIs needs to know about the changes. To be clear, I am not against bumping to solr 2.0 -- I just have high aspirations (yet little time) for what a 2.0 bump could mean for solr. ryan -- Noble Paul | Principal Engineer | AOL | http://aol.com
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780471#action_12780471 ] Robert Muir commented on LUCENE-1606: - bq. That is a cool tradeoff to be able to make. Mark, yes. I guess someone could implement the DFA reversing if they wanted to, to enable leading .* regex support with ReverseStringFilter. You can still use this wildcard impl with ReverseStringFilter just like the core wildcard impl, because it's just so easy to reverse a wildcard string. But you don't want to try to reverse a regular expression! That would be hairy; it's easier to reverse a DFA. But even without this, there are tons of workarounds, like the tradeoff I mentioned earlier. Also, another one that might not be apparent is that it's only the leading .* that is a problem, depending on the corpus of course. [a-z].*abacadaba will avoid visiting terms that start with 1, 2, 3 or are in Chinese, etc., which might be a nice improvement. Of course, if all your terms start with a-z, then it's going to be the same as entering .*abacadaba, and be bad. It all depends on how selective the regular expression is with respect to your terms.
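As a rough sketch of why reversing a wildcard string is "just so easy" (a toy illustration; it ignores escaped wildcard characters, and the field names mentioned in the comment are made-up examples):

```java
public class WildcardReverse {
    // Reverse a wildcard pattern character by character; * and ? survive
    // reversal unchanged, so a slow leading-* query can be turned into a
    // fast constant-prefix query against a field indexed with
    // ReverseStringFilter. (Escapes like \* would need extra handling.)
    static String reverseWildcard(String pattern) {
        return new StringBuilder(pattern).reverse().toString();
    }

    public static void main(String[] args) {
        // "*abacadaba" on a "body" field becomes "abadacaba*" on a
        // hypothetical "body_reversed" field.
        System.out.println(reverseWildcard("*abacadaba"));
    }
}
```

A regular expression, by contrast, cannot be reversed by flipping its characters - "(ab)+c" read backwards is not even a valid pattern - which is why Robert suggests reversing the DFA itself rather than the pattern string.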