Re: Solr 1.5 or 2.0?

2009-11-19 Thread Simon Willnauer
On Thu, Nov 19, 2009 at 2:53 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 What should the next version of Solr be?

 Options:
 - have a Solr 1.5 with a lucene 2.9.x
 - have a Solr 1.5 with a lucene 3.x, with weaker back compat given all
 of the removed lucene deprecations from 2.9->3.0
 - have a Solr 2.0 with a lucene 3.x

My first feeling is that Solr 2.0 with Lucene 3.x would be a clean
cut. What is your back compat policy for major version jumps?


 -Yonik
 http://www.lucidimagination.com

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-19 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2075:
--

Attachment: LUCENE-2075.patch

Updated patch: adds the missing @Overrides we added in 3.0, and also makes the 
private PQ implement Iterable; the markAndSweep code is now syntactic sugar :-)

 Share the Term -> TermInfo cache across threads
 ---

 Key: LUCENE-2075
 URL: https://issues.apache.org/jira/browse/LUCENE-2075
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Priority: Minor
 Fix For: 3.1

 Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
 LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch


 Right now each thread creates its own (thread private) SimpleLRUCache,
 holding up to 1024 terms.
 This is rather wasteful, since if there are a high number of threads
 that come through Lucene, you're multiplying the RAM usage.  You're
 also cutting way back on likelihood of a cache hit (except the known
 multiple times we lookup a term within-query, which uses one thread).
 In NRT search we open new SegmentReaders (on tiny segments) often,
 which each thread must then spend CPU/RAM creating & populating.
 Now that we are on 1.5 we can use java.util.concurrent.*, eg
 ConcurrentHashMap.  One simple approach could be a double-barrel LRU
 cache, using 2 maps (primary, secondary).  You check the cache by
 first checking primary; if that's a miss, you check secondary and if
 you get a hit you promote it to primary.  Once primary is full you
 clear secondary and swap them.
 Or... any other suggested approach?
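A minimal sketch of the double-barrel idea described above, with hypothetical
class and method names (an illustration of the approach, not the attached patch):
{code}
import java.util.concurrent.ConcurrentHashMap;

class DoubleBarrelCache<K, V> {
  private final int maxSize;
  private volatile ConcurrentHashMap<K, V> primary = new ConcurrentHashMap<K, V>();
  private volatile ConcurrentHashMap<K, V> secondary = new ConcurrentHashMap<K, V>();

  DoubleBarrelCache(int maxSize) {
    this.maxSize = maxSize;
  }

  public V get(K key) {
    V v = primary.get(key);
    if (v == null) {
      v = secondary.get(key);
      if (v != null) {
        primary.put(key, v); // promote a secondary hit to primary
      }
    }
    return v;
  }

  public synchronized void put(K key, V value) {
    if (primary.size() >= maxSize) {
      // primary is full: clear secondary and swap, so the old primary
      // survives one more generation as secondary
      secondary.clear();
      ConcurrentHashMap<K, V> tmp = secondary;
      secondary = primary;
      primary = tmp;
    }
    primary.put(key, value);
  }
}
{code}
(Synchronization is kept coarse here for clarity; a real implementation would
need to think harder about readers racing the swap.)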

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: [jira] Commented: (LUCENE-1799) Unicode compression

2009-11-19 Thread Steven A Rowe
Hi Robert,

On 11/18/2009 at 7:16 PM, Robert Muir wrote:
 Looking at the collation support, we could maybe improve
 IndexableBinaryStringTools by using char[]/byte[] with offset and
 length. The existing ByteBuffer/CharBuffer methods could stay, they are
 consistent with Charset api and are not wrong imo, but instead defer to
 the new char[]/byte[] ones... the current buffer-based ones require the
 buffer to have a backing array anyway or will throw an exception.

+1

I used *Buffers because I thought it simplified method prototypes, no other 
reason.

Steve



RE: Solr 1.5 or 2.0?

2009-11-19 Thread Uwe Schindler
We also had some (maybe helpful) opinions :-)

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: Thursday, November 19, 2009 3:31 PM
 To: java-dev@lucene.apache.org
 Subject: Re: Solr 1.5 or 2.0?
 
 Oops... of course I meant to post this in solr-dev.
 
 -Yonik
 http://www.lucidimagination.com
 
 On Wed, Nov 18, 2009 at 8:53 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
  What should the next version of Solr be?
 
  Options:
  - have a Solr 1.5 with a lucene 2.9.x
  - have a Solr 1.5 with a lucene 3.x, with weaker back compat given all
  of the removed lucene deprecations from 2.9->3.0
  - have a Solr 2.0 with a lucene 3.x
 
  -Yonik
  http://www.lucidimagination.com
 
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Solr 1.5 or 2.0?

2009-11-19 Thread Noble Paul നോബിള്‍ नोब्ळ्
Option 3 looks best. But do we plan to remove anything we have not
already marked as deprecated?

On Thu, Nov 19, 2009 at 8:10 PM, Uwe Schindler u...@thetaphi.de wrote:
 We also had some (maybe helpful) opinions :-)

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: Thursday, November 19, 2009 3:31 PM
 To: java-dev@lucene.apache.org
 Subject: Re: Solr 1.5 or 2.0?

 Oops... of course I meant to post this in solr-dev.

 -Yonik
 http://www.lucidimagination.com

 On Wed, Nov 18, 2009 at 8:53 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
  What should the next version of Solr be?
 
  Options:
  - have a Solr 1.5 with a lucene 2.9.x
  - have a Solr 1.5 with a lucene 3.x, with weaker back compat given all
  of the removed lucene deprecations from 2.9->3.0
  - have a Solr 2.0 with a lucene 3.x
 
  -Yonik
  http://www.lucidimagination.com
 

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-1799) Unicode compression

2009-11-19 Thread Robert Muir
Steven, do you still have a test setup to measure collation key generation
performance with Lucene?

On Thu, Nov 19, 2009 at 9:38 AM, Steven A Rowe sar...@syr.edu wrote:

 Hi Robert,

 On 11/18/2009 at 7:16 PM, Robert Muir wrote:
  Looking at the collation support, we could maybe improve
  IndexableBinaryStringTools by using char[]/byte[] with offset and
  length. The existing ByteBuffer/CharBuffer methods could stay, they are
  consistent with Charset api and are not wrong imo, but instead defer to
  the new char[]/byte[] ones... the current buffer-based ones require the
  buffer to have a backing array anyway or will throw an exception.

 +1

 I used *Buffers because I thought it simplified method prototypes, no other
 reason.

 Steve




-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (LUCENE-1799) Unicode compression

2009-11-19 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780129#action_12780129
 ] 

DM Smith commented on LUCENE-1799:
--

The sample code is probably what is on this page, here:
http://unicode.org/notes/tn6/#Sample_Code

From what I gather reading the whole page:
If we port the sample code and the test case, and then demonstrate 
that all tests pass, then we will be granted a license.

There's contact info at the bottom of the page for getting the license. Maybe 
contact them for clarification?

As the code is fairly small, I don't think it would be too hard to port. The 
trick is that the sample code appears to deal in 32-bit arrays and we'd 
probably want a byte[].

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor

 In lucene-1793, there is the off-topic suggestion to provide compression of 
 Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
 original supposition was that it provided a more compact index.
 This led to the comment that a different or compressed encoding would be a 
 generally useful feature. 
 BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
 with an implementation in ICU. If Lucene provides its own implementation, a 
 freely available, royalty-free license would need to be obtained.
 SCSU is another Unicode compression algorithm that could be used. 
 An advantage of these methods is that they work on the whole of Unicode. If 
 that is not needed an encoding such as iso8859-1 (or whatever covers the 
 input) could be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: [jira] Commented: (LUCENE-1799) Unicode compression

2009-11-19 Thread Steven A Rowe
Hi Robert,

Ack, actually two days ago I updated my Lucene trunk checkout and removed that 
code, thinking its utility had evaporated!

But maybe IntelliJ will save my bacon in its local history cache.  (Praise 
IntelliJ!)  I'll check tonight when I get home.

Steve

On 11/19/2009 at 10:16 AM, Robert Muir wrote:
 Steven, do you still have a test setup to measure collation key
 generation performance with Lucene?

 On Thu, Nov 19, 2009 at 9:38 AM, Steven A Rowe sar...@syr.edu wrote:

   Hi Robert,

   On 11/18/2009 at 7:16 PM, Robert Muir wrote:
    Looking at the collation support, we could maybe improve
    IndexableBinaryStringTools by using char[]/byte[] with offset and
    length. The existing ByteBuffer/CharBuffer methods could stay, they are
    consistent with Charset api and are not wrong imo, but instead defer to
    the new char[]/byte[] ones... the current buffer-based ones require the
    buffer to have a backing array anyway or will throw an exception.

   +1

   I used *Buffers because I thought it simplified method prototypes, no
   other reason.

   Steve

 --
 Robert Muir
 rcm...@gmail.com




Re: [jira] Commented: (LUCENE-1799) Unicode compression

2009-11-19 Thread Robert Muir
Doh! Well, if you have it, that will be very handy for verification.
I'll create a separate issue for this shortly; maybe you can review the
patch.

Thanks,
Robert

On Thu, Nov 19, 2009 at 1:06 PM, Steven A Rowe sar...@syr.edu wrote:

 Hi Robert,

 Ack, actually two days ago I updated my Lucene trunk checkout and removed
 that code, thinking its utility had evaporated!

 But maybe IntelliJ will save my bacon in its local history cache.  (Praise
 IntelliJ!)  I'll check tonight when I get home.

 Steve

 On 11/19/2009 at 10:16 AM, Robert Muir wrote:
  Steven, do you still have a test setup to measure collation key
  generation performance with Lucene?
 
 
  On Thu, Nov 19, 2009 at 9:38 AM, Steven A Rowe sar...@syr.edu wrote:
 
 
Hi Robert,
 
 
    On 11/18/2009 at 7:16 PM, Robert Muir wrote:
     Looking at the collation support, we could maybe improve
     IndexableBinaryStringTools by using char[]/byte[] with offset and
     length. The existing ByteBuffer/CharBuffer methods could stay, they are
     consistent with Charset api and are not wrong imo, but instead defer to
     the new char[]/byte[] ones... the current buffer-based ones require the
     buffer to have a backing array anyway or will throw an exception.
 
 
+1
 
I used *Buffers because I thought it simplified method
  prototypes, no other reason.
 
Steve
 
 
 
 
 
 
  --
  Robert Muir
  rcm...@gmail.com





-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

2009-11-19 Thread David Kaelbling (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780193#action_12780193
 ] 

David Kaelbling commented on LUCENE-2039:
-

I apologize if I haven't read the comments carefully enough, but in 
LUCENE-2039_field_ext.patch why is ExtendableQueryParser final?  That means 
(for example) that ComplexPhraseQueryParser cannot subclass it.  In the earlier 
LUCENE-2039.patch the complex phrase parser picked up the changes for free.


 Regex support and beyond in JavaCC QueryParser
 --

 Key: LUCENE-2039
 URL: https://issues.apache.org/jira/browse/LUCENE-2039
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Reporter: Simon Willnauer
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch


 Since the early days the standard query parser was limited to the queries 
 living in core; adding other queries or extending the parser in any way 
 always forced people to change the grammar file and regenerate. Even if you 
 change the grammar, you have to be extremely careful how you modify the parser 
 so that other parts of the standard parser are not affected by customisation 
 changes. Eventually you had to live with all the limitations the current 
 parser has, like tokenizing on whitespace before a tokenizer / analyzer has 
 the chance to look at the tokens. 
 I was thinking about how to overcome the limitation and add regex support to 
 the query parser without introducing any dependency to core. I added a new 
 special character that basically prevents the parser from interpreting any of 
 the characters enclosed in the new special characters. I chose the forward 
 slash '/' as the delimiter, so that everything in between two forward slashes 
 is basically escaped and ignored by the parser. All chars embedded within 
 forward slashes are treated as one token, even if the token contains other 
 special chars like * []?{} or whitespaces. This token is subsequently passed 
 to a pluggable parser extension which builds a query from the embedded 
 string. I do not interpret the embedded string in any way, but leave all the 
 subsequent work to the parser extension. Such an extension could be another 
 full featured query parser itself or simply a ctor call for a regex query. 
 The interface remains quite simple but makes the parser extensible in an easy 
 way compared to modifying the JavaCC sources.
 The downside of this patch is clearly that I introduce a new special char 
 into the syntax, but I guess that would not be that much of a deal, as it is 
 reflected in the escape method. It would truly be nice to have more than one 
 extension and have this even more flexible, so treat this patch as a kickoff.
 Another way of solving the problem with RegexQuery would be to move the JDK 
 version of regex into core and simply have another method like:
 {code}
 protected Query newRegexQuery(Term t) {
   ... 
 }
 {code}
 which I would like better, as it would be more consistent with the idea of 
 the query parser being a very strict and defined parser.
 I will upload a patch in a second which implements the extension-based 
 approach; I guess I will add a second patch with regex in core soon, too.
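A hypothetical sketch of the extension contract described above; the names
(ParserExtension, RegexParserExtensionSketch) are illustrative only, not
necessarily the API in the attached patches:
{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.regex.RegexQuery; // contrib/regex

interface ParserExtension {
  /** Builds a query from the raw string the parser found between '/' delimiters. */
  Query parse(String field, String rawExtensionText);
}

class RegexParserExtensionSketch implements ParserExtension {
  public Query parse(String field, String rawExtensionText) {
    // The embedded string is handed over uninterpreted; here it simply
    // becomes a contrib RegexQuery.
    return new RegexQuery(new Term(field, rawExtensionText));
  }
}
{code}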

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation

2009-11-19 Thread Robert Muir (JIRA)
remove Byte/CharBuffer wrapping for collation key generation


 Key: LUCENE-2084
 URL: https://issues.apache.org/jira/browse/LUCENE-2084
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1
 Attachments: LUCENE-2084.patch

We can remove the overhead of ByteBuffer and CharBuffer wrapping in 
CollationKeyFilter and ICUCollationKeyFilter.

This patch moves the logic in IndexableBinaryStringTools into (char[], int, int)- 
and (byte[], int, int)-based methods, with the previous Byte/CharBuffer methods 
delegating to these.
Previously, the Byte/CharBuffer methods required a backing array anyway.
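A minimal sketch of the delegation pattern, with hypothetical signatures (the
actual methods are in the attached patch):
{code}
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

final class BinaryStringToolsSketch {
  /** New primitive-array entry point: operates on (array, offset, length). */
  static void encode(byte[] input, int inputOffset, int inputLength,
                     char[] output, int outputOffset, int outputLength) {
    // ... encoding logic would live here ...
  }

  /** Old Buffer-based entry point, kept for compatibility: it already
      required array-backed buffers, so it can simply delegate. */
  static void encode(ByteBuffer input, CharBuffer output) {
    encode(input.array(), input.arrayOffset() + input.position(), input.remaining(),
           output.array(), output.arrayOffset() + output.position(), output.remaining());
  }
}
{code}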


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation

2009-11-19 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2084:


Attachment: LUCENE-2084.patch

 remove Byte/CharBuffer wrapping for collation key generation
 

 Key: LUCENE-2084
 URL: https://issues.apache.org/jira/browse/LUCENE-2084
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-2084.patch


 We can remove the overhead of ByteBuffer and CharBuffer wrapping in 
 CollationKeyFilter and ICUCollationKeyFilter.
 This patch moves the logic in IndexableBinaryStringTools into (char[], int, int)- 
 and (byte[], int, int)-based methods, with the previous Byte/CharBuffer methods 
 delegating to these.
 Previously, the Byte/CharBuffer methods required a backing array anyway.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2084) remove Byte/CharBuffer wrapping for collation key generation

2009-11-19 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2084:


Priority: Minor  (was: Major)

 remove Byte/CharBuffer wrapping for collation key generation
 

 Key: LUCENE-2084
 URL: https://issues.apache.org/jira/browse/LUCENE-2084
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2084.patch


 We can remove the overhead of ByteBuffer and CharBuffer wrapping in 
 CollationKeyFilter and ICUCollationKeyFilter.
 This patch moves the logic in IndexableBinaryStringTools into (char[], int, int)- 
 and (byte[], int, int)-based methods, with the previous Byte/CharBuffer methods 
 delegating to these.
 Previously, the Byte/CharBuffer methods required a backing array anyway.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

2009-11-19 Thread David Kaelbling (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780193#action_12780193
 ] 

David Kaelbling edited comment on LUCENE-2039 at 11/19/09 8:22 PM:
---

I apologize if I haven't read the comments carefully enough, but in 
LUCENE-2039_field_ext.patch why is ExtendableQueryParser final?  That means 
(for example) that ComplexPhraseQueryParser cannot subclass it.  In the earlier 
LUCENE-2039.patch the complex phrase parser picked up the changes for free.

And would RegexParserExtension maybe be easier to use if it set the 
RegexCapabilities on the new RegexQuery it is returning?


  was (Author: dkaelbl...@blackducksoftware.com):
I apologize if I haven't read the comments carefully enough, but in 
LUCENE-2039_field_ext.patch why is ExtendableQueryParser final?  That means 
(for example) that ComplexPhraseQueryParser cannot subclass it.  In the earlier 
LUCENE-2039.patch the complex phrase parser picked up the changes for free.

  
 Regex support and beyond in JavaCC QueryParser
 --

 Key: LUCENE-2039
 URL: https://issues.apache.org/jira/browse/LUCENE-2039
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Reporter: Simon Willnauer
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch


 Since the early days the standard query parser was limited to the queries 
 living in core; adding other queries or extending the parser in any way 
 always forced people to change the grammar file and regenerate. Even if you 
 change the grammar, you have to be extremely careful how you modify the parser 
 so that other parts of the standard parser are not affected by customisation 
 changes. Eventually you had to live with all the limitations the current 
 parser has, like tokenizing on whitespace before a tokenizer / analyzer has 
 the chance to look at the tokens. 
 I was thinking about how to overcome the limitation and add regex support to 
 the query parser without introducing any dependency to core. I added a new 
 special character that basically prevents the parser from interpreting any of 
 the characters enclosed in the new special characters. I chose the forward 
 slash '/' as the delimiter, so that everything in between two forward slashes 
 is basically escaped and ignored by the parser. All chars embedded within 
 forward slashes are treated as one token, even if the token contains other 
 special chars like * []?{} or whitespaces. This token is subsequently passed 
 to a pluggable parser extension which builds a query from the embedded 
 string. I do not interpret the embedded string in any way, but leave all the 
 subsequent work to the parser extension. Such an extension could be another 
 full featured query parser itself or simply a ctor call for a regex query. 
 The interface remains quite simple but makes the parser extensible in an easy 
 way compared to modifying the JavaCC sources.
 The downside of this patch is clearly that I introduce a new special char 
 into the syntax, but I guess that would not be that much of a deal, as it is 
 reflected in the escape method. It would truly be nice to have more than one 
 extension and have this even more flexible, so treat this patch as a kickoff.
 Another way of solving the problem with RegexQuery would be to move the JDK 
 version of regex into core and simply have another method like:
 {code}
 protected Query newRegexQuery(Term t) {
   ... 
 }
 {code}
 which I would like better, as it would be more consistent with the idea of 
 the query parser being a very strict and defined parser.
 I will upload a patch in a second which implements the extension-based 
 approach; I guess I will add a second patch with regex in core soon, too.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

2009-11-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780254#action_12780254
 ] 

Robert Muir commented on LUCENE-2039:
-

Hi, in my opinion RegexParserExtension should not be tied to 
RegexQuery/RegexCapabilities.
This is only one possible implementation of regex support, and it has some 
scalability problems.


 Regex support and beyond in JavaCC QueryParser
 --

 Key: LUCENE-2039
 URL: https://issues.apache.org/jira/browse/LUCENE-2039
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Reporter: Simon Willnauer
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch


 Since the early days the standard query parser was limited to the queries 
 living in core; adding other queries or extending the parser in any way 
 always forced people to change the grammar file and regenerate. Even if you 
 change the grammar, you have to be extremely careful how you modify the parser 
 so that other parts of the standard parser are not affected by customisation 
 changes. Eventually you had to live with all the limitations the current 
 parser has, like tokenizing on whitespace before a tokenizer / analyzer has 
 the chance to look at the tokens. 
 I was thinking about how to overcome the limitation and add regex support to 
 the query parser without introducing any dependency to core. I added a new 
 special character that basically prevents the parser from interpreting any of 
 the characters enclosed in the new special characters. I chose the forward 
 slash '/' as the delimiter, so that everything in between two forward slashes 
 is basically escaped and ignored by the parser. All chars embedded within 
 forward slashes are treated as one token, even if the token contains other 
 special chars like * []?{} or whitespaces. This token is subsequently passed 
 to a pluggable parser extension which builds a query from the embedded 
 string. I do not interpret the embedded string in any way, but leave all the 
 subsequent work to the parser extension. Such an extension could be another 
 full featured query parser itself or simply a ctor call for a regex query. 
 The interface remains quite simple but makes the parser extensible in an easy 
 way compared to modifying the JavaCC sources.
 The downside of this patch is clearly that I introduce a new special char 
 into the syntax, but I guess that would not be that much of a deal, as it is 
 reflected in the escape method. It would truly be nice to have more than one 
 extension and have this even more flexible, so treat this patch as a kickoff.
 Another way of solving the problem with RegexQuery would be to move the JDK 
 version of regex into core and simply have another method like:
 {code}
 protected Query newRegexQuery(Term t) {
   ... 
 }
 {code}
 which I would like better, as it would be more consistent with the idea of 
 the query parser being a very strict and defined parser.
 I will upload a patch in a second which implements the extension-based 
 approach; I guess I will add a second patch with regex in core soon, too.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

2009-11-19 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780258#action_12780258
 ] 

Simon Willnauer commented on LUCENE-2039:
-

bq. That means (for example) that ComplexPhraseQueryParser cannot subclass it
This patch was not meant to include ComplexPhraseQueryParser; it is rather a 
proposal for the concept of field overloading. But you are right, the parser 
should not be final at all; especially if you want to override a get*Query 
method, it should be extendable. 

bq. Hi, in my opinion RegexParserExtension should not be tied to 
RegexQuery/RegexCapabilities.
This is only one possible implementation of regex support and has some 
scalability problems. 

Also true, but again this is just a POC to show how it would look. 
Comments on the concept would be more useful by now. 
I wrote that up during a train ride and aimed to get some comments. I 
have already worked on it and will upload a new patch soon which includes 
RegexCapabilities + tests. 
Thanks again for the pointer about the final class.

 Regex support and beyond in JavaCC QueryParser
 --

 Key: LUCENE-2039
 URL: https://issues.apache.org/jira/browse/LUCENE-2039
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Reporter: Simon Willnauer
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch


 Since the early days the standard query parser was limited to the queries 
 living in core; adding other queries or extending the parser in any way 
 always forced people to change the grammar file and regenerate. Even if you 
 change the grammar, you have to be extremely careful how you modify the parser 
 so that other parts of the standard parser are not affected by customisation 
 changes. Eventually you had to live with all the limitations the current 
 parser has, like tokenizing on whitespace before a tokenizer / analyzer has 
 the chance to look at the tokens. 
 I was thinking about how to overcome the limitation and add regex support to 
 the query parser without introducing any dependency to core. I added a new 
 special character that basically prevents the parser from interpreting any of 
 the characters enclosed in the new special characters. I chose the forward 
 slash '/' as the delimiter, so that everything in between two forward slashes 
 is basically escaped and ignored by the parser. All chars embedded within 
 forward slashes are treated as one token, even if the token contains other 
 special chars like * []?{} or whitespaces. This token is subsequently passed 
 to a pluggable parser extension which builds a query from the embedded 
 string. I do not interpret the embedded string in any way, but leave all the 
 subsequent work to the parser extension. Such an extension could be another 
 full featured query parser itself or simply a ctor call for a regex query. 
 The interface remains quite simple but makes the parser extensible in an easy 
 way compared to modifying the JavaCC sources.
 The downside of this patch is clearly that I introduce a new special char 
 into the syntax, but I guess that would not be that much of a deal, as it is 
 reflected in the escape method. It would truly be nice to have more than one 
 extension and have this even more flexible, so treat this patch as a kickoff.
 Another way of solving the problem with RegexQuery would be to move the JDK 
 version of regex into core and simply have another method like:
 {code}
 protected Query newRegexQuery(Term t) {
   ... 
 }
 {code}
 which I would like better, as it would be more consistent with the idea of 
 the query parser being a very strict and defined parser.
 I will upload a patch in a second which implements the extension-based 
 approach; I guess I will add a second patch with regex in core soon, too.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

2009-11-19 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2039:


Attachment: LUCENE-2039_field_ext.patch

Updated the patch
 - removed final modifier from ExtendableQueryParser
 - added RegexCapabilities ctor to RegexParserExtension

I still need to work on the Extensions JavaDoc - and I'm not too happy with the 
name. 

Comments on the concept are very welcome.

 Regex support and beyond in JavaCC QueryParser
 --

 Key: LUCENE-2039
 URL: https://issues.apache.org/jira/browse/LUCENE-2039
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Reporter: Simon Willnauer
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch, 
 LUCENE-2039_field_ext.patch


 Since the early days the standard query parser was limited to the queries 
 living in core; adding other queries or extending the parser in any way 
 always forced people to change the grammar file and regenerate. Even if you 
 change the grammar, you have to be extremely careful how you modify the parser 
 so that other parts of the standard parser are not affected by customisation 
 changes. Eventually you had to live with all the limitations the current 
 parser has, like tokenizing on whitespace before a tokenizer / analyzer has 
 the chance to look at the tokens. 
 I was thinking about how to overcome the limitation and add regex support to 
 the query parser without introducing any dependency to core. I added a new 
 special character that basically prevents the parser from interpreting any of 
 the characters enclosed in the new special characters. I chose the forward 
 slash '/' as the delimiter, so that everything in between two forward slashes 
 is basically escaped and ignored by the parser. All chars embedded within 
 forward slashes are treated as one token, even if the token contains other 
 special chars like * []?{} or whitespaces. This token is subsequently passed 
 to a pluggable parser extension which builds a query from the embedded 
 string. I do not interpret the embedded string in any way, but leave all the 
 subsequent work to the parser extension. Such an extension could be another 
 full featured query parser itself or simply a ctor call for a regex query. 
 The interface remains quite simple but makes the parser extensible in an easy 
 way compared to modifying the JavaCC sources.
 The downside of this patch is clearly that I introduce a new special char 
 into the syntax, but I guess that would not be that much of a deal, as it is 
 reflected in the escape method. It would truly be nice to have more than one 
 extension and have this even more flexible, so treat this patch as a kickoff.
 Another way of solving the problem with RegexQuery would be to move the JDK 
 version of regex into core and simply have another method like:
 {code}
 protected Query newRegexQuery(Term t) {
   ... 
 }
 {code}
 which I would like better, as it would be more consistent with the idea of 
 the query parser being a very strict and defined parser.
 I will upload a patch in a second which implements the extension-based 
 approach; I guess I will add a second patch with regex in core soon, too.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2085) Update PayloadSpanUtil

2009-11-19 Thread Mark Miller (JIRA)
Update PayloadSpanUtil
--

 Key: LUCENE-2085
 URL: https://issues.apache.org/jira/browse/LUCENE-2085
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9.1
Reporter: Mark Miller
Assignee: Mark Miller




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

2009-11-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780316#action_12780316
 ] 

Mark Miller commented on LUCENE-2039:
-

It looks like the patch puts this in core? Any compelling reason? Offhand I'd 
think it would go in the misc contrib with the other queryparsers that extend 
the core queryparser.

 Regex support and beyond in JavaCC QueryParser
 --

 Key: LUCENE-2039
 URL: https://issues.apache.org/jira/browse/LUCENE-2039
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Reporter: Simon Willnauer
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch, 
 LUCENE-2039_field_ext.patch


 Since the early days the standard query parser was limited to the queries 
 living in core; adding other queries or extending the parser in any way 
 always forced people to change the grammar file and regenerate. Even if you 
 change the grammar, you have to be extremely careful how you modify the parser 
 so that other parts of the standard parser are not affected by customisation 
 changes. Eventually you had to live with all the limitations the current 
 parser has, like tokenizing on whitespace before a tokenizer / analyzer has 
 the chance to look at the tokens. 
 I was thinking about how to overcome the limitation and add regex support to 
 the query parser without introducing any dependency to core. I added a new 
 special character that basically prevents the parser from interpreting any of 
 the characters enclosed in the new special characters. I chose the forward 
 slash '/' as the delimiter, so that everything in between two forward slashes 
 is basically escaped and ignored by the parser. All chars embedded within 
 forward slashes are treated as one token, even if the token contains other 
 special chars like * []?{} or whitespaces. This token is subsequently passed 
 to a pluggable parser extension which builds a query from the embedded 
 string. I do not interpret the embedded string in any way, but leave all the 
 subsequent work to the parser extension. Such an extension could be another 
 full featured query parser itself or simply a ctor call for a regex query. 
 The interface remains quite simple but makes the parser extensible in an easy 
 way compared to modifying the JavaCC sources.
 The downside of this patch is clearly that I introduce a new special char 
 into the syntax, but I guess that would not be that much of a deal, as it is 
 reflected in the escape method. It would truly be nice to have more than one 
 extension and have this even more flexible, so treat this patch as a kickoff.
 Another way of solving the problem with RegexQuery would be to move the JDK 
 version of regex into core and simply have another method like:
 {code}
 protected Query newRegexQuery(Term t) {
   ... 
 }
 {code}
 which I would like better, as it would be more consistent with the idea of 
 the query parser being a very strict and defined parser.
 I will upload a patch in a second which implements the extension-based 
 approach; I guess I will add a second patch with regex in core soon, too.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Solr 1.5 or 2.0?

2009-11-19 Thread Ryan McKinley
I would love to set goals that are ~3 months out so that we don't have  
another 1 year release cycle.  For a 2.0 release where we could have  
more back-compatibility flexibility, I would love to see some work that  
may be too ambitious...  In particular, the config spaghetti needs  
some attention.


I don't see the need to increment solr to 2.0 for the lucene 3.0  
change -- of course that needs to be noted, but incrementing the major  
number in solr only makes sense if we are going to change *solr*  
significantly.


The lucene 2.x -> 3.0 upgrade path seems independent of that to me.  I  
would even argue that with solr 1.4 we have already required many  
lucene 3.0 changes -- All my custom lucene stuff had to be reworked to  
work with solr 1.4 (tokenizers & multi-reader filters).


In general, I wonder where the solr back-compatibility contract  
applies (and to what degree).  For solr, I would rank the importance as:
#1 - the URL API syntax.  Client query parameters should change as  
little as possible

#2 - configuration
#3 - java APIs

With that in mind, I think 'solr 1.5 with lucene 3.x' makes the most  
sense, unless we foresee making serious changes to solr that would  
warrant a major release bump.


Lucene has an explicit back-compatibility contract:
http://wiki.apache.org/lucene-java/BackwardsCompatibility

I don't know if solr has one...  if we make one, I would like it to  
focus on the URL syntax+configuration


ryan



On Nov 18, 2009, at 5:53 PM, Yonik Seeley wrote:


What should the next version of Solr be?

Options:
- have a Solr 1.5 with a lucene 2.9.x
- have a Solr 1.5 with a lucene 3.x, with weaker back compat given all
of the removed lucene deprecations from 2.9->3.0
- have a Solr 2.0 with a lucene 3.x

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Solr 1.5 or 2.0?

2009-11-19 Thread Mark Miller
Ryan McKinley wrote:
 I would love to set goals that are ~3 months out so that we don't have
 another 1 year release cycle.  For a 2.0 release where we could have
 more back-compatibly flexibility, i would love to see some work that
 may be too ambitious...  In particular, the config spaghetti needs
 some attention.

 I don't see the need to increment solr to 2.0 for the lucene 3.0
 change -- of course that needs to be noted, but incrementing the major
 number in solr only makes sense if we are going to change *solr*
 significantly.
Lucene major numbers don't work that way, and I don't think Solr needs
to work that way by default. I think major numbers are better for
indicating backwards compat issues than major features, with the way
these projects work. Which is why Yonik mentions 1.5 with weaker back
compat - it's not just the fact that we are going to Lucene 3.x - it's
that Solr still relies on some of the APIs that won't be around in 3.x
- they are not all trivial to remove, or to remove while preserving back
compat.


 The lucene 2.x -> 3.0 upgrade path seems independent of that to me.  I
 would even argue that with solr 1.4 we have already required many
 lucene 3.0 changes -- All my custom lucene stuff had to be reworked to
 work with solr 1.4 (tokenizers & multi-reader filters).
Many - but certainly not all.

 In general, I wonder where the solr back-compatibility contract
 applies (and to what degree).  For solr, I would rank the importance as:
 #1 - the URL API syntax.  Client query parameters should change as
 little as possible
 #2 - configuration
 #3 - java APIs
Someone else would likely rank it differently - not everyone using Solr
even uses HTTP with it. Someone heavily involved in custom plugins might
care more about that than config. As a dev, I just plainly rank them all
as important and treat them on a case-by-case basis.

 With that in mind, I think 'solr 1.5 with lucene 3.x' makes the most
 sense, unless we foresee making serious changes to solr that would
 warrant a major release bump.
What is a serious change that would warrant a bump in your opinion?

 Lucene has an explicit back-compatibility contract:
 http://wiki.apache.org/lucene-java/BackwardsCompatibility

 I don't know if solr has one...  if we make one, I would like it to
 focus on the URL syntax+configuration
It's not nice to give people plugins and then not worry about back compat
for them :)

 ryan



 On Nov 18, 2009, at 5:53 PM, Yonik Seeley wrote:

 What should the next version of Solr be?

 Options:
 - have a Solr 1.5 with a lucene 2.9.x
 - have a Solr 1.5 with a lucene 3.x, with weaker back compat given all
 of the removed lucene deprecations from 2.9->3.0
 - have a Solr 2.0 with a lucene 3.x

 -Yonik
 http://www.lucidimagination.com

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-- 
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



CustomScoreQuery Explanation

2009-11-19 Thread Michael Garski
Hi there - 

 

I'm helping out with the Lucene.Net port of 2.9, and when rooting around
in CustomScoreQuery.CustomWeight, I noticed what appears to be an
unnecessary call to doExplain in the explain method.

 

Current method in trunk:

 

public Explanation explain(IndexReader reader, int doc) throws IOException {
  Explanation explain = doExplain(reader, doc);
  return explain == null ? new Explanation(0.0f, "no matching docs")
      : doExplain(reader, doc);
}

 

Is there a reason it shouldn't be:

 

public Explanation explain(IndexReader reader, int doc) throws IOException {
  Explanation explain = doExplain(reader, doc);
  return explain == null ? new Explanation(0.0f, "no matching docs")
      : explain;
}

 

I might be overlooking something, but it appears to be two calls to
doExplain when only one would suffice.

 

Michael

 

Michael Garski

Sr. Search Architect 

310.969.7435 (office)

310.251.6355 (mobile)

www.myspace.com/michaelgarski

 



Re: CustomScoreQuery Explanation

2009-11-19 Thread Simon Willnauer
I don't see any reason why doExplain should be called twice. Can you create
an issue in jira please?

Simon

On Nov 20, 2009 1:30 AM, Michael Garski mgar...@myspace-inc.com wrote:

 Hi there –



I’m helping out with the Lucene.Net port of 2.9, and when rooting around in
CustomScoreQuery.CustomWeight, I noticed what appears to be an unnecessary
call to doExplain in the explain method.



Current method in trunk:



public Explanation explain(IndexReader reader, int doc) throws IOException {
  Explanation explain = doExplain(reader, doc);
  return explain == null ? new Explanation(0.0f, "no matching docs")
      : doExplain(reader, doc);
}



Is there a reason it shouldn’t be:



public Explanation explain(IndexReader reader, int doc) throws IOException {
  Explanation explain = doExplain(reader, doc);
  return explain == null ? new Explanation(0.0f, "no matching docs")
      : explain;
}



I might be overlooking something, but it appears to be two calls to
doExplain when only one would suffice.



Michael



Michael Garski

Sr. Search Architect

310.969.7435 (office)

310.251.6355 (mobile)

www.myspace.com/michaelgarski


RE: CustomScoreQuery Explanation

2009-11-19 Thread Michael Garski
Will do, along with a patch.

 

Michael

 

From: Simon Willnauer [mailto:simon.willna...@googlemail.com] 
Sent: Thursday, November 19, 2009 4:47 PM
To: java-dev@lucene.apache.org
Subject: Re: CustomScoreQuery Explanation

 

I don't see any reason why doExplain should be called twice. Can you create an 
issue in jira please?

Simon

On Nov 20, 2009 1:30 AM, Michael Garski mgar...@myspace-inc.com 
wrote:

Hi there – 

 

I’m helping out with the Lucene.Net port of 2.9, and when rooting 
around in CustomScoreQuery.CustomWeight, I noticed what appears to be an 
unnecessary call to doExplain in the explain method.

 

Current method in trunk:

 

public Explanation explain(IndexReader reader, int doc) throws IOException {
  Explanation explain = doExplain(reader, doc);
  return explain == null ? new Explanation(0.0f, "no matching docs")
      : doExplain(reader, doc);
}

 

Is there a reason it shouldn’t be:

 

public Explanation explain(IndexReader reader, int doc) throws IOException {
  Explanation explain = doExplain(reader, doc);
  return explain == null ? new Explanation(0.0f, "no matching docs")
      : explain;
}

 

I might be overlooking something, but it appears to be two calls to 
doExplain when only one would suffice.

 

Michael

 

Michael Garski

Sr. Search Architect 

310.969.7435 (office)

310.251.6355 (mobile)

www.myspace.com/michaelgarski

 



Re: Solr 1.5 or 2.0?

2009-11-19 Thread Ryan McKinley


On Nov 19, 2009, at 3:34 PM, Mark Miller wrote:


Ryan McKinley wrote:
I would love to set goals that are ~3 months out so that we don't have
another 1 year release cycle.  For a 2.0 release where we could have
more back-compatibility flexibility, I would love to see some work that
may be too ambitious...  In particular, the config spaghetti needs
some attention.

I don't see the need to increment solr to 2.0 for the lucene 3.0
change -- of course that needs to be noted, but incrementing the major
number in solr only makes sense if we are going to change *solr*
significantly.

Lucene major numbers don't work that way, and I don't think Solr needs
to work that way by default. I think major numbers are better for
indicating backwards compat issues than major features, with the way
these projects work. Which is why Yonik mentions 1.5 with weaker back
compat - it's not just the fact that we are going to Lucene 3.x - it's
that Solr still relies on some of the APIs that won't be around in 3.x
- they are not all trivial to remove, or to remove while preserving back
compat.


I confess I don't know the details of the changes that have not yet
been integrated in solr -- the only lucene changes I am familiar with
are the ones required for solr 1.4.









The lucene 2.x -> 3.0 upgrade path seems independent of that to me.  I
would even argue that with solr 1.4 we have already required many
lucene 3.0 changes -- All my custom lucene stuff had to be reworked to
work with solr 1.4 (tokenizers & multi-reader filters).

Many - but certainly not all.


Just my luck...  I'm batting 1000 :)

But that means my code can upgrade to 3.0 without an issue now!




In general, I wonder where the solr back-compatibility contract
applies (and to what degree).  For solr, I would rank the importance as:

#1 - the URL API syntax.  Client query parameters should change as
little as possible
#2 - configuration
#3 - java APIs

Someone else would likely rank it differently - not everyone using Solr
even uses HTTP with it. Someone heavily involved in custom plugins might
care more about that than config. As a dev, I just plainly rank them all
as important and treat them on a case-by-case basis.


I think it is fair to suggest that people will have the most
stable/consistent/seamless upgrade path if you stick to the HTTP API (and by
extension most of the solrj API).

I am not suggesting that the java APIs are not important and that
back-compatibility is not important.  Solr has some APIs with a clear
purpose, place, and intended use -- we need to take these very
seriously.  We also have lots of APIs that are half baked and
loosey-goosey.  If a developer is working on the edges, I think it is fair to
expect more hiccups in the upgrade path.





With that in mind, I think 'solr 1.5 with lucene 3.x' makes the most
sense, unless we foresee making serious changes to solr that would
warrant a major release bump.

What is a serious change that would warrant a bump in your opinion?


for example:
- config overhaul.  detangle the XML from the components, perhaps using spring.
- major URL request changes.  perhaps we change things to be more
RESTful -- perhaps let jersey take care of the URL/request building
(https://jersey.dev.java.net/)
- perhaps OSGi support/control/configuration




Lucene has an explicit back-compatibility contract:
http://wiki.apache.org/lucene-java/BackwardsCompatibility

I don't know if solr has one...  if we make one, I would like it to
focus on the URL syntax+configuration
It's not nice to give people plugins and then not worry about back compat
for them :)


I want to be nice.  I just think that a different back-compatibility
contract applies for solr than for lucene.  It seems reasonable to
consider the HTTP API, configs, and java API independently.


From my perspective, saying solr 1.5 uses lucene 3.0 implies  
everything a plugin developer using lucene APIs needs to know about  
the changes.


To be clear, I am not against bumping to solr 2.0 -- I just have high  
aspirations (yet little time) for what a 2.0 bump could mean for solr.


ryan


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: CustomScoreQuery Explanation

2009-11-19 Thread Mark Miller
No worries - I think it's a bit overkill for the change - I can just pop
it in real quick.

Michael Garski wrote:

 Will do, along with a patch.

  

 Michael

  

 *From:* Simon Willnauer [mailto:simon.willna...@googlemail.com]
 *Sent:* Thursday, November 19, 2009 4:47 PM
 *To:* java-dev@lucene.apache.org
 *Subject:* Re: CustomScoreQuery Explanation

  

 I don't see any reason why doExplain should be called twice. Can you
 create an issue in jira please?

 Simon

 On Nov 20, 2009 1:30 AM, Michael Garski mgar...@myspace-inc.com wrote:

 Hi there –

  

 I’m helping out with the Lucene.Net port of 2.9, and when rooting
 around in CustomScoreQuery.CustomWeight, I noticed what appears to
 be an unnecessary call to doExplain in the explain method.

  

 Current method in trunk:

  

 public Explanation explain(IndexReader reader, int doc) throws IOException {
   Explanation explain = doExplain(reader, doc);
   return explain == null ? new Explanation(0.0f, "no matching docs")
       : doExplain(reader, doc);
 }

  

 Is there a reason it shouldn’t be:

  

 public Explanation explain(IndexReader reader, int doc) throws IOException {
   Explanation explain = doExplain(reader, doc);
   return explain == null ? new Explanation(0.0f, "no matching docs")
       : explain;
 }

  

 I might be overlooking something, but it appears to be two calls
 to doExplain when only one would suffice.

  

 Michael

  

 Michael Garski

 Sr. Search Architect

 310.969.7435 (office)

 310.251.6355 (mobile)

 www.myspace.com/michaelgarski

  



-- 
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-11-19 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780368#action_12780368
 ] 

Erick Erickson commented on LUCENE-2037:


Well, last night I changed LocalizedTestCase to do the @RunWith and
@Parameterized thing and it works just fine with a minimal change to
subclasses, mainly adding @Test and a c'tor with a Locale parameter. In total,
it adds probably a minute to the test run.

About the cross product of versions and locales: the @Parameterized thingy
returns a list of Object[], where the elements of the list are matched
against a c'tor. So if each Object[] in your list has, say, an (int, float,
int), then as long as you have a matching c'tor with a signature that takes
an (int, float, int) you're good to go. So to handle the m x n case you
mentioned, if your @Parameters method returned a list of Object[], one
Object[] for each (Locale, Version) pair, you'd get all your Locales run
against all your versions.
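A minimal sketch of that m x n cross product with JUnit 4's Parameterized
runner (the class and test names here are illustrative, not from the patch):
{code}
import java.util.ArrayList;
import java.util.Collection;
import java.util.Locale;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class LocalizedVersionTest {
  private final Locale locale;
  private final String version;

  // Each Object[] from data() is matched against this c'tor.
  public LocalizedVersionTest(Locale locale, String version) {
    this.locale = locale;
    this.version = version;
  }

  @Parameters
  public static Collection<Object[]> data() {
    Collection<Object[]> params = new ArrayList<Object[]>();
    String[] versions = { "2.9", "3.0" };
    for (Locale locale : Locale.getAvailableLocales()) {
      for (String version : versions) {
        params.add(new Object[] { locale, version }); // one run per pair
      }
    }
    return params;
  }

  @Test
  public void testSomething() {
    // the test body would exercise the code under this.locale / this.version
  }
}
{code}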

Whether we *want* this to happen or not is another question. It's a
worthwhile question whether we really *need* to run all the possible locales
or if there's a subset of locales that would serve.

It's kind of ironic that I have a patch waiting to be applied that cuts down
on the time it takes to run the unit tests and another patch that adds to
the time it takes. Two steps forward, one step back and a jink sideways just
for fun.

Best
Erick




 Allow Junit4 tests in our environment.
 --

 Key: LUCENE-2037
 URL: https://issues.apache.org/jira/browse/LUCENE-2037
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Affects Versions: 3.1
 Environment: Development
Reporter: Erick Erickson
Assignee: Erick Erickson
Priority: Minor
 Fix For: 3.1

 Attachments: junit-4.7.jar, LUCENE-2037.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 Now that we're dropping Java 1.4 compatibility for 3.0, we can incorporate 
 Junit4 in testing. Junit3 and junit4 tests can coexist, so no tests should 
 have to be rewritten. We should start this for the 3.1 release so we can get 
 a clean 3.0 out smoothly.
 It's probably worthwhile to convert a small set of tests as an exemplar.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-11-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780388#action_12780388
 ] 

Robert Muir commented on LUCENE-2037:
-

{quote}
It's a worthwhile question whether we really need to run all the possible 
locales
or if there's a subset of locales that would serve.
{quote}

I won't rant too much on this, except to say that before this LocalizedTestCase,
various parts failed under, say, only the Korean or only the Thai locale; it was
always a corner case.

I think it's important that someone from, say, Korea can download the lucene
source code and run 'ant test'. How else are they supposed to contribute if
this does not work?

 Allow Junit4 tests in our environment.
 --

 Key: LUCENE-2037
 URL: https://issues.apache.org/jira/browse/LUCENE-2037
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Affects Versions: 3.1
 Environment: Development
Reporter: Erick Erickson
Assignee: Erick Erickson
Priority: Minor
 Fix For: 3.1

 Attachments: junit-4.7.jar, LUCENE-2037.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 Now that we're dropping Java 1.4 compatibility for 3.0, we can incorporate 
 Junit4 in testing. Junit3 and junit4 tests can coexist, so no tests should 
 have to be rewritten. We should start this for the 3.1 release so we can get 
 a clean 3.0 out smoothly.
 It's probably worthwhile to convert a small set of tests as an exemplar.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780430#action_12780430
 ] 

Mark Miller commented on LUCENE-1606:
-

So Robert - what do you think about paring down the automaton lib, and shoving 
all this in core? I want it, I want it, I want it :)

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, 
 LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.
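
A minimal sketch of that enumerate-and-seek loop against the TermEnum API
(accepts and nextValidString are hypothetical stand-ins for the patch's DFA
logic, which is not reproduced here):

{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public abstract class EnumerateAndSeekSketch {

  // Sketch of the described loop: rather than a binary accept/reject of
  // every term, seek straight past whole rejected regions.
  void collectMatches(IndexReader reader, String field) throws IOException {
    String seekTo = "";                        // smallest possible term text
    while (seekTo != null) {
      TermEnum terms = reader.terms(new Term(field, seekTo));
      try {
        Term t = terms.term();                 // first term >= seekTo
        if (t == null || !t.field().equals(field)) break;
        if (accepts(t.text())) {
          // ... collect t, then continue from the term right after it
          seekTo = t.text() + '\u0000';
        } else {
          // jump to the next string that could possibly be accepted,
          // based on where the DFA entered a reject state
          seekTo = nextValidString(t.text());
        }
      } finally {
        terms.close();
      }
    }
  }

  abstract boolean accepts(String term);         // run the DFA on the term
  abstract String nextValidString(String term);  // DFA-driven successor
}
{code}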

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780439#action_12780439
 ] 

Robert Muir commented on LUCENE-1606:
-

By the way Mark, in case you are interested, the TermEnum here still has
problems with the 'Kleene star', as I have mentioned many times.
So a wildcard of ?abacadaba is fast while a wildcard of *abacadaba is still slow;
in the same manner, a regex of .abacadaba is fast while a regex of .*abacadaba is
still slow.

But there are algorithms to reverse an entire DFA, so you could use
ReverseStringFilter and support wildcards AND regexps with a leading *.
I didn't implement that here yet, though.
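
For the wildcard half of that idea no DFA reversal is even needed, since a
wildcard pattern itself is easy to reverse. A minimal sketch, assuming a
hypothetical contents_rev field that was indexed through ReverseStringFilter:

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class ReversedWildcardSketch {
  public static void main(String[] args) {
    // To run the slow leading-* wildcard "*abacadaba" against a field whose
    // terms were indexed in reversed order, reverse the pattern body and
    // move the '*' to the end:
    String pattern = "*abacadaba";
    String reversed = new StringBuilder(pattern.substring(1)).reverse()
        .append('*').toString();                 // -> "abadacaba*"
    Query q = new WildcardQuery(new Term("contents_rev", reversed));
    System.out.println(q);                       // contents_rev:abadacaba*
  }
}
{code}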


 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, 
 LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780440#action_12780440
 ] 

Mark Miller edited comment on LUCENE-1606 at 11/20/09 5:10 AM:
---

{quote}
By the way Mark, in case you are interested, the TermEnum here still has
problems with the 'Kleene star', as I have mentioned many times.
So a wildcard of ?abacadaba is fast while a wildcard of *abacadaba is still slow;
in the same manner, a regex of .abacadaba is fast while a regex of .*abacadaba is
still slow.
{quote}

No problem in my mind - nothing the current WildcardQuery doesn't face. Any
reason we wouldn't want to replace the current WCQ with this?

{quote}
But there are algorithms to reverse an entire DFA, so you could use
ReverseStringFilter and support wildcards AND regexps with a leading *.
I didn't implement that here yet, though.
{quote}

Now that sounds interesting - not sure I fully understand you though - are you
saying we can do a prefix match, but without having to index terms reversed in
the index? That would be very cool.

  was (Author: markrmil...@gmail.com):
{code}
By the way Mark, in case you are interested, the TermEnum here still has
problems with the 'Kleene star', as I have mentioned many times.
So a wildcard of ?abacadaba is fast while a wildcard of *abacadaba is still slow;
in the same manner, a regex of .abacadaba is fast while a regex of .*abacadaba is
still slow.
{code}

No problem in my mind - nothing the current WildcardQuery doesn't face. Any
reason we wouldn't want to replace the current WCQ with this?

{quote}
But there are algorithms to reverse an entire DFA, so you could use
ReverseStringFilter and support wildcards AND regexps with a leading *.
I didn't implement that here yet, though.
{quote}

Now that sounds interesting - not sure I fully understand you though - are you
saying we can do a prefix match, but without having to index terms reversed in
the index? That would be very cool.
  
 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, 
 LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780440#action_12780440
 ] 

Mark Miller commented on LUCENE-1606:
-

{code}
By the way Mark, in case you are interested, the TermEnum here still has
problems with the 'Kleene star', as I have mentioned many times.
So a wildcard of ?abacadaba is fast while a wildcard of *abacadaba is still slow;
in the same manner, a regex of .abacadaba is fast while a regex of .*abacadaba is
still slow.
{code}

No problem in my mind - nothing the current WildcardQuery doesn't face. Any
reason we wouldn't want to replace the current WCQ with this?

{quote}
But there are algorithms to reverse an entire DFA, so you could use
ReverseStringFilter and support wildcards AND regexps with a leading *.
I didn't implement that here yet, though.
{quote}

Now that sounds interesting - not sure I fully understand you though - are you
saying we can do a prefix match, but without having to index terms reversed in
the index? That would be very cool.

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, 
 LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780441#action_12780441
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. No problem in my mind - nothing the current WildcardQuery doesn't face. Any
reason we wouldn't want to replace the current WCQ with this?

I don't think there is any issue. By implementing WildcardQuery with the DFA,
a leading ? is no longer a problem;
I mean, depending on your term dictionary, if you do something stupid like
???abacadaba it probably won't be that fast.

I spent a lot of time with the worst-case regexes and wildcards to ensure
performance is at least as good as the other alternatives.
There is only one exception: the leading * wildcard is a bit slower with a DFA
than if you ran it with the actual WildcardQuery (less than 5% in my tests).
Because of this, the patch currently rewrites this very special case to a
standard WildcardQuery.
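
A hypothetical sketch of that special-case dispatch (buildDfaQuery stands in
for the patch's DFA-based query, which is not reproduced here):

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public final class WildcardRewriteSketch {

  // True for patterns like "*foo": one leading '*' and no other wildcard
  // characters - the one case where the classic WildcardQuery measured faster.
  static boolean isPureLeadingStar(String pattern) {
    return pattern.length() > 1
        && pattern.charAt(0) == '*'
        && pattern.indexOf('*', 1) < 0
        && pattern.indexOf('?') < 0;
  }

  static Query wildcardQuery(Term term) {
    if (isPureLeadingStar(term.text())) {
      return new WildcardQuery(term);   // classic linear scan wins here
    }
    return buildDfaQuery(term);         // DFA-based query everywhere else
  }

  // Stand-in for the patch's automaton-backed query construction.
  private static Query buildDfaQuery(Term term) {
    throw new UnsupportedOperationException("see the LUCENE-1606 patch");
  }
}
{code}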

bq. Now that sounds interesting - not sure I fully understand you though - are
you saying we can do a prefix match, but without having to index terms reversed
in the index? That would be very cool.

No, what I am saying is that you still have to index the terms in reversed
order for the leading */.* case, except then this reversing buys you faster
wildcard AND regex queries :)


 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, 
 LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780441#action_12780441
 ] 

Robert Muir edited comment on LUCENE-1606 at 11/20/09 5:16 AM:
---

bq. No problem in my mind - nothing the current WildcardQuery doesn't face. Any
reason we wouldn't want to replace the current WCQ with this?

I don't think there is any issue. By implementing WildcardQuery with the DFA,
a leading ? is no longer a problem;
I mean, depending on your term dictionary, if you do something stupid like
???abacadaba it probably won't be that fast.

I spent a lot of time with the worst-case regexes and wildcards to ensure
performance is at least as good as the other alternatives.
There is only one exception: the leading * wildcard is a bit slower with a DFA
than if you ran it with the actual WildcardQuery (less than 5% in my tests).
Because of this, the patch currently rewrites this very special case to a
standard WildcardQuery.

bq. Now that sounds interesting - not sure I fully understand you though - are
you saying we can do a prefix match, but without having to index terms reversed
in the index? That would be very cool.

No, what I am saying is that you still have to index the terms in reversed
order for the leading * or .* case, except then this reversing buys you faster
wildcard AND regex queries :)


  was (Author: rcmuir):
bq. No problem in my mind - nothing the current WildcardQuery doesn't face.
Any reason we wouldn't want to replace the current WCQ with this?

I don't think there is any issue. By implementing WildcardQuery with the DFA,
a leading ? is no longer a problem;
I mean, depending on your term dictionary, if you do something stupid like
???abacadaba it probably won't be that fast.

I spent a lot of time with the worst-case regexes and wildcards to ensure
performance is at least as good as the other alternatives.
There is only one exception: the leading * wildcard is a bit slower with a DFA
than if you ran it with the actual WildcardQuery (less than 5% in my tests).
Because of this, the patch currently rewrites this very special case to a
standard WildcardQuery.

bq. Now that sounds interesting - not sure I fully understand you though - are
you saying we can do a prefix match, but without having to index terms reversed
in the index? That would be very cool.

No, what I am saying is that you still have to index the terms in reversed
order for the leading */.* case, except then this reversing buys you faster
wildcard AND regex queries :)

  
 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, 
 LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780445#action_12780445
 ] 

Mark Miller commented on LUCENE-1606:
-

Okay - still not an issue, I think - leading wildcards are already an issue,
and 5% is worth the other speedups - though you've taken care of that anyway -
so it sounds like gold to me. I didn't expect this to solve leading-wildcard
issues, so no loss to me.

bq. No, what I am saying is that you still have to index the terms in reversed 
order for the leading * or .* case, except then this reversing buys you faster 
wildcard AND regex queries 

bummer :) Does it make sense to implement here though? Isn't the 
ReverseStringFilter enough if a user wants to go this route? Solr's support for 
this is fairly good, but I don't think it needs to be as 'built in' for Lucene?

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, 
 LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780447#action_12780447
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. Does it make sense to implement here though?

I do not think so. I tested another solution where users wanted leading *
wildcards on a 100M+ term dictionary.
I found that what was acceptable was for * to actually match .{0,3} (between 0
and 3 of anything), and rewrote it to an equivalent regex like this.
This performed very well, because it can still avoid comparing many terms.
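
A minimal sketch of that rewrite with the BRICS classes (the bound of 3 is
just the example's choice, and a real rewrite would escape any regex
metacharacters in the remainder):

{code}
import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;

public class BoundedStarSketch {
  public static void main(String[] args) {
    // Rewrite the leading-* wildcard so '*' matches at most 3 characters:
    // "*abacadaba" -> regex ".{0,3}abacadaba"
    String wildcard = "*abacadaba";
    String bounded = ".{0,3}" + wildcard.substring(1);
    Automaton dfa = new RegExp(bounded).toAutomaton();

    System.out.println(dfa.run("abacadaba"));      // true  (0 extra chars)
    System.out.println(dfa.run("xyzabacadaba"));   // true  (3 extra chars)
    System.out.println(dfa.run("wxyzabacadaba"));  // false (4 extra chars)
  }
}
{code}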

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, 
 LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780447#action_12780447
 ] 

Robert Muir edited comment on LUCENE-1606 at 11/20/09 5:31 AM:
---

bq. Does it make sense to implement here though?

I do not think so. I tested another solution where users wanted leading *
wildcards on a 100M+ term dictionary.
I found that what was acceptable (clarification: to these specific users/system)
was for * to actually match .{0,3} (between 0 and 3 of anything), and rewrote
it to an equivalent regex like this.
This performed very well, because it can still avoid comparing many terms.

  was (Author: rcmuir):
bq. Does it make sense to implement here though?

I do not think so. I tested another solution where users wanted leading *
wildcards on a 100M+ term dictionary.
I found that what was acceptable was for * to actually match .{0,3} (between 0
and 3 of anything), and rewrote it to an equivalent regex like this.
This performed very well, because it can still avoid comparing many terms.
  
 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, 
 LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780452#action_12780452
 ] 

Mark Miller commented on LUCENE-1606:
-

That is a cool tradeoff to be able to make.

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, 
 LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Solr 1.5 or 2.0?

2009-11-19 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Fri, Nov 20, 2009 at 6:30 AM, Ryan McKinley ryan...@gmail.com wrote:

 On Nov 19, 2009, at 3:34 PM, Mark Miller wrote:

 Ryan McKinley wrote:

 I would love to set goals that are ~3 months out so that we don't have
 another 1-year release cycle.  For a 2.0 release where we could have
 more back-compatibility flexibility, I would love to see some work that
 may be too ambitious...  In particular, the config spaghetti needs
 some attention.

 I don't see the need to increment solr to 2.0 for the lucene 3.0
 change -- of course that needs to be noted, but incrementing the major
 number in solr only makes sense if we are going to change *solr*
 significantly.

 Lucene major numbers don't work that way, and I don't think Solr needs
 to work that way by default. I think major numbers are better for
 indicating backwards-compat issues than major features, given the way
 these projects work. Which is why Yonik mentions 1.5 with weaker back
 compat - it's not just the fact that we are going to Lucene 3.x - it's
 that Solr still relies on some of the APIs that won't be around in 3.x
 - they are not all trivial to remove, or to remove while preserving back
 compat.

 I confess I don't know the details of the changes that have not yet been
 integrated in solr -- the only lucene changes I am familiar with are the
 ones that were required for solr 1.4.

 The lucene 2.x - 3.0 upgrade path seems independent of that to me.  I
 would even argue that with solr 1.4 we have already required many
 lucene 3.0 changes -- all my custom lucene stuff had to be reworked to
 work with solr 1.4 (tokenizers & multi-reader filters).

 Many - but certainly not all.

 Just my luck...  I'm batting 1000 :)

 But that means my code can upgrade to 3.0 without an issue now!



 In general, I wonder where the solr back-compatibility contract
 applies (and to what degree).  For solr, I would rank the importance as:
 #1 - the URL API syntax.  Client query parameters should change as
 little as possible
 #2 - configuration
 #3 - java APIs

 Someone else would likely rank it differently - not everyone using Solr
 even uses HTTP with it. Someone heavily involved in custom plugins might
 care more about that than config. As a dev, I just plainly rank them all
 as important and treat them on a case by case basis.

 I think it is fair to suggest that people will have the most
 stable/consistent/seamless upgrade path if you stick to the HTTP API (and by
 extension most of the solrj API)

 I am not suggesting that the java APIs are not important and that
 back-compatibility is not important.  Solr has some APIs with a clear
 purpose, place, and intended use -- we need to take these very seriously.
 We also have lots of APIs that are half-baked and loosey-goosey.  If a
 developer is working on the edges, I think it is fair to expect more hiccups
 in the upgrade path.



 With that in mind, I think 'solr 1.5 with lucene 3.x' makes the most
 sense, unless we foresee making serious changes to solr that would
 warrant a major release bump.
solr 1.5 with lucene 3.x is a good option.
Solr 2.0 can have non-back-compat changes for Solr itself, e.g.
removing the single-core option, changing configuration, REST API
changes, etc.

 What is a serious change that would warrant a bump in your opinion?

 for example:
 - config overhaul.  detangle the XML from the components.  perhaps using
 spring.
This is already done. No components read config from XML anymore (SOLR-1198).
 - major URL request changes.  perhaps we change things to be more RESTful --
 perhaps let jersey take care of the URL/request building
 https://jersey.dev.java.net/
 - perhaps OSGi support/control/configuration



 Lucene has an explicit back-compatibility contract:
 http://wiki.apache.org/lucene-java/BackwardsCompatibility

 I don't know if solr has one...  if we make one, I would like it to
 focus on the URL syntax+configuration

 It's not nice to give people plugins and then not worry about back compat
 for them :)

 I want to be nice.  I just think that a different back-compatibility
 contract applies for solr than for lucene.  It seems reasonable to consider
 the HTTP API, configs, and java API independently.

 From my perspective, saying solr 1.5 uses lucene 3.0 implies everything a
 plugin developer using lucene APIs needs to know about the changes.

 To be clear, I am not against bumping to solr 2.0 -- I just have high
 aspirations (yet little time) for what a 2.0 bump could mean for solr.

 ryan


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780471#action_12780471
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. That is a cool tradeoff to be able to make. 

Mark, yes. I guess someone could implement the DFA reversing if they wanted to,
to enable leading .* regex support with ReverseStringFilter.
You can still use this wildcard impl with ReverseStringFilter just like the
core wildcard impl, because it's just so easy to reverse a wildcard string.

But you don't want to try to reverse a regular expression! That would be hairy;
it is easier to reverse a DFA.

But even without this, there are tons of workarounds, like the tradeoff I
mentioned earlier.
Also, another one that might not be apparent is that it's only the leading .*
that is a problem, depending on the corpus of course.

[a-z].*abacadaba will avoid visiting terms that start with 1, 2, 3 or are in
Chinese, etc., which might be a nice improvement.
Of course, if all your terms start with a-z, then it's going to be the same as
entering .*abacadaba, and be bad.

It all depends on how selective the regular expression is with respect to your
terms.
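
To make the selectivity point concrete, a small sketch with the BRICS classes
(the term strings are invented):

{code}
import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;

public class SelectivitySketch {
  public static void main(String[] args) {
    // A constrained first character lets the enumeration skip whole regions
    // of the term dictionary (digits, CJK, ...):
    Automaton picky = new RegExp("[a-z].*abacadaba").toAutomaton();
    System.out.println(picky.run("xabacadaba"));   // true:  starts in [a-z]
    System.out.println(picky.run("3abacadaba"));   // false: rejected at the
                                                   // very first character
    // An unconstrained leading .* accepts any prefix, so nothing can be
    // skipped and the scan degenerates to visiting every term:
    Automaton greedy = new RegExp(".*abacadaba").toAutomaton();
    System.out.println(greedy.run("3abacadaba"));  // true
  }
}
{code}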


 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, 
 LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org