[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734015#action_12734015
 ] 

Luis Alves commented on LUCENE-1486:


I share the same opinion as Michael:
the implementation has a lot of undefined/undocumented behaviors,
simply because it reuses the queryparser to parse the text inside a phrase.
All of the Lucene syntax needs to be accounted for in this design, but that
does not seem to be the case.

There are problems like the ones Adriano described: phrases inside phrases,
position reporting for errors.

I also have a lot of concerns about allowing the full Lucene syntax inside
phrases; trying to restrict this by throwing exceptions for particular cases
does not seem like the best design.

Here is an example with OR, AND and parentheses combined with a proximity search:
"((jakarta OR green) AND (blue AND orange) AND black~2) apache"~10

What should a user expect from this query without looking at the code? I'm not
sure.
Does it even make sense to support such complex syntax? In my opinion, no.

I think we should define the subset of the language we want to support
inside phrases, with well-defined behavior.
If Mark describes all the syntax he wants to support inside phrases, I
actually don't mind implementing a new parser for this.

My view is that contrib is probably a better place for this code, until we
figure out an implementation that does not impose as many restrictions on
changes to the original queryparser and describes a well-defined syntax to be
applied inside phrases.



 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept for much of the query 
 parser syntax. Examples from the JUnit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus JUnit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1460) Change all contrib TokenStreams/Filters to use the new TokenStream API

2009-07-22 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734016#action_12734016
 ] 

Simon Willnauer commented on LUCENE-1460:
-

bq. It seems like 1728 is ready to commit? Simon said on java-dev he will try 
to finish it by end of this week?

That is correct. I can commit it today, I think. I will make this issue dependent 
on 1728 and finish it by the end of today.

simon


 Change all contrib TokenStreams/Filters to use the new TokenStream API
 --

 Key: LUCENE-1460
 URL: https://issues.apache.org/jira/browse/LUCENE-1460
 Project: Lucene - Java
  Issue Type: Task
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1460_contrib_partial.txt, LUCENE-1460_core.txt, 
 LUCENE-1460_partial.txt


 Now that we have the new TokenStream API (LUCENE-1422) we should change all 
 contrib modules to use it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

2009-07-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734022#action_12734022
 ] 

Michael Busch commented on LUCENE-1448:
---

OK, I think I have this basically working with both the old and the new API 
(including the 1693 changes).

The approach I took is fairly simple; it doesn't require adding a new 
Attribute. I added the following method to TokenStream:

{code:java}
  /**
   * This method is called by the consumer after the last token has been consumed,
   * i.e. after {@link #incrementToken()} returned <code>false</code>
   * (using the new TokenStream API) or after {@link #next(Token)} or
   * {@link #next()} returned <code>null</code> (old TokenStream API).
   * <p/>
   * This method can be used to perform any end-of-stream operations, such as
   * setting the final offset of a stream. The final offset of a stream might
   * differ from the offset of the last token, e.g. in case one or more
   * whitespace characters followed the last token, but a
   * {@link WhitespaceTokenizer} was used.
   * <p/>
   *
   * @throws IOException
   */
  public void end() throws IOException {
    // do nothing by default
  }
{code}

Then I took Mike's patch and implemented end() in all classes where his patch 
added getFinalOffset(). 
E.g. in CharTokenizer the implementation looks like this:

{code:java}
  public void end() {
    // set final offset
    int finalOffset = input.correctOffset(offset);
    offsetAtt.setOffset(finalOffset, finalOffset);
  }
{code}

I changed DocInverterPerField to call end() after the stream is fully consumed 
and to use what offsetAttribute.endOffset() returns as the final offset.
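
For illustration, the consumer-side pattern could look roughly like this (a 
sketch only, not the actual DocInverterPerField change; analyzer, fieldName and 
reader are placeholders):

{code:java}
// Sketch only: how a consumer picks up the final offset via end().
TokenStream stream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAtt =
    (OffsetAttribute) stream.addAttribute(OffsetAttribute.class);
while (stream.incrementToken()) {
  // invert the token as usual
}
stream.end(); // lets the stream account for trailing whitespace etc.
int finalOffset = offsetAtt.endOffset(); // base for the next Fieldable instance
{code}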

I also added all the new tests from Mike's latest patch. 
All unit tests, including the new ones, pass; so does test-tag.

I'm not posting a patch yet, because this depends on 1693.

Mike, Uwe, others: could you please review if this approach makes sense?

 add getFinalOffset() to TokenStream
 ---

 Key: LUCENE-1448
 URL: https://issues.apache.org/jira/browse/LUCENE-1448
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Reporter: Michael McCandless
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
 LUCENE-1448.patch


 If you add multiple Fieldable instances for the same field name to a 
 document, and you then index those fields with TermVectors storing offsets, 
 it's very likely the offsets for all but the first field instance will be 
 wrong.
 This is because IndexWriter under the hood adds a cumulative base to the 
 offsets of each field instance, where that base is 1 + the endOffset of the 
 last token it saw when analyzing that field.
 But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
 is being used, and the text being analyzed ended in 3 whitespace characters, 
 then that information is lost and then next field's offsets are then all 3 
 too small.  Similarly, if a StopFilter appears in the chain, and the last N 
 tokens were stop words, then the base will be 1 + the endOffset of the last 
 non-stopword token.
 To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
 thinking by default it returns -1, which means I don't know so you figure it 
 out, meaning we fallback to the faulty logic we have today.
 This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1460) Change all contrib TokenStreams/Filters to use the new TokenStream API

2009-07-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734024#action_12734024
 ] 

Michael Busch commented on LUCENE-1460:
---

Cool! Thanks, Simon.

 Change all contrib TokenStreams/Filters to use the new TokenStream API
 --

 Key: LUCENE-1460
 URL: https://issues.apache.org/jira/browse/LUCENE-1460
 Project: Lucene - Java
  Issue Type: Task
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1460_contrib_partial.txt, LUCENE-1460_core.txt, 
 LUCENE-1460_partial.txt


 Now that we have the new TokenStream API (LUCENE-1422) we should change all 
 contrib modules to use it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

2009-07-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734023#action_12734023
 ] 

Michael Busch commented on LUCENE-1448:
---

Hmm one thing I haven't done yet is changing Tee/Sink and CachingTokenFilter.

But it should be simple: CachingTokenFilter.end() should call input.end() when 
it is called for the first time and store the captured state locally as 
finalState. 
Then whenever CachingTokenFilter.end() is called again, it just restores the
finalState.

For Tee/Sink it should work similarly: The tee just puts a finalState into the
sink(s) the first time end() is called. And when end() of a sink is called it 
restores the finalState.

This should work?
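
A minimal sketch of that idea for CachingTokenFilter (illustrative only, not 
committed code):

{code:java}
// Sketch of the proposed behavior: capture the end state once,
// restore it on every later end() call (i.e. on replays).
private AttributeSource.State finalState;

public void end() throws IOException {
  if (finalState == null) {
    input.end();                 // first call: delegate to the wrapped stream
    finalState = captureState(); // remember final offset etc.
  } else {
    restoreState(finalState);    // replay: just restore the captured state
  }
}
{code}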

 add getFinalOffset() to TokenStream
 ---

 Key: LUCENE-1448
 URL: https://issues.apache.org/jira/browse/LUCENE-1448
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Reporter: Michael McCandless
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
 LUCENE-1448.patch


 If you add multiple Fieldable instances for the same field name to a 
 document, and you then index those fields with TermVectors storing offsets, 
 it's very likely the offsets for all but the first field instance will be 
 wrong.
 This is because IndexWriter under the hood adds a cumulative base to the 
 offsets of each field instance, where that base is 1 + the endOffset of the 
 last token it saw when analyzing that field.
 But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
 is being used, and the text being analyzed ended in 3 whitespace characters, 
 then that information is lost and then next field's offsets are then all 3 
 too small.  Similarly, if a StopFilter appears in the chain, and the last N 
 tokens were stop words, then the base will be 1 + the endOffset of the last 
 non-stopword token.
 To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
 thinking by default it returns -1, which means I don't know so you figure it 
 out, meaning we fallback to the faulty logic we have today.
 This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

2009-07-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734025#action_12734025
 ] 

Michael Busch commented on LUCENE-1448:
---

Hmm another reason why I don't like two Tees feeding one Sink:

What is the finalOffset and finalState then?

 add getFinalOffset() to TokenStream
 ---

 Key: LUCENE-1448
 URL: https://issues.apache.org/jira/browse/LUCENE-1448
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Reporter: Michael McCandless
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
 LUCENE-1448.patch


 If you add multiple Fieldable instances for the same field name to a 
 document, and you then index those fields with TermVectors storing offsets, 
 it's very likely the offsets for all but the first field instance will be 
 wrong.
 This is because IndexWriter under the hood adds a cumulative base to the 
 offsets of each field instance, where that base is 1 + the endOffset of the 
 last token it saw when analyzing that field.
 But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
 is being used, and the text being analyzed ended in 3 whitespace characters, 
 then that information is lost and then next field's offsets are then all 3 
 too small.  Similarly, if a StopFilter appears in the chain, and the last N 
 tokens were stop words, then the base will be 1 + the endOffset of the last 
 non-stopword token.
 To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
 thinking by default it returns -1, which means I don't know so you figure it 
 out, meaning we fallback to the faulty logic we have today.
 This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-22 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1693:
--

Description: 
This patch makes the following improvements to AttributeSource and
TokenStream/Filter:

- introduces interfaces for all Attributes. The corresponding
  implementations have the postfix 'Impl', e.g. TermAttribute and
  TermAttributeImpl. AttributeSource now has a factory for creating
  the Attribute instances; the default implementation looks for
  implementing classes with the postfix 'Impl'. Token now implements
  all 6 TokenAttribute interfaces.

- new method added to AttributeSource:
  addAttributeImpl(AttributeImpl). Using reflection it walks up in the
  class hierarchy of the passed in object and finds all interfaces
  that the class or superclasses implement and that extend the
  Attribute interface. It then adds the interface-instance mappings
  to the attribute map for each of the found interfaces.

- removes the set/getUseNewAPI() methods (including the standard
  ones). Instead it is now enough to only implement the new API;
  if an old TokenStream still implements the old API (next()/next(Token)),
  it is wrapped automatically. The delegation path is determined via
  reflection (the patch determines which of the three methods was
  overridden).

- Token is no longer deprecated; instead it implements all 6 standard
  token interfaces (see above). The wrapper for next() and next(Token)
  uses this to automatically map all attribute interfaces to one
  TokenWrapper instance (implementing all 6 interfaces) that contains
  a Token instance. next() and next(Token) exchange the inner Token
  instance as needed. For the new incrementToken(), only one
  TokenWrapper instance is visible, delegating to the correct reusable
  Token. This API also preserves custom Token subclasses that may be
  created by very special token streams (see example in Backwards-Test).

- AttributeImpl now has a default implementation of toString that uses
  reflection to print out the values of the attributes in a default
  formatting. This makes it a bit easier to implement AttributeImpl,
  because toString() was declared abstract before.

- Cloning is now done much more efficiently in
  captureState. The method figures out which unique AttributeImpl
  instances are contained as values in the attributes map, because
  those are the ones that need to be cloned. It creates a single
  linked list that supports deep cloning (in the inner class
  AttributeSource.State). AttributeSource keeps track of when this
  state changes, i.e. whenever new attributes are added to the
  AttributeSource. Only in that case will captureState recompute the
  state, otherwise it will simply clone the precomputed state and
  return the clone. restoreState(AttributeSource.State) walks the
  linked list and uses the copyTo() method of AttributeImpl to copy
  all values over into the attribute that the source stream
  (e.g. SinkTokenizer) uses. 

The cloning performance can be greatly improved if multiple
AttributeImpl instances are not used in one TokenStream. A user can
e.g. simply add a Token instance to the stream instead of the individual
attributes. Or the user could implement a subclass of AttributeImpl that
implements exactly the Attribute interfaces needed. I think this
should be considered an expert API (addAttributeImpl), as this manual
optimization is only needed if cloning performance is crucial. I ran
some quick performance tests using Tee/Sink tokenizers (which do
cloning) and the performance was roughly 20% faster with the new
API. I'll run some more performance tests and post more numbers then.
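
As an illustration of that manual optimization, a sketch against the API 
described in this patch:

{code:java}
// Sketch: register a single Token (which implements all 6 attribute
// interfaces) so captureState() only has to clone one instance.
Token token = new Token();
stream.addAttributeImpl(token); // maps all 6 interfaces to this one impl
TermAttribute termAtt =
    (TermAttribute) stream.getAttribute(TermAttribute.class);
OffsetAttribute offsetAtt =
    (OffsetAttribute) stream.getAttribute(OffsetAttribute.class);
// termAtt and offsetAtt now read and write the same Token instance
{code}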

Note also that when we add serialization to the Attributes, e.g. for
supporting storing serialized TokenStreams in the index, then the
serialization should benefit even significantly more from the new API
than cloning. 

This issue contains one backwards-compatibility break:
TokenStreams/Filters/Tokenizers should normally be final (see LUCENE-1753 for 
the explanation). Some of these core classes are not final, so one could 
override the next() or next(Token) methods. In this case, the backwards-wrapper 
would automatically use incrementToken(), because it is implemented, so the 
overridden method would never be called. To prevent users from errors not 
visible during compilation or testing (the streams would just behave wrongly), 
this patch makes all implementation methods final (next(), next(Token), 
incrementToken()) whenever the class itself is not final. This is a BW break, 
but users will clearly see that they have done something unsupported and should 
instead create a custom TokenFilter with their additional implementation 
(instead of extending a core implementation).

For further changing contrib token streams, the following procedure should be 
used:

*  rewrite and replace next(Token)/next() implementations by new API
* if the 

[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-22 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1693:
--

Description: 
This patch makes the following improvements to AttributeSource and
TokenStream/Filter:

- introduces interfaces for all Attributes. The corresponding
  implementations have the postfix 'Impl', e.g. TermAttribute and
  TermAttributeImpl. AttributeSource now has a factory for creating
  the Attribute instances; the default implementation looks for
  implementing classes with the postfix 'Impl'. Token now implements
  all 6 TokenAttribute interfaces.

- new method added to AttributeSource:
  addAttributeImpl(AttributeImpl). Using reflection it walks up in the
  class hierarchy of the passed in object and finds all interfaces
  that the class or superclasses implement and that extend the
  Attribute interface. It then adds the interface-instance mappings
  to the attribute map for each of the found interfaces.

- removes the set/getUseNewAPI() methods (including the standard
  ones). Instead it is now enough to only implement the new API;
  if an old TokenStream still implements the old API (next()/next(Token)),
  it is wrapped automatically. The delegation path is determined via
  reflection (the patch determines which of the three methods was
  overridden).

- Token is no longer deprecated; instead it implements all 6 standard
  token interfaces (see above). The wrapper for next() and next(Token)
  uses this to automatically map all attribute interfaces to one
  TokenWrapper instance (implementing all 6 interfaces) that contains
  a Token instance. next() and next(Token) exchange the inner Token
  instance as needed. For the new incrementToken(), only one
  TokenWrapper instance is visible, delegating to the correct reusable
  Token. This API also preserves custom Token subclasses that may be
  created by very special token streams (see example in Backwards-Test).

- AttributeImpl now has a default implementation of toString that uses
  reflection to print out the values of the attributes in a default
  formatting. This makes it a bit easier to implement AttributeImpl,
  because toString() was declared abstract before.

- Cloning is now done much more efficiently in
  captureState. The method figures out which unique AttributeImpl
  instances are contained as values in the attributes map, because
  those are the ones that need to be cloned. It creates a single
  linked list that supports deep cloning (in the inner class
  AttributeSource.State). AttributeSource keeps track of when this
  state changes, i.e. whenever new attributes are added to the
  AttributeSource. Only in that case will captureState recompute the
  state, otherwise it will simply clone the precomputed state and
  return the clone. restoreState(AttributeSource.State) walks the
  linked list and uses the copyTo() method of AttributeImpl to copy
  all values over into the attribute that the source stream
  (e.g. SinkTokenizer) uses. 

- Tee- and SinkTokenizer were deprecated, because they use
  Token instances for caching. This is not compatible with the new API
  using AttributeSource.State objects. You can still use the old
  deprecated ones, but new features provided by new Attribute types
  may get lost in the chain. The replacement is a new TeeSinkTokenFilter,
  which has a factory to create new Sink instances that have compatible
  attributes. Sink instances created by one Tee can also be added to
  another Tee, as long as the attribute implementations are compatible
  (it is not possible to add a sink from a tee using one Token instance
  to a tee using the six separate attribute impls); in this case a UOE is thrown.
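
A usage sketch of the replacement (method names as described above; treat as 
illustrative):

{code:java}
// Sketch: one source stream feeding two sinks with compatible attributes.
TeeSinkTokenFilter tee =
    new TeeSinkTokenFilter(new WhitespaceTokenizer(reader));
TokenStream sink1 = tee.newSinkTokenStream(); // factory creates compatible sinks
TokenStream sink2 = tee.newSinkTokenStream();
// consume "tee" first (e.g. index a field with it); the sinks then replay
// the captured AttributeSource.State objects.
{code}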

The cloning performance can be greatly improved if multiple
AttributeImpl instances are not used in one TokenStream. A user can
e.g. simply add a Token instance to the stream instead of the individual
attributes. Or the user could implement a subclass of AttributeImpl that
implements exactly the Attribute interfaces needed. I think this
should be considered an expert API (addAttributeImpl), as this manual
optimization is only needed if cloning performance is crucial. I ran
some quick performance tests using Tee/Sink tokenizers (which do
cloning) and the performance was roughly 20% faster with the new
API. I'll run some more performance tests and post more numbers then.

Note also that when we add serialization to the Attributes, e.g. for
supporting storing serialized TokenStreams in the index, then the
serialization should benefit even significantly more from the new API
than cloning. 

This issue contains one backwards-compatibility break:
TokenStreams/Filters/Tokenizers should normally be final
(see LUCENE-1753 for the explanation). Some of these core classes are 
not final, so one could override the next() or next(Token) methods.
In this case, the backwards-wrapper would automatically use
incrementToken(), because it 

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

2009-07-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734063#action_12734063
 ] 

Uwe Schindler commented on LUCENE-1448:
---

This is not the only problem with multiple Tees: the offsets are also 
completely mixed together, especially if the two tees feed into the sink at the 
same time (not one after the other). In my opinion, the last call to end() should 
be cached by the sink as the end state (so if two tees add an end state to the 
sink, the second one overwrites the first one).

 add getFinalOffset() to TokenStream
 ---

 Key: LUCENE-1448
 URL: https://issues.apache.org/jira/browse/LUCENE-1448
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Reporter: Michael McCandless
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
 LUCENE-1448.patch


 If you add multiple Fieldable instances for the same field name to a 
 document, and you then index those fields with TermVectors storing offsets, 
 it's very likely the offsets for all but the first field instance will be 
 wrong.
 This is because IndexWriter under the hood adds a cumulative base to the 
 offsets of each field instance, where that base is 1 + the endOffset of the 
 last token it saw when analyzing that field.
 But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
 is being used, and the text being analyzed ended in 3 whitespace characters, 
 then that information is lost and then next field's offsets are then all 3 
 too small.  Similarly, if a StopFilter appears in the chain, and the last N 
 tokens were stop words, then the base will be 1 + the endOffset of the last 
 non-stopword token.
 To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
 thinking by default it returns -1, which means I don't know so you figure it 
 out, meaning we fallback to the faulty logic we have today.
 This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1448) add getFinalOffset() to TokenStream

2009-07-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734063#action_12734063
 ] 

Uwe Schindler edited comment on LUCENE-1448 at 7/22/09 3:25 AM:


This is not the only problem with multiple Tees: the offsets are also 
completely mixed together, especially if the two tees feed into the sink at the 
same time (not one after the other). In my opinion, the last call to end() should 
be cached by the sink as the end state (so if two tees add an end state to the 
sink, the second one overwrites the first one).

  was (Author: thetaphi):
This is not the only problem with multiple Tees: The offsets are also 
completely mixed together, especially if the two tees feed into the sink at the 
same time (not after each other). In my opinion, the last call to end should be 
cached by the sink as end state (so if two tees add a end state to the tee, the 
second one overwrites the first one).
  
 add getFinalOffset() to TokenStream
 ---

 Key: LUCENE-1448
 URL: https://issues.apache.org/jira/browse/LUCENE-1448
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Reporter: Michael McCandless
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
 LUCENE-1448.patch


 If you add multiple Fieldable instances for the same field name to a 
 document, and you then index those fields with TermVectors storing offsets, 
 it's very likely the offsets for all but the first field instance will be 
 wrong.
 This is because IndexWriter under the hood adds a cumulative base to the 
 offsets of each field instance, where that base is 1 + the endOffset of the 
 last token it saw when analyzing that field.
 But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
 is being used, and the text being analyzed ended in 3 whitespace characters, 
 then that information is lost and then next field's offsets are then all 3 
 too small.  Similarly, if a StopFilter appears in the chain, and the last N 
 tokens were stop words, then the base will be 1 + the endOffset of the last 
 non-stopword token.
 To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
 thinking by default it returns -1, which means I don't know so you figure it 
 out, meaning we fallback to the faulty logic we have today.
 This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-07-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734067#action_12734067
 ] 

Uwe Schindler commented on LUCENE-1644:
---

Sorry that I came back to this issue too late; I am on holiday at the moment.

In my opinion, the Parameter instead of a boolean is a good idea. The latest 
patch is also a good idea; I only have some small problems with it:
- Why did you make so many internal things public? The additional ctor of 
MultiTermQueryWrapperFilter should be package-private or protected (the class is 
not abstract, but should be used like an abstract one, so it must have only 
protected ctors). Only the public subclasses such as TermRangeFilter should have 
public ctors.
- getFilter()/getEnum() should stay protected.
- I do not like the weird caching of Terms. A cleaner API would be a new 
class CachingFilteredTermEnum that can turn on caching for e.g. the first 20 
terms and then reset. In this case, the API would stay clean and the filter 
code would not need to be changed at all (it just harvests the TermEnum, whether 
it is cached or not). I would propose something like: new 
CachingFilteredTermEnum(originalEnum), use it normally, then termEnum.reset() 
to consume again, and termEnum.purgeCache() if caching is no longer needed and 
should be switched off (after the first 25 terms or so). The problem with 
MultiTermQueryWrapperFilter is that the filter is normally stateless (no 
reader or termenum), so normally the method getDocIdSet() would have to get the 
termenum or wrapper in addition to the indexreader. This is not very good (it 
took me some time to understand what you are doing).
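
To make the proposal concrete, usage of such a (purely hypothetical) 
CachingFilteredTermEnum could look like this:

{code:java}
// Hypothetical sketch -- CachingFilteredTermEnum does not exist; this only
// illustrates the API proposed above.
CachingFilteredTermEnum termEnum = new CachingFilteredTermEnum(originalEnum);
int count = 0;
boolean more = termEnum.next();
while (more && count < 25) {
  count++;               // first pass: collect terms, e.g. for a BooleanQuery
  more = termEnum.next();
}
if (more) {
  termEnum.purgeCache(); // too many terms: switch caching off, use a filter
} else {
  termEnum.reset();      // few terms: consume the cached terms again
}
{code}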

 Enable MultiTermQuery's constant score mode to also use BooleanQuery under 
 the hood
 ---

 Key: LUCENE-1644
 URL: https://issues.apache.org/jira/browse/LUCENE-1644
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1644.patch, LUCENE-1644.patch


 When MultiTermQuery is used (via one of its subclasses, eg
 WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
 constant score mode, which pre-builds a filter and then wraps that
 filter as a ConstantScoreQuery.
 If you don't set that, it instead builds a [potentially massive]
 BooleanQuery with one SHOULD clause per term.
 There are some limitations of this approach:
   * The scores returned by the BooleanQuery are often quite
 meaningless to the app, so, one should be able to use a
 BooleanQuery yet get constant scores back.  (Though I vaguely
 remember at least one example someone raised where the scores were
 useful...).
   * The resulting BooleanQuery can easily have too many clauses,
 throwing an extremely confusing exception to newish users.
   * It'd be better to have the freedom to pick "build filter up front" 
 vs "build massive BooleanQuery" when constant scoring is enabled, 
 because they have different performance tradeoffs.
   * In constant score mode, an OpenBitSet is always used, yet for 
 sparse bit sets this does not give good performance.
 I think we could address these issues by giving BooleanQuery a 
 constant score mode, then empower MultiTermQuery (when in constant 
 score mode) to pick & choose whether to use BooleanQuery vs up-front
 filter, and finally empower MultiTermQuery to pick the best (sparse vs
 dense) bit set impl.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-07-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734070#action_12734070
 ] 

Uwe Schindler commented on LUCENE-1644:
---

The biggest problem is that this caching gets completely weird with 
multi-segment indexes:
the rewriting is done on the top-level reader. In this case the boolean query 
would be built and the terms cached. If there are too many terms, it creates a 
filter instance with the cached terms.
The rewritten query is then executed against all sub-readers using the cached 
terms and a fixed term enum. Normally this would create a docidset for the 
current index reader, but the rewrite did it for the top-level index reader - 
the wrong doc ids are returned, and so on. So you cannot reuse the collected 
terms from the rewrite operation in the getDocIdSet calls.

So please turn off this caching entirely! As noted before, the important thing 
is that the filter returned by rewrite is stateless and should not know 
anything about index readers. The index reader passed to getDocIdSet is 
different for non-optimized indexes.

You have seen no tests fail because all RangeQuery tests use optimized indexes.
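
Schematically, the mismatch looks like this (a sketch of the call pattern, not 
actual Lucene code; names are placeholders):

{code:java}
// rewrite() sees only the top-level reader...
Query rewritten = query.rewrite(topLevelReader);
// ...but getDocIdSet() is called once per segment reader, so any terms or
// doc ids collected against the top-level reader are wrong here.
for (IndexReader segment : topLevelReader.getSequentialSubReaders()) {
  DocIdSet docs = filter.getDocIdSet(segment); // must depend only on "segment"
}
{code}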

 Enable MultiTermQuery's constant score mode to also use BooleanQuery under 
 the hood
 ---

 Key: LUCENE-1644
 URL: https://issues.apache.org/jira/browse/LUCENE-1644
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1644.patch, LUCENE-1644.patch


 When MultiTermQuery is used (via one of its subclasses, eg
 WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
 constant score mode, which pre-builds a filter and then wraps that
 filter as a ConstantScoreQuery.
 If you don't set that, it instead builds a [potentially massive]
 BooleanQuery with one SHOULD clause per term.
 There are some limitations of this approach:
   * The scores returned by the BooleanQuery are often quite
 meaningless to the app, so, one should be able to use a
 BooleanQuery yet get constant scores back.  (Though I vaguely
 remember at least one example someone raised where the scores were
 useful...).
   * The resulting BooleanQuery can easily have too many clauses,
 throwing an extremely confusing exception to newish users.
   * It'd be better to have the freedom to pick "build filter up front" 
 vs "build massive BooleanQuery" when constant scoring is enabled, 
 because they have different performance tradeoffs.
   * In constant score mode, an OpenBitSet is always used, yet for 
 sparse bit sets this does not give good performance.
 I think we could address these issues by giving BooleanQuery a 
 constant score mode, then empower MultiTermQuery (when in constant 
 score mode) to pick & choose whether to use BooleanQuery vs up-front
 filter, and finally empower MultiTermQuery to pick the best (sparse vs
 dense) bit set impl.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[ApacheCon US] Travel Assistance

2009-07-22 Thread Grant Ingersoll
The Travel Assistance Committee is taking in applications for those wanting
to attend ApacheCon US 2009 (Oakland), which takes place between the 2nd and
6th November 2009.

The Travel Assistance Committee is looking for people who would like to be
able to attend ApacheCon US 2009 who may need some financial support in
order to get there. There are limited places available, and all applications
will be scored on their individual merit. Applications are open to all open
source developers who feel that their attendance would benefit themselves,
their project(s), the ASF and open source in general.

Financial assistance is available for flights, accommodation, subsistence
and conference fees, either in full or in part, depending on circumstances.

It is intended that all our ApacheCon events are covered, so it may be
prudent for those in Europe and/or Asia to wait until an event closer to
them comes up - you are all welcome to apply for ApacheCon US of course, but
there should be compelling reasons for you to attend an event further away
than your home location for your application to be considered above those
closer to the event location.

More information can be found on the main Apache website at
http://www.apache.org/travel/index.html - where you will also find a link to
the online application and details for submitting.

Applications for travel assistance will open on 27th July 2009
and close on 17th August 2009.

Good luck to all those that will apply.

Regards,

The Travel Assistance Committee



[jira] Commented: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-07-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734096#action_12734096
 ] 

Michael McCandless commented on LUCENE-1644:


bq. The biggest problem is that this caching gets completely weird with 
multi-segment indexes

Right, I caught this as well (there is one test that fails when I forcefully 
swap in constant-boolean-query as the constant score method), and I'm now 
turning off the caching.

I've fixed it locally -- will post a new rev soon.

 Enable MultiTermQuery's constant score mode to also use BooleanQuery under 
 the hood
 ---

 Key: LUCENE-1644
 URL: https://issues.apache.org/jira/browse/LUCENE-1644
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1644.patch, LUCENE-1644.patch


 When MultiTermQuery is used (via one of its subclasses, eg
 WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
 constant score mode, which pre-builds a filter and then wraps that
 filter as a ConstantScoreQuery.
 If you don't set that, it instead builds a [potentially massive]
 BooleanQuery with one SHOULD clause per term.
 There are some limitations of this approach:
   * The scores returned by the BooleanQuery are often quite
 meaningless to the app, so, one should be able to use a
 BooleanQuery yet get constant scores back.  (Though I vaguely
 remember at least one example someone raised where the scores were
 useful...).
   * The resulting BooleanQuery can easily have too many clauses,
 throwing an extremely confusing exception to newish users.
   * It'd be better to have the freedom to pick "build filter up front" 
 vs "build massive BooleanQuery" when constant scoring is enabled, 
 because they have different performance tradeoffs.
   * In constant score mode, an OpenBitSet is always used, yet for 
 sparse bit sets this does not give good performance.
 I think we could address these issues by giving BooleanQuery a 
 constant score mode, then empower MultiTermQuery (when in constant 
 score mode) to pick & choose whether to use BooleanQuery vs up-front
 filter, and finally empower MultiTermQuery to pick the best (sparse vs
 dense) bit set impl.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1460) Change all contrib TokenStreams/Filters to use the new TokenStream API

2009-07-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734097#action_12734097
 ] 

Robert Muir commented on LUCENE-1460:
-

Michael, after 1728 I can take another look at this. The reason is that I 
added some tests to these analyzers and found a bug in the Thai offsets.

When I submitted this, I only duplicated the existing behavior, but I don't 
want to reintroduce the bug into incrementToken().


 Change all contrib TokenStreams/Filters to use the new TokenStream API
 --

 Key: LUCENE-1460
 URL: https://issues.apache.org/jira/browse/LUCENE-1460
 Project: Lucene - Java
  Issue Type: Task
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1460_contrib_partial.txt, LUCENE-1460_core.txt, 
 LUCENE-1460_partial.txt


 Now that we have the new TokenStream API (LUCENE-1422) we should change all 
 contrib modules to use it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Alves updated LUCENE-1486:
---

Attachment: junit_complex_phrase_qp_07_22_2009.patch

I added 2 testcases that return doc 3 but do not make much sense, just to prove 
the point that we need more docs describing the use case for the complex phrase 
qp and defining the subset of the syntax we want to support.

checkMatches("\"(goos~0.5 AND (mike OR smith) AND NOT ( percival AND john) ) vacation\"~3", "3"); // proximity with fuzzy, OR, AND, NOT
checkMatches("\"(goos~0.5 AND (mike OR smith) AND ( percival AND john) ) vacation\"~3", "3"); // proximity with fuzzy, OR, AND


 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept for much of the query 
 parser syntax. Examples from the JUnit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus JUnit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734141#action_12734141
 ] 

Luis Alves edited comment on LUCENE-1486 at 7/22/09 7:55 AM:
-

I added 2 testcases that return doc 3.
These queries do not make much sense;
I added them just to prove the point that we need more information
describing the use case for the complex phrase qp.
We should also define the subset of the syntax we want to support
inside phrases, with well-defined behavior.

checkMatches("\"(goos~0.5 AND (mike OR smith) AND NOT ( percival AND john) ) vacation\"~3", "3"); // proximity with fuzzy, OR, AND, NOT
checkMatches("\"(goos~0.5 AND (mike OR smith) AND ( percival AND john) ) vacation\"~3", "3"); // proximity with fuzzy, OR, AND


  was (Author: lafa):
I added 2 testcases that return doc 3, but do not make much sense just to 
prove the point that we need more docs describing the use case for complex 
phrase qp, and define what is the subset of the supported syntax we want to 
support.

checkMatches("\"(goos~0.5 AND (mike OR smith) AND NOT ( percival AND john) ) vacation\"~3", "3"); // proximity with fuzzy, OR, AND, NOT
checkMatches("\"(goos~0.5 AND (mike OR smith) AND ( percival AND john) ) vacation\"~3", "3"); // proximity with fuzzy, OR, AND

  
 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept for much of the query 
 parser syntax. Examples from the JUnit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus JUnit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734148#action_12734148
 ] 

Mark Harwood commented on LUCENE-1486:
--

I'll try and catch up with some of the issues raised here:

bq. What do you mean on the last check by phrase inside phrase, I don't see any 
phrase inside a phrase

Correct, the inner phrase example was a term, not a phrase. This is perhaps a 
better example:

checkBadQuery("\"jo* \"percival smith\" \""); // phrases inside phrases is bad

bq. I'm trying now to figure out what is supported 

The JUnit test is currently the main form of documentation - unlike the 
XMLQueryParser (which has a DTD), there is no grammar that formally captures the 
logic. 
Here is a basic summary of the syntax supported and how it differs from normal 
non-phrase use of the same operators:

* Wildcard/fuzzy/range clauses can be used to define a phrase element (as 
opposed to simply single terms)
* Brackets are used to group/define the acceptable variations for a given 
phrase element, e.g. "(john OR jonathon) smith"
* AND is irrelevant - there is effectively an implied AND_NEXT_TO binding 
all phrase elements

To move this forward I would suggest we consider one of the following options:

1) Keep in core and improve error reporting and documentation
2) Move into contrib as experimental 
3) Retain in core but simplify it to support only the simplest syntax (as in my 
Britney~ example)
4) Re-engineer the QueryParser.jj to support a formally defined syntax for 
acceptable within-phrase operators, e.g. *, ~, ( )

I think 1) is achievable if we carefully define where the existing parser 
breaks (e.g. ANDs and nested brackets).
2) is unnecessary if we can achieve 1).
3) would be a shame if we lost useful features over some very convoluted edge 
cases.
4) is beyond my JavaCC skills.

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept for much of the query 
 parser syntax. Examples from the JUnit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works
   checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
 Code plus JUnit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2

2009-07-22 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734154#action_12734154
 ] 

Tim Smith commented on LUCENE-1754:
---

Keeping null should be fine, as long as this is documented, all core query 
implementations handle this behavior, and all searcher code handles the null 
return properly.
At that point, NonMatchingScorer could be removed and null returned in its 
place (being package-private, no one writing applications can make any 
assumptions about a NonMatchingScorer being returned).

However, this should also be documented for the rewrite() method (which 
currently looks to always expect a non-null return value); the searcher will 
also throw a NullPointerException if a null query is passed to it.
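
A sketch of the defensive pattern this implies for consumers (illustrative 
only; weight, reader and collector are placeholders):

{code:java}
// Illustrative only: a null Scorer must be treated as "no matches".
Scorer scorer = weight.scorer(reader, true, false);
if (scorer == null) {
  return; // nothing can match in this reader; skip it instead of an NPE
}
scorer.score(collector);
{code}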



 Get rid of NonMatchingScorer from BooleanScorer2
 

 Key: LUCENE-1754
 URL: https://issues.apache.org/jira/browse/LUCENE-1754
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1754.patch


 Over in LUCENE-1614 Mike made a comment about removing NonMatchingScorer 
 from BS2 and returning null in BooleanWeight.scorer(). I've checked and this 
 can be easily done, so I'm going to post a patch shortly. For reference: 
 https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064.
 I've marked the issue as 2.9 just because it's small, and kind of related to 
 all the search enhancements done for 2.9.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Java caching of low-level index data?

2009-07-22 Thread Shai Erera
That's an interesting idea.

I always wonder, however, how much exactly we would gain vs. the effort spent
to develop, debug and maintain it. Just some thoughts that we should
consider regarding this:

* For very large indices, where we think this will generally be good, I
believe it's reasonable to assume that the search index will sit on its own
machine, with its own set of CPUs, RAM and HD. Therefore, given that very
little will run on the OS other than the search index, I assume the OS cache
will be enough (if not better)?

* In other cases, where the search app runs together w/ other apps, I'm not
sure how much we'll gain. I can assume such apps will use a smaller index,
or will not need to support high query load? If so, will they really care
whether we cache their data, vs. the OS?

Like I said, these are just thoughts. I don't mean to cancel the idea w/
them, just to think about how much it will improve performance (vs. maybe even
hurting it?). Often I find that some optimizations that are done will
benefit very large indices. But these usually get their decent share of
resources, and the JVM itself is run w/ a larger heap etc. So these
optimizations turn out not to affect such indices much after all. And for
smaller indices, performance is usually not a problem (well ... they might
just fit entirely in RAM).

Shai

On Wed, Jul 22, 2009 at 6:21 PM, Nigel nigelspl...@gmail.com wrote:

 In discussions of Lucene search performance, the importance of OS caching
 of index data is frequently mentioned.  The typical recommendation is to
 keep plenty of unallocated RAM available (e.g. don't gobble it all up with
 your JVM heap) and try to avoid large I/O operations that would purge the OS
 cache.

 I'm curious if anyone has thought about (or even tried) caching the
 low-level index data in Java, rather than in the OS.  For example, at the
 IndexInput level there could be an LRU cache of byte[] blocks, similar to
 how a RDBMS caches index pages.  (Conveniently, BufferedIndexInput already
 reads in 1k chunks.) You would reverse the advice above and instead make
 your JVM heap as large as possible (or at least large enough to achieve a
 desired speed/space tradeoff).

 This approach seems like it would have some advantages:

 - Explicit control over how much you want cached (adjust your JVM heap and
 cache settings as desired)
 - Cached index data won't be purged by the OS doing other things
 - Index warming might be faster, or at least more predictable

 The obvious disadvantage for some situations is that more RAM would now be
 tied up by the JVM, rather than managed dynamically by the OS.

 Any thoughts?  It seems like this would be pretty easy to implement
 (subclass FSDirectory, return subclass of FSIndexInput that checks the cache
 before reading, cache keyed on filename + position), but maybe I'm
 oversimplifying, and for that matter a similar implementation may already
 exist somewhere for all I know.
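
 A minimal sketch of such a cache (hypothetical class, keyed on file name +
 block position as described):

{code:java}
// Hypothetical sketch -- not an existing Lucene class. An LRU map from
// (file name, block index) to 1k byte[] blocks, as proposed above.
import java.util.LinkedHashMap;
import java.util.Map;

class IndexBlockCache {
  static final int BLOCK_SIZE = 1024;
  private final Map<String, byte[]> lru;

  IndexBlockCache(final int maxBlocks) {
    // access-order LinkedHashMap gives LRU eviction almost for free
    this.lru = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
      protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > maxBlocks;
      }
    };
  }

  synchronized byte[] get(String fileName, long filePos) {
    return lru.get(fileName + "@" + (filePos / BLOCK_SIZE));
  }

  synchronized void put(String fileName, long filePos, byte[] block) {
    lru.put(fileName + "@" + (filePos / BLOCK_SIZE), block);
  }
}
// An FSIndexInput subclass would consult get() in readInternal() before
// hitting the file, and call put() after a miss.
{code}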

 Thanks,
 Chris



[jira] Commented: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2

2009-07-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734164#action_12734164
 ] 

Michael McCandless commented on LUCENE-1754:


I think we should continue to allow scorer() and getDocIdSet() to return null to 
mean no matches, though they are not required to (ie, one cannot assume that a 
non-null return means there are matches).

And we should make this clear in the javadocs.

And remove NonMatchingScorer.
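
For illustration, the calling convention this implies looks roughly like this 
(sketched against the 2.9-era API under discussion; exact signatures may differ):

{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;

class NullSafeSearch {
  // A null scorer means "no matches" on this reader; the converse does not
  // hold - a non-null scorer may still produce zero hits.
  static void search(Weight weight, IndexReader reader, Collector collector)
      throws IOException {
    Scorer scorer = weight.scorer(reader, true, false);
    if (scorer != null) {
      scorer.score(collector);
    }
  }
}
{code}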

 Get rid of NonMatchingScorer from BooleanScorer2
 

 Key: LUCENE-1754
 URL: https://issues.apache.org/jira/browse/LUCENE-1754
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1754.patch


 Over in LUCENE-1614 Mike has made a comment about removing NonMatchingScorer 
 from BS2, and returning null in BooleanWeight.scorer(). I've checked and this 
 can be easily done, so I'm going to post a patch shortly. For reference: 
 https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064.
 I've marked the issue as 2.9 just because it's small, and kind of related to 
 all the search enhancements done for 2.9.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2

2009-07-22 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734168#action_12734168
 ] 

Shai Erera commented on LUCENE-1754:


ok then I'll add a test case to the patch which uses QWF w/ a query whose 
scorer returns null, and then fix IndexSearcher accordingly. And update the 
javadocs as needed.

 Get rid of NonMatchingScorer from BooleanScorer2
 

 Key: LUCENE-1754
 URL: https://issues.apache.org/jira/browse/LUCENE-1754
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1754.patch


 Over in LUCENE-1614 Mike has made a comment about removing NonMatchingScorer 
 from BS2, and returning null in BooleanWeight.scorer(). I've checked and this 
 can be easily done, so I'm going to post a patch shortly. For reference: 
 https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064.
 I've marked the issue as 2.9 just because it's small, and kind of related to 
 all the search enhancements done for 2.9.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Java caching of low-level index data?

2009-07-22 Thread eks dev
imo, it is too low level to do it better than the OS does. I agree, the cache 
unloading effect would be prevented with it, but I am not sure if it brings 
net-net benefit: you would get this problem fixed, but the OS would probably 
kill you anyhow (you took valuable memory from the OS) on queries that miss 
your internal cache...  

We could try to do better if we put more focus on higher levels and do the 
caching there... maybe even cache somehow some CPU work, e.g. keep dense 
Postings in a faster, less compressed format, load the TermDictionary into a 
RAMDirectory and keep the rest on disk.. Ideas in that direction have a better 
chance to bring us forward. Take for example FuzzyQuery: there you can do some 
LRU caching at the Term level and save huge amounts of IO and CPU... 






From: Shai Erera <ser...@gmail.com>
To: java-dev@lucene.apache.org
Sent: Wednesday, 22 July, 2009 17:32:34
Subject: Re: Java caching of low-level index data?


That's an interesting idea.

I always wonder however how much exactly would we gain, vs. the effort spent 
to develop, debug and maintain it. Just some thoughts that we should consider 
regarding this:

* For very large indices, where we think this will generally be good for, I 
believe it's reasonable to assume that the search index will sit on its own 
machine, or set of CPUs, RAM and HD. Therefore given that very few will run on 
the OS other than the search index, I assume the OS cache will be enough (if 
not better)?

* In other cases, where the search app runs together w/ other apps, I'm not 
sure how much we'll gain. I can assume such apps will use a smaller index, or 
will not need to support high query load? If so, will they really care if we 
cache their data, vs. the OS?

Like I said, these are just thoughts. I don't mean to cancel the idea w/ them, 
just to think how much will it improve performance (vs. maybe even hurt it?). 
Often I find it that some optimizations that are done will benefit very large 
indices. But these usually get their decent share of resources, and the JVM 
itself is run w/ larger heap etc. So these optimizations turn out to not 
affect such indices much after all. And for smaller indices, performance is 
usually not a problem (well ... they might just fit entirely in RAM).

Shai


On Wed, Jul 22, 2009 at 6:21 PM, Nigel <nigelspl...@gmail.com> wrote:

In discussions of Lucene search performance, the importance of OS caching of 
index data is frequently mentioned.  The typical recommendation is to keep 
plenty of unallocated RAM available (e.g. don't gobble it all up with your 
JVM heap) and try to avoid large I/O operations that would purge the OS 
cache.

I'm curious if anyone has thought about (or even tried) caching the low-level 
index data in Java, rather than in the OS.  For example, at the IndexInput 
level there could be an LRU cache of byte[] blocks, similar to how a RDBMS 
caches index pages.  (Conveniently, BufferedIndexInput already reads in 1k 
chunks.) You would reverse the advice above and instead make your JVM heap as 
large as possible (or at least large enough to achieve a desired speed/space 
tradeoff). 

This approach seems like it would have some advantages:

- Explicit control over how much you want cached (adjust your JVM heap and 
cache settings as desired)
- Cached index data won't be purged by the OS doing other things

- Index warming might be faster, or at least more predictable

The obvious disadvantage for some situations is that more RAM would now be 
tied up by the JVM, rather than managed dynamically by the OS.

Any thoughts?  It seems like this would be pretty easy to implement (subclass 
FSDirectory, return subclass of FSIndexInput that checks the cache before 
reading, cache keyed on filename + position), but maybe I'm oversimplifying, 
and for that matter a similar implementation may already exist somewhere for 
all I know.

Thanks,
Chris




  

[jira] Commented: (LUCENE-1076) Allow MergePolicy to select non-contiguous merges

2009-07-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734169#action_12734169
 ] 

Michael McCandless commented on LUCENE-1076:


maxDoc() does reflect the number of docs in the index.  It's simply the sum of 
docCount for all segments.  Shuffling the order of the segments, or allowing 
non-contiguous segments to be merged, won't change how maxDoc() is computed.

New docIDs are allocated by incrementing an integer (starting with 0) for the 
buffered docs.  When a segment gets flushed, we reset that to 0.  Ie, docIDs 
are scoped to one segment; they have no context from prior segments.
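
A tiny worked example of that arithmetic (plain illustration, no Lucene API):

{code}
class MaxDocExample {
  public static void main(String[] args) {
    // Two flushed segments with docCounts 3 and 2: maxDoc() is their sum,
    // and each segment numbers its own docs starting from 0.
    int[] segmentDocCounts = { 3, 2 };
    int maxDoc = 0;
    for (int docCount : segmentDocCounts) {
      maxDoc += docCount; // 5, regardless of segment order
    }
    // Segment 0 holds local IDs 0..2, segment 1 holds 0..1; a global ID
    // is the segment's base (sum of prior docCounts) plus the local ID.
    System.out.println("maxDoc = " + maxDoc);
  }
}
{code}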

 Allow MergePolicy to select non-contiguous merges
 -

 Key: LUCENE-1076
 URL: https://issues.apache.org/jira/browse/LUCENE-1076
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.3
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1076.patch


 I started work on this but with LUCENE-1044 I won't make much progress
 on it for a while, so I want to checkpoint my current state/patch.
 For backwards compatibility we must leave the default MergePolicy as
 selecting contiguous merges.  This is necessary because some
 applications rely on temporal monotonicity of doc IDs, which means
 even though merges can re-number documents, the renumbering will
 always reflect the order in which the documents were added to the
 index.
 Still, for those apps that do not rely on this, we should offer a
 MergePolicy that is free to select the best merges regardless of
 whether they are contiguous.  This requires fixing IndexWriter to
 accept such a merge, and, fixing LogMergePolicy to optionally allow
 it the freedom to do so.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1076) Allow MergePolicy to select non-contiguous merges

2009-07-22 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734174#action_12734174
 ] 

Shai Erera commented on LUCENE-1076:


Oh. Thanks for correcting me. In that case, I take what I said back.

I think this together w/ LUCENE-1750 can really help speed up segment merges in 
certain scenarios. Will wait for you to come back to it :)

 Allow MergePolicy to select non-contiguous merges
 -

 Key: LUCENE-1076
 URL: https://issues.apache.org/jira/browse/LUCENE-1076
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.3
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1076.patch


 I started work on this but with LUCENE-1044 I won't make much progress
 on it for a while, so I want to checkpoint my current state/patch.
 For backwards compatibility we must leave the default MergePolicy as
 selecting contiguous merges.  This is necessary because some
 applications rely on temporal monotonicity of doc IDs, which means
 even though merges can re-number documents, the renumbering will
 always reflect the order in which the documents were added to the
 index.
 Still, for those apps that do not rely on this, we should offer a
 MergePolicy that is free to select the best merges regardless of
 whether they are contiguous.  This requires fixing IndexWriter to
 accept such a merge, and, fixing LogMergePolicy to optionally allow
 it the freedom to do so.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-07-22 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734176#action_12734176
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. Hey Mark. Have you made any progress with that?

Apologies, recently the lure of developing apps for the new iPhone has put paid 
to that :)

I'm still happy that the pseudo-code we outlined in the last couple of comments 
is what is needed to finish this.

bq. We can tag team if you want 

Yep, happy to do that. Let me know if you start work, so I don't duplicate 
effort, and I'll do the same.

Cheers
Mark



 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class, but I've not had time 
 as yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.
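 A minimal sketch of the pattern this describes - a shared deadline that 
 low-level reader operations poll (names are illustrative, not the attached 
 classes' actual API):
{code}
class SimpleTimeMonitor {
  private volatile long deadline = Long.MAX_VALUE;

  void start(long budgetMillis) {
    deadline = System.currentTimeMillis() + budgetMillis;
  }

  // Called from wrapped low-level operations (e.g. each termDocs.next())
  // so runaway queries fail fast, not just at the collect stage.
  void check() {
    if (System.currentTimeMillis() > deadline) {
      throw new RuntimeException("reader activity timed out");
    }
  }
}
{code}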

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Java caching of low-level index data?

2009-07-22 Thread Michael McCandless
I think it's a neat idea!

But you are in fact fighting the OS so I'm not sure how well this'll
work in practice.

EG the OS will happily swap out pages from your process if it thinks
you're not using them, so it'd easily swap out your cache in favor of
its own IO cache (this is the swappiness configuration on Linux),
which would then kill performance (take a page hit when you finally
did need to use your cache).  In C (possibly requiring root) you could
wire the pages, but we can't do that from javaland, so it's already
not a fair fight.

Mike

On Wed, Jul 22, 2009 at 11:56 AM, eks dev <eks...@yahoo.co.uk> wrote:
 imo, it is too low level to do it better than OSs. I agree, cache unloading
 effect would be prevented with it, but I am not sure if it brings net-net
 benefit, you would get this problem fixed, but probably OS would kill you
 anyhow (you took valuable memory from OS) on queries that miss your internal
 cache...

 We could try to do better if we put more focus on higher levels and do the
 caching there... maybe even cache somehow some CPU work, e.g.  keep dense
 Postings in faster, less compressed format, load TermDictionary into
 RAMDirectory and keep the rest on disk.. Ideas in that direction have better
 chance to bring us forward. Take for example FuzzyQuery, there you can do
 some LRU caching at Term level and save huge amounts of IO and CPU...




 From: Shai Erera <ser...@gmail.com>
 To: java-dev@lucene.apache.org
 Sent: Wednesday, 22 July, 2009 17:32:34
 Subject: Re: Java caching of low-level index data?

 That's an interesting idea.

 I always wonder however how much exactly would we gain, vs. the effort spent
 to develop, debug and maintain it. Just some thoughts that we should
 consider regarding this:

 * For very large indices, where we think this will generally be good for, I
 believe it's reasonable to assume that the search index will sit on its own
 machine, or set of CPUs, RAM and HD. Therefore given that very few will run
 on the OS other than the search index, I assume the OS cache will be enough
 (if not better)?

 * In other cases, where the search app runs together w/ other apps, I'm not
 sure how much we'll gain. I can assume such apps will use a smaller index,
 or will not need to support high query load? If so, will they really care if
 we cache their data, vs. the OS?

 Like I said, these are just thoughts. I don't mean to cancel the idea w/
 them, just to think how much will it improve performance (vs. maybe even
 hurt it?). Often I find it that some optimizations that are done will
 benefit very large indices. But these usually get their decent share of
 resources, and the JVM itself is run w/ larger heap etc. So these
 optimizations turn out to not affect such indices much after all. And for
 smaller indices, performance is usually not a problem (well ... they might
 just fit entirely in RAM).

 Shai

 On Wed, Jul 22, 2009 at 6:21 PM, Nigel <nigelspl...@gmail.com> wrote:

 In discussions of Lucene search performance, the importance of OS caching
 of index data is frequently mentioned.  The typical recommendation is to
 keep plenty of unallocated RAM available (e.g. don't gobble it all up with
 your JVM heap) and try to avoid large I/O operations that would purge the OS
 cache.

 I'm curious if anyone has thought about (or even tried) caching the
 low-level index data in Java, rather than in the OS.  For example, at the
 IndexInput level there could be an LRU cache of byte[] blocks, similar to
 how a RDBMS caches index pages.  (Conveniently, BufferedIndexInput already
 reads in 1k chunks.) You would reverse the advice above and instead make
 your JVM heap as large as possible (or at least large enough to achieve a
 desired speed/space tradeoff).

 This approach seems like it would have some advantages:

 - Explicit control over how much you want cached (adjust your JVM heap and
 cache settings as desired)
 - Cached index data won't be purged by the OS doing other things
 - Index warming might be faster, or at least more predictable

 The obvious disadvantage for some situations is that more RAM would now be
 tied up by the JVM, rather than managed dynamically by the OS.

 Any thoughts?  It seems like this would be pretty easy to implement
 (subclass FSDirectory, return subclass of FSIndexInput that checks the cache
 before reading, cache keyed on filename + position), but maybe I'm
 oversimplifying, and for that matter a similar implementation may already
 exist somewhere for all I know.

 Thanks,
 Chris




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-07-22 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1644:
---

Attachment: LUCENE-1644.patch

Attached patch: fixed some bugs in the last rev, updated test cases,
javadocs, CHANGES.  I also optimized MultiTermQueryWrapperFilter to
use the bulk-read API from termDocs.

I confirmed all tests pass if I temporarily switch
CONSTANT_SCORE_FILTER_REWRITE to CONSTANT_SCORE_AUTO_REWRITE_DEFAULT.
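
As a rough usage sketch of switching methods, using this patch's naming 
(constants may change before commit):

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.WildcardQuery;

class RewriteModeExample {
  static WildcardQuery filterRewritten() {
    WildcardQuery q = new WildcardQuery(new Term("body", "smi*"));
    // Force the filter-based constant-score rewrite instead of letting
    // CONSTANT_SCORE_AUTO pick between filter and BooleanQuery expansion.
    q.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);
    return q;
  }
}
{code}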

I changed QueryParser to use CONSTANT_SCORE_AUTO for rewrite (it was
previously CONSTANT_FILTER).

I still need to run some perf tests to get a rough sense of decent
defaults for CONSTANT_SCORE_AUTO cutover thresholds.

bq. getFilter()/getEnum should stay protected.

OK I made getEnum protected again.

I had tentatively made it public so that one could create their own
[external] rewrite methods.  But I think (if we leave it protected),
one could still make an inner/nested class that can access getEnum().

Do we even need getFilter()?  I removed it in the patch.
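
For instance, a small subclass can still surface the protected getEnum() to 
custom rewrite code (the class name here is invented for illustration):

{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FilteredTermEnum;
import org.apache.lucene.search.WildcardQuery;

// Protected access works from a subclass, so external rewrite logic can
// live in (or next to) a query subclass like this one.
class InspectableWildcardQuery extends WildcardQuery {
  InspectableWildcardQuery(Term term) {
    super(term);
  }

  FilteredTermEnum termsFor(IndexReader reader) throws IOException {
    return getEnum(reader); // protected in MultiTermQuery, visible here
  }
}
{code}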


 Enable MultiTermQuery's constant score mode to also use BooleanQuery under 
 the hood
 ---

 Key: LUCENE-1644
 URL: https://issues.apache.org/jira/browse/LUCENE-1644
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1644.patch, LUCENE-1644.patch, LUCENE-1644.patch


 When MultiTermQuery is used (via one of its subclasses, eg
 WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
 constant score mode, which pre-builds a filter and then wraps that
 filter as a ConstantScoreQuery.
 If you don't set that, it instead builds a [potentially massive]
 BooleanQuery with one SHOULD clause per term.
 There are some limitations of this approach:
   * The scores returned by the BooleanQuery are often quite
 meaningless to the app, so, one should be able to use a
 BooleanQuery yet get constant scores back.  (Though I vaguely
 remember at least one example someone raised where the scores were
 useful...).
   * The resulting BooleanQuery can easily have too many clauses,
 throwing an extremely confusing exception to newish users.
   * It'd be better to have the freedom to pick build filter up front
 vs build massive BooleanQuery, when constant scoring is enabled,
 because they have different performance tradeoffs.
   * In constant score mode, an OpenBitSet is always used, yet for
 sparse bit sets this does not give good performance.
 I think we could address these issues by giving BooleanQuery a
 constant score mode, then empower MultiTermQuery (when in constant
 score mode) to pick & choose whether to use BooleanQuery vs up-front
 filter, and finally empower MultiTermQuery to pick the best (sparse vs
 dense) bit set impl.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-1076) Allow MergePolicy to select non-contiguous merges

2009-07-22 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1076:
--

Assignee: (was: Michael McCandless)

Unassigning myself.

 Allow MergePolicy to select non-contiguous merges
 -

 Key: LUCENE-1076
 URL: https://issues.apache.org/jira/browse/LUCENE-1076
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.3
Reporter: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1076.patch


 I started work on this but with LUCENE-1044 I won't make much progress
 on it for a while, so I want to checkpoint my current state/patch.
 For backwards compatibility we must leave the default MergePolicy as
 selecting contiguous merges.  This is necessary because some
 applications rely on temporal monotonicity of doc IDs, which means
 even though merges can re-number documents, the renumbering will
 always reflect the order in which the documents were added to the
 index.
 Still, for those apps that do not rely on this, we should offer a
 MergePolicy that is free to select the best merges regardless of
 whether they are contiguous.  This requires fixing IndexWriter to
 accept such a merge, and, fixing LogMergePolicy to optionally allow
 it the freedom to do so.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1076) Allow MergePolicy to select non-contiguous merges

2009-07-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734190#action_12734190
 ] 

Michael McCandless commented on LUCENE-1076:


bq. Will wait for you to come back to it

Feel free to take it, too :)

I think LUCENE-1737 is also very important for speeding up merging, especially 
because it's so unexpected that just adding different fields to your docs, 
or the same fields in different orders, can so severely impact merge 
performance.

 Allow MergePolicy to select non-contiguous merges
 -

 Key: LUCENE-1076
 URL: https://issues.apache.org/jira/browse/LUCENE-1076
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.3
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1076.patch


 I started work on this but with LUCENE-1044 I won't make much progress
 on it for a while, so I want to checkpoint my current state/patch.
 For backwards compatibility we must leave the default MergePolicy as
 selecting contiguous merges.  This is necessary because some
 applications rely on temporal monotonicity of doc IDs, which means
 even though merges can re-number documents, the renumbering will
 always reflect the order in which the documents were added to the
 index.
 Still, for those apps that do not rely on this, we should offer a
 MergePolicy that is free to select the best merges regardless of
 whether they are contiguous.  This requires fixing IndexWriter to
 accept such a merge, and, fixing LogMergePolicy to optionally allow
 it the freedom to do so.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2

2009-07-22 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1754:
---

Attachment: LUCENE-1754.patch

* Added a test case to TestDocIdSet which returns a null DocIdSet and indeed 
IndexSearcher failed.
* Fixed IndexSearcher and all other places in the code which called scorer() or 
getDocIdSet() and could potentially hit NPE.
* Added EmptyDocIdSetIterator for use by classes (such as ChainFilter) that 
need a DISI, but got a null DocIdSet.
* Updated CHANGES.

All search tests pass.
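
A sketch of what such an iterator boils down to (assuming the 2.9 iterator 
methods; the patch's actual class may differ):

{code}
import org.apache.lucene.search.DocIdSetIterator;

// An iterator that is exhausted from the start, for callers that need a
// DISI but were handed a null DocIdSet.
class EmptyDocIdSetIterator extends DocIdSetIterator {
  public int docID() {
    return NO_MORE_DOCS;
  }

  public int nextDoc() {
    return NO_MORE_DOCS;
  }

  public int advance(int target) {
    return NO_MORE_DOCS;
  }
}
{code}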

 Get rid of NonMatchingScorer from BooleanScorer2
 

 Key: LUCENE-1754
 URL: https://issues.apache.org/jira/browse/LUCENE-1754
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1754.patch, LUCENE-1754.patch


 Over in LUCENE-1614 Mike has made a comment about removing NonMatchingScorer 
 from BS2, and returning null in BooleanWeight.scorer(). I've checked and this 
 can be easily done, so I'm going to post a patch shortly. For reference: 
 https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064.
 I've marked the issue as 2.9 just because it's small, and kind of related to 
 all the search enhancements done for 2.9.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Java caching of low-level index data?

2009-07-22 Thread eks dev

this should not be all that difficult to try. I accept it makes sense in some 
cases ... but which ones?
Background: all my attempts to fight the OS went bad :( 

Let us think again about what Mike gave as an example.

You are explicitly deciding that Lucene should get a bigger share of RAM. The 
OS will unload these pages if it needs Lucene's RAM for something else and you 
are not using them. Right?

If something else should get fewer resources, we are on target, but that is the 
end result. For any shared setup where many things run, this decision has its 
consequences: something else is going to be starved. 

The other case, where only Lucene runs: what is the difference if we evict 
unused pages or the OS does it (better control is just what we get as a 
benefit)? This is the case where you are anyhow not in a really comfortable 
situation for real caching; otherwise even greedy OSs wouldn't swap (at least 
in my experience with reasonably configured OSs)... 

after thinking about it again, I would say: yes, there are for sure some cases 
where it helps, but not many, and even in those cases the benefit will be 
small.

I guess :)






- Original Message 
 From: Michael McCandless <luc...@mikemccandless.com>
 To: java-dev@lucene.apache.org
 Sent: Wednesday, 22 July, 2009 18:37:19
 Subject: Re: Java caching of low-level index data?
 
 I think it's a neat idea!
 
 But you are in fact fighting the OS so I'm not sure how well this'll
 work in practice.
 
 EG the OS will happily swap out pages from your process if it thinks
 you're not using them, so it'd easily swap out your cache in favor of
 its own IO cache (this is the swappiness configuration on Linux),
 which would then kill performance (take a page hit when you finally
 did need to use your cache).  In C (possibly requiring root) you could
 wire the pages, but we can't do that from javaland, so it's already
 not a fair fight.
 
 Mike
 
 On Wed, Jul 22, 2009 at 11:56 AM, eks dev wrote:
  imo, it is too low level to do it better than OSs. I agree, cache unloading
  effect would be prevented with it, but I am not sure if it brings net-net
  benefit, you would get this problem fixed, but probably OS would kill you
  anyhow (you took valuable memory from OS) on queries that miss your internal
  cache...
 
  We could try to do better if we put more focus on higher levels and do the
  caching there... maybe even cache somehow some CPU work, e.g.  keep dense
  Postings in faster, less compressed format, load TermDictionary into
  RAMDirectory and keep the rest on disk.. Ideas in that direction have better
  chance to bring us forward. Take for example FuzzyQuery, there you can do
  some LRU caching at Term level and save huge amounts of IO and CPU...
 
 
 
 
  From: Shai Erera <ser...@gmail.com>
  To: java-dev@lucene.apache.org
  Sent: Wednesday, 22 July, 2009 17:32:34
  Subject: Re: Java caching of low-level index data?
 
  That's an interesting idea.
 
  I always wonder however how much exactly would we gain, vs. the effort spent
  to develop, debug and maintain it. Just some thoughts that we should
  consider regarding this:
 
  * For very large indices, where we think this will generally be good for, I
  believe it's reasonable to assume that the search index will sit on its own
  machine, or set of CPUs, RAM and HD. Therefore given that very few will run
  on the OS other than the search index, I assume the OS cache will be enough
  (if not better)?
 
  * In other cases, where the search app runs together w/ other apps, I'm not
  sure how much we'll gain. I can assume such apps will use a smaller index,
  or will not need to support high query load? If so, will they really care if
  we cache their data, vs. the OS?
 
  Like I said, these are just thoughts. I don't mean to cancel the idea w/
  them, just to think how much will it improve performance (vs. maybe even
  hurt it?). Often I find it that some optimizations that are done will
  benefit very large indices. But these usually get their decent share of
  resources, and the JVM itself is run w/ larger heap etc. So these
  optimizations turn out to not affect such indices much after all. And for
  smaller indices, performance is usually not a problem (well ... they might
  just fit entirely in RAM).
 
  Shai
 
  On Wed, Jul 22, 2009 at 6:21 PM, Nigel wrote:
 
  In discussions of Lucene search performance, the importance of OS caching
  of index data is frequently mentioned.  The typical recommendation is to
  keep plenty of unallocated RAM available (e.g. don't gobble it all up with
  your JVM heap) and try to avoid large I/O operations that would purge the 
  OS
  cache.
 
  I'm curious if anyone has thought about (or even tried) caching the
  low-level index data in Java, rather than in the OS.  For example, at the
  IndexInput level there could be an LRU cache of byte[] blocks, similar to
  how a RDBMS caches index pages.  (Conveniently, BufferedIndexInput already
  reads in 1k chunks.) You would 

[jira] Commented: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2

2009-07-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734202#action_12734202
 ] 

Michael McCandless commented on LUCENE-1754:


For some reason I can't apply the patch -- I get this:
{code}
$ patch -p0 < /x/tmp/LUCENE-1754.patch.txt 
patching file CHANGES.txt
patch:  malformed patch at line 20: @@ -629,6 +638,11 @@
{code}

 Get rid of NonMatchingScorer from BooleanScorer2
 

 Key: LUCENE-1754
 URL: https://issues.apache.org/jira/browse/LUCENE-1754
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1754.patch, LUCENE-1754.patch


 Over in LUCENE-1614 Mike has made a comment about removing NonMatchingScorer 
 from BS2, and returning null in BooleanWeight.scorer(). I've checked and this 
 can be easily done, so I'm going to post a patch shortly. For reference: 
 https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064.
 I've marked the issue as 2.9 just because it's small, and kind of related to 
 all the search enhancements done for 2.9.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

2009-07-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734292#action_12734292
 ] 

Michael Busch commented on LUCENE-1448:
---

Cool, I will take this approach and submit a patch as soon as LUCENE-1693 is 
committed.

 add getFinalOffset() to TokenStream
 ---

 Key: LUCENE-1448
 URL: https://issues.apache.org/jira/browse/LUCENE-1448
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Reporter: Michael McCandless
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
 LUCENE-1448.patch


 If you add multiple Fieldable instances for the same field name to a 
 document, and you then index those fields with TermVectors storing offsets, 
 it's very likely the offsets for all but the first field instance will be 
 wrong.
 This is because IndexWriter under the hood adds a cumulative base to the 
 offsets of each field instance, where that base is 1 + the endOffset of the 
 last token it saw when analyzing that field.
 But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
 is being used, and the text being analyzed ended in 3 whitespace characters, 
 then that information is lost and the next field's offsets are then all 3 
 too small.  Similarly, if a StopFilter appears in the chain, and the last N 
 tokens were stop words, then the base will be 1 + the endOffset of the last 
 non-stopword token.
 To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
 thinking by default it returns -1, which means "I don't know so you figure it 
 out", meaning we fallback to the faulty logic we have today.
 This has come up several times on the user's list.
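 Worked numbers for the whitespace case above (plain illustration, assuming the 
 correct base should account for the full text length of the instance):
{code}
class OffsetBaseExample {
  public static void main(String[] args) {
    String first = "foo bar   "; // length 10, three trailing spaces
    // WhitespaceAnalyzer's last token "bar" has endOffset 7, so the
    // cumulative base becomes 1 + 7 = 8, while the correct base is
    // 1 + first.length() = 11: the next instance's offsets are 3 too small.
    int faultyBase = 1 + 7;
    int correctBase = 1 + first.length();
    System.out.println(correctBase - faultyBase); // prints 3
  }
}
{code}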

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734296#action_12734296
 ] 

Michael Busch commented on LUCENE-1486:
---

I think the best thing to do here is to define exactly what syntax is supposed 
to be supported (which Mark H. did in his latest comment), and then implement 
the new syntax with the new queryparser. It will enforce correct syntax and 
give meaningful exceptions if a query is entered that is not supported.

I think we can still reuse big portions of Mark's patch: we should be able to 
write a new QueryBuilder that produces the new ComplexPhraseQuery.

Adriano/Luis: how long would it take to implement? Can we contain it for 2.9?

This would mean that these new features would go into contrib in 2.9 as part of 
the new query parser framework, and then be moved to core in 3.0. Also from 3.0 
these new features would then be part of Lucene's main query syntax. Would this 
make sense?

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
 are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
 works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
 works.
   
   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
 phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
 is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
 inside phrases not supported
 Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-07-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734301#action_12734301
 ] 

Uwe Schindler edited comment on LUCENE-1644 at 7/22/09 1:50 PM:


Hi Mike,

patch looks good. I was a little bit confused about the high term number cutoff, 
but it is using Math.max to limit it to the current BooleanQuery max 
clause count.

Some small things:

bq. OK I made getEnum protected again.

...but only in MultiTermQuery itself. Everywhere else (even in the backwards 
compatibility override test [JustCompile]) it is public. And the same should be 
done for incNumberOfTerms (also protected). I think the rewrite method is 
internal to MultiTermQuery and is always implemented in a subclass of MTQ as an 
inner class.

Also, the current singletons are not really singletons, because queries that are 
deserialized will contain instances that are not the singleton instances :) - 
and will therefore fail hashcode/equals tests. The problem behind this: the 
singletons are serializable but do not return themselves from readResolve() 
(not implemented). All singletons that are serializable must implement 
readResolve and return the singleton instance (see the Parameter base class or 
the parser singletons in FieldCache).

The instance in the default Auto RewriteMethod is still modifiable - is this 
intended? One could then modify the defaults by setting properties on this 
instance.
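
The idiom referred to, as a minimal sketch (the class name is invented for 
illustration):

{code}
import java.io.Serializable;

final class SerializableSingleton implements Serializable {
  static final SerializableSingleton INSTANCE = new SerializableSingleton();

  private SerializableSingleton() {
  }

  // Deserialization calls this and uses the returned object, so the
  // singleton property (and ==/equals/hashCode behavior) survives a
  // serialize/deserialize round trip.
  private Object readResolve() {
    return INSTANCE;
  }
}
{code}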

  was (Author: thetaphi):
Hi Mike,

patch looks good. I was a little bit confused about the high term number cut 
off, but it is using Math.max to limit it to the current BooleanQuery max 
clause count.

Some small things:

bq. OK I made getEnum protected again.

...but only in MultiTermQuery itsself. Everywhere else (even in the backwards 
compatibility override test [JustCompile] it is public).

Also the current singletons are not really singletons, because queries that are 
unserialized will contain instances that are not the singleton instances :) - 
and will therefore fail to produce correct hashcode/equals tests. The problem 
behind: The singletons are serializable but do not return itsself in 
readResolve() (not implemented). All singletons that are serializable must 
implement readResolve and return the singleton instance (see Parameter base 
class or the parser singletons in FieldCache).

The instance in the default Auto RewriteMethod is still modifiable. Is this 
correct? So one could modify the defaults by setting properties in this 
instance. Is this correct?
  
 Enable MultiTermQuery's constant score mode to also use BooleanQuery under 
 the hood
 ---

 Key: LUCENE-1644
 URL: https://issues.apache.org/jira/browse/LUCENE-1644
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1644.patch, LUCENE-1644.patch, LUCENE-1644.patch


 When MultiTermQuery is used (via one of its subclasses, eg
 WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
 constant score mode, which pre-builds a filter and then wraps that
 filter as a ConstantScoreQuery.
 If you don't set that, it instead builds a [potentially massive]
 BooleanQuery with one SHOULD clause per term.
 There are some limitations of this approach:
   * The scores returned by the BooleanQuery are often quite
 meaningless to the app, so, one should be able to use a
 BooleanQuery yet get constant scores back.  (Though I vaguely
 remember at least one example someone raised where the scores were
 useful...).
   * The resulting BooleanQuery can easily have too many clauses,
 throwing an extremely confusing exception to newish users.
   * It'd be better to have the freedom to pick build filter up front
 vs build massive BooleanQuery, when constant scoring is enabled,
 because they have different performance tradeoffs.
   * In constant score mode, an OpenBitSet is always used, yet for
 sparse bit sets this does not give good performance.
 I think we could address these issues by giving BooleanQuery a
 constant score mode, then empower MultiTermQuery (when in constant
 score mode) to pick & choose whether to use BooleanQuery vs up-front
 filter, and finally empower MultiTermQuery to pick the best (sparse vs
 dense) bit set impl.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734323#action_12734323
 ] 

Luis Alves commented on LUCENE-1486:


Mark H - 

Question 1)

I also have a question about position. I added docs 5 and 6

  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };

for the test 
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned, or should just doc 2 be returned?
I'm assuming position is always important and doc 5 is supposed to be returned, 
correct?

Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this does not seem to be working

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me.
For the query 
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

it returns 1,2,5 and not 6, but I was only expecting 6 to be returned;
can you describe the behavior here?
It looks like the AND is converted into an OR - is that the case?
What is the behavior you want to implement?




 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
 are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
 works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
 works.
   
   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
 phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
 is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
 inside phrases not supported
 Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734323#action_12734323
 ] 

Luis Alves edited comment on LUCENE-1486 at 7/22/09 2:19 PM:
-

Mark H - 

Question 1)

I also have a question about position. I added a doc 5 and 6
{monospaced}
  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };
{monospaced}

for test 
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned or just doc 2 should be returned,
I'm assuming position is always important and doc 5 is supposed to be returned, 
correct?

Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

returns 1,2,5 and not 6, but I was only expecting 6 to be returned,
Can you describe what is the behavior here.
Looks like the and is converted into a OR.
What is the behavior you want to implement?




  was (Author: lafa):
Mark H - 

Question 1)

I also have a question about position. I added a doc 5 and 6
{{monospaced}}
  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };
{{monospaced}}

for test 
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned or just doc 2 should be returned,
I'm assuming position is always important and doc 5 is supposed to be returned, 
correct?

Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

returns 1,2,5 and not 6, but I was only expecting 6 to be returned,
Can you describe what is the behavior here.
Looks like the and is converted into a OR.
What is the behavior you want to implement?



  
 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
 are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
 works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
 works.
   
   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
 phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
 is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
 inside phrases not supported
 Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734323#action_12734323
 ] 

Luis Alves edited comment on LUCENE-1486 at 7/22/09 2:21 PM:
-

Mark H - 

Question 1)

I also have a question about position. I added a doc 5 and 6
{code:title=TestComplexPhraseQuery.java|borderStyle=solid}
...
  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };
...
{code}

for test 
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned or just doc 2 should be returned,
I'm assuming position is always important and doc 5 is supposed to be returned, 
correct?

Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

returns 1,2,5 and not 6, but I was only expecting 6 to be returned,
Can you describe what is the behavior here.
Looks like the and is converted into a OR.
What is the behavior you want to implement?




  was (Author: lafa):
Mark H - 

Question 1)

I also have a question about position. I added a doc 5 and 6
{monospaced}
  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };
{monospaced}

for test 
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned or just doc 2 should be returned,
I'm assuming position is always important and doc 5 is supposed to be returned, 
correct?

Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

returns 1,2,5 and not 6, but I was only expecting 6 to be returned,
Can you describe what is the behavior here.
Looks like the and is converted into a OR.
What is the behavior you want to implement?



  
 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 junit_complex_phrase_qp_07_21_2009.patch, 
 junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
 LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
 are OK in phrases
   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
 works
   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
 works.
   
   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
 phrase is bad
   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
 is bad
   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
 inside phrases not supported
 Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, 

[jira] Issue Comment Edited: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734323#action_12734323
 ] 

Luis Alves edited comment on LUCENE-1486 at 7/22/09 2:24 PM:
-

Mark H - 

Question 1)

I added a doc 5 and 6
{code:title=TestComplexPhraseQuery.java|borderStyle=solid}
...
  DocData docsContent[] = { new DocData(john smith, 1),
  new DocData(johathon smith, 2),  
  new DocData(john percival smith goes on  a b c vacation, 3),
  new DocData(jackson waits tom, 4),
  new DocData(johathon smith john, 5),
  new DocData(johathon mary gomes smith, 6),
  };
...
{code}

for test 
checkMatches(\(jo* -john) smyth\, 2); // boolean logic with

would document 5 be returned or just doc 2 should be returned,
I'm assuming position is always important and doc 5 is supposed to be returned.
Is this the correct behavior?

Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches(\john -percival\, 1); // not logic doesn't work
// checkMatches(\john (-percival)\, 1); // not logic doesn't work

Question 3)
for the query:
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me.
For the query
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

it returns 1, 2 and 5 and not 6, but I was only expecting 6 to be returned;
it seems like the AND is converted into an OR.
What is the behavior you want to implement?




  

[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734333#action_12734333
 ] 

Luis Alves commented on LUCENE-1486:


Sorry for all the emails, 
I'm still new to JIRA and only now realized that for every edit I make, an email 
is sent.

But now that I've found the preview button, it won't happen again. :)





[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734337#action_12734337
 ] 

Mark Harwood commented on LUCENE-1486:
--

bq. I think it's not a big deal, but I'm just trying to understand and raise a 
probably wrong test.

Granted, the test fails for a reason other than the one for which I wanted it 
to fail. 
We can probably strike the test and leave a note saying phrase-within-a-phrase 
just does not make sense and is not supported.

bq.  Is the operator between 'query' and 'parser' the implicit AND_NEXT_TO or 
the default boolean operator (usually OR)?

In brackets it's an OR - the brackets are used to suggest that the current 
phrase element at position X is composed of some choices that are evaluated as 
a subclause in the same way that in normal query logic sub-clauses are defined 
in brackets e.g. +a +(b OR c). There seems to be a reasonable logic to this.

Ideally the ComplexPhraseQueryParser should explicitly turn this setting on 
while evaluating the bracketed innards of phrases just in case the base class 
has AND as the default.
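
For example, something like this at the call site (a sketch, assuming the 
attached ComplexPhraseQueryParser mirrors QueryParser's setDefaultOperator API; 
names here are illustrative, not a guarantee of the final implementation):

{code:java}
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Pin the default operator to OR before parsing, so bracketed phrase
// innards are ORed even if the application's default is AND.
ComplexPhraseQueryParser qp =
    new ComplexPhraseQueryParser("name", new WhitespaceAnalyzer());
qp.setDefaultOperator(QueryParser.OR_OPERATOR);
Query q = qp.parse("\"(john OR jonathon) smith\"");
{code}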

bq. Mark H, can you please elaborate more on these other operators + - ^ AND && || NOT ! : [ ] { }.

OK, I'll try and deal with them one by one, but these are not necessarily 
definitive answers or guarantees of correctly implemented support:

OR, ||, +, AND, && ... ignored. The implicit operator is AND_NEXT_TO, apart from 
in bracketed sections, where all elements at this level are ORed
^ ... boosts are carried through from TermQuerys to SpanTermQuerys
NOT, ! ... creates SpanNotQueries (see the sketch below)
[ ] { } ... range queries are supported, as are wildcards (*, ?) and fuzzies (~)
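
To illustrate the NOT case, a rough sketch of the kind of span query 
"john -percival" could map to (an illustration, not the parser's literal 
output):

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Spans of "john", minus those overlapping a "john percival" pair.
SpanQuery john = new SpanTermQuery(new Term("name", "john"));
SpanQuery percival = new SpanTermQuery(new Term("name", "percival"));
SpanQuery johnThenPercival =
    new SpanNearQuery(new SpanQuery[] { john, percival }, 0, true);
SpanNotQuery q = new SpanNotQuery(john, johnThenPercival);
{code}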

bq. query: '(john OR jonathon) smith~0.3 order*' order:sell stock market


I'll post the XML query syntax equivalent of what should be parsed here shortly 
(just seen your next comment come in) 








[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734349#action_12734349
 ] 

Mark Harwood commented on LUCENE-1486:
--

{quote}for test checkMatches("\"(jo* -john) smyth\"", "2"); 
would document 5 be returned or just doc 2 should be returned,
{quote}

I presume you mean smith, not smyth, here; otherwise nothing would match? If so, 
doc 5 should match, and position is relevant (subject to slop factors).

{quote}
Question 2)
Should these two queries behave the same when we fix the problem?
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work
{quote}

I suppose there's an open question as to whether the second example is legal 
(the brackets are unnecessary).



{quote}
Question 3)
checkMatches("\"jo* smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.
{quote}

That looks like a bug related to slop factor?

{quote}
Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches(\(jo* AND mary) smith\, 1,2,5); // boolean logic with
{quote}
ANDs are ignored and turned into ORs (see earlier comments) but maybe a query 
parse error should be thrown to emphasise this.
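
Something like this inside the phrase-parsing code, perhaps (a sketch with 
made-up variable names, not actual patch code):

{code:java}
import org.apache.lucene.queryParser.ParseException;

// Hypothetical guard: fail fast instead of silently turning AND into OR.
if (insidePhrase && "AND".equals(currentToken)) {
  throw new ParseException(
      "AND is not supported inside phrases; use OR or omit the operator");
}
{code}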








getTermInfosIndexDivisor deprecated?

2009-07-22 Thread Jason Rutherglen
It's a get method, but the UnsupportedOperationException says "Please
pass termInfosIndexDivisor up-front when opening IndexReader"?  I did
pass it in.  Writing a test case for Solr that checks it.
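
Roughly what the test does (a sketch: the path and divisor value are made up, 
and I'm assuming the 2.9 open() overload that takes the divisor):

{code:java}
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

Directory dir = FSDirectory.open(new File("/path/to/index"));
// Divisor passed up-front at open time, as the exception message asks...
IndexReader reader = IndexReader.open(dir, null, true, 4);
// ...yet reading it back is what throws UnsupportedOperationException.
int divisor = reader.getTermInfosIndexDivisor();
{code}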

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734355#action_12734355
 ] 

Mark Harwood commented on LUCENE-1486:
--

{quote}
query: '(john OR jonathon) smith~0.3 order*' order:sell stock market
{quote}
Would be parsed as follows (shown as equivalent XMLQueryParser syntax)
{code:xml} 
<BooleanQuery>
  <Clause occurs="should">
    <SpanNear>
      <SpanOr>
        <SpanOrTerms>john jonathon</SpanOrTerms>
      </SpanOr>
      <SpanOr>
        <SpanOrTerms>smith smyth</SpanOrTerms>
      </SpanOr>
      <SpanOr>
        <SpanOrTerms>order orders</SpanOrTerms>
      </SpanOr>
    </SpanNear>
  </Clause>
  <Clause occurs="should">
    <TermQuery fieldName="order">sell</TermQuery>
  </Clause>
  <Clause occurs="should">
    <UserQuery>stock market</UserQuery>
  </Clause>
</BooleanQuery>
{code}





Re: getTermInfosIndexDivisor deprecated?

2009-07-22 Thread Michael McCandless
Yeah this was deprecated in LUCENE-1609; I guess we could keep the
getter alive?  I'll reopen it.

Mike

On Wed, Jul 22, 2009 at 6:07 PM, Jason
Rutherglen <jason.rutherg...@gmail.com> wrote:
> It's a get method, but the UnsupportedOperationException says "Please
> pass termInfosIndexDivisor up-front when opening IndexReader"?  I did
> pass it in.  Writing a test case for Solr that checks it.
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Reopened: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead

2009-07-22 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened LUCENE-1609:



Reopening to un-deprecate getTermInfosIndexDivisor.

 Eliminate synchronization contention on initial index reading in 
 TermInfosReader ensureIndexIsRead 
 ---

 Key: LUCENE-1609
 URL: https://issues.apache.org/jira/browse/LUCENE-1609
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9
 Environment: Solr 
 Tomcat 5.5
 Ubuntu 2.6.20-17-generic
 Intel(R) Pentium(R) 4 CPU 2.80GHz, 2Gb RAM
Reporter: Dan Rosher
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1609.patch, LUCENE-1609.patch, LUCENE-1609.patch, 
 LUCENE-1609.patch


 synchronized method ensureIndexIsRead in TermInfosReader causes contention 
 under heavy load.
 Simple to reproduce: e.g. under Solr, with all caches turned off, do a simple 
 range search, e.g. id:[0 TO 99], on even a small index (in my case 28K 
 docs) under a load/stress test application; later, examining the 
 thread dump (kill -3), many threads are blocked on 'waiting for monitor 
 entry' for this method.
 Rather than using double-checked locking, which is known to have issues, this 
 implementation uses a state pattern, where only one thread can move the 
 object from the IndexNotRead state to IndexRead, and in doing so alters the 
 object's behavior, i.e. once the index is loaded, the index no longer needs a 
 synchronized method. 
 In my particular test, this increased throughput at least 30 times.
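
 A minimal sketch of the state-pattern idea described above (hypothetical names 
 and structure; the actual patch differs in detail):

{code:java}
import java.io.IOException;

// Hypothetical sketch (not the actual patch): an object-state switch replaces
// double-checked locking, so steady-state lookups take no monitor at all.
class TermInfosReaderSketch {

  private interface State {
    void ensureIndexIsRead() throws IOException;
  }

  private static final State INDEX_READ = new State() {
    public void ensureIndexIsRead() { /* index loaded: nothing to do */ }
  };

  private volatile State state = new State() {
    public synchronized void ensureIndexIsRead() throws IOException {
      if (state == INDEX_READ) return; // another thread already loaded it
      loadIndex();                     // the expensive one-time read
      state = INDEX_READ;              // later callers hit the lock-free state
    }
  };

  private void loadIndex() throws IOException { /* read term index into memory */ }

  void lookupTerm() throws IOException {
    state.ensureIndexIsRead();         // no synchronization once loaded
  }
}
{code}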

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [ApacheCon US] Travel Assistance

2009-07-22 Thread Chris Hostetter

: Is the assistance restricted to people presenting and committers?

nope...

http://www.apache.org/travel/index.html


-Hoss


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: getTermInfosIndexDivisor deprecated?

2009-07-22 Thread Michael McCandless
OK done.

Mike

On Wed, Jul 22, 2009 at 7:37 PM, Michael
McCandless <luc...@mikemccandless.com> wrote:
> Yeah this was deprecated in LUCENE-1609; I guess we could keep the
> getter alive?  I'll reopen it.
>
> Mike
>
> On Wed, Jul 22, 2009 at 6:07 PM, Jason
> Rutherglen <jason.rutherg...@gmail.com> wrote:
>> It's a get method, but the UnsupportedOperationException says "Please
>> pass termInfosIndexDivisor up-front when opening IndexReader"?  I did
>> pass it in.  Writing a test case for Solr that checks it.
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead

2009-07-22 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1609.


Resolution: Fixed




[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Adriano Crestani (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734398#action_12734398
 ] 

Adriano Crestani commented on LUCENE-1486:
--

{quote}
I propose doing this using using the new QP implementation. (I can write the 
new javacc QP for this)
(this implies that the code will be in contrib in 2.9 and be part of core on 
3.0)
{quote}

That would be good!

{quote}
Granted, the test fails for a reason other than the one for which I wanted it 
to fail.
We can probably strike the test and leave a note saying phrase-within-a-phrase 
just does not make sense and is not supported.
{quote}

Cool, I agree to remove it. But I still don't see how a user can type a phrase 
inside a phrase with the current syntax definition; can you give me an example?

{quote}
In brackets it's an OR - the brackets are used to suggest that the current 
phrase element at position X is composed of some choices that are evaluated as 
a subclause in the same way that in normal query logic sub-clauses are defined 
in brackets e.g. +a +(b OR c). There seems to be a reasonable logic to this.

Ideally the ComplexPhraseQueryParser should explicitly turn this setting on 
while evaluating the bracketed innards of phrases just in case the base class 
has AND as the default.
{quote}

If we use the javacc implementation Luis suggested, we would already have a 
query parser that throws ParseExceptions whenever the user types an AND inside 
a phrase.

{quote}
OR, ||, +, AND, && ... ignored
{quote}

So we should throw an exception if any of these is found inside a phrase. It 
could confuse the user if we just ignore it.

{quote}
Question 2)
Should these two queries behave the same when we fix the problem?
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

I suppose there's an open question as to if the second example is legal (the 
brackets are unnecessary)
{quote}

Yes, the second is unnecessary, but I don't think it's illegal. The user could 
type (smith) outside the phrase; it makes sense to support it inside also.

{quote}
Question 3)
checkMatches("\"jo* smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.

That looks like a bug related to slop factor?
{quote}

I have not checked yet, but I think it's working fine. The slop is how many 
position moves between the terms inside the phrase are allowed for a match. It 
matches doc 6 because the term smith moves twice to the right and matches 
johathon mary gomes smith. Twice = slop 2 :)
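
A rough equivalent of what the phrase boils down to here, to illustrate (a 
sketch assuming jo* has expanded to the single term johathon):

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// "jo* smith"~2 with jo* expanded to johathon: in doc 6
// ("johathon mary gomes smith") the two terms are two positions further
// apart than adjacent, so slop 2 is exactly enough to match.
SpanQuery[] clauses = {
    new SpanTermQuery(new Term("name", "johathon")),
    new SpanTermQuery(new Term("name", "smith"))
};
SpanNearQuery q = new SpanNearQuery(clauses, 2, true); // slop=2, inOrder=true
{code}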

{quote}
ANDs are ignored and turned into ORs (see earlier comments) but maybe a query 
parse error should be thrown to emphasise this.
{quote}

I think we could support AND also. I agree there are few cases where the user 
would use that. It would work as I explained before:

{quote}
What happens if I type "(query AND parser) lucene"? In my point of view it is: 
(query AND parser) AND_NEXT_TO lucene. Which means for me: find any document 
that contains the term 'query' and the term 'parser' in position x, and the 
term 'lucene' in position x+1. Is this the expected behaviour?
{quote}



Re: Lucene 2.9 Again

2009-07-22 Thread Chris Hostetter

: LUCENE-1749 FieldCache introspection API Unassigned 16/Jul/09
: 
:   You have time to work on this Hoss?

i'd have more time if there weren't so many darn solr-user questions that 
no one else answers.

The meat of the patch (adding an API to inspect the cache) could be 
committed as is today -- i just don't know if the API makes sense (needs 
more eyeballs), and the real value add will be getting the sanity testing 
utilities in place ... those are only about half done.

i'll try to work on it more this week(end) but if there isn't any progress 
from me, someone else (ahem: Miller?) should probably prune it down to 
the core function, add whatever javadocs are missing, and commit.

(better to release with a simple inspection API than to delay 
releasing while a fancier inspection API gets hashed out)



-Hoss


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-07-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734411#action_12734411
 ] 

Michael McCandless commented on LUCENE-1644:


bq. I was a little bit confused about the high term number cut off,

Sorry I still need to do some perf testing to pick an appropriate
default here.

bq.  Everywhere else (even in the backwards compatibility override test 
[JustCompile]) it is public.  And the same should be for incNumberOfTerms 
(also protected).

Woops -- I'll fix.  Thanks for catching even though you're on
vacation ;)

bq. Also the current singletons are not really singletons, because queries that 
are deserialized will contain instances that are not the singleton instances

Sigh.  I'll do what FieldCache's parser singletons do.
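
(For reference, a sketch of the readResolve idiom I mean; the class name here 
is made up:)

{code:java}
import java.io.ObjectStreamException;
import java.io.Serializable;

// Illustrative sketch: a serializable singleton that resolves back to its
// canonical instance on deserialization, so '==' checks keep working.
class ScoringBooleanQueryRewrite implements Serializable {
  static final ScoringBooleanQueryRewrite INSTANCE = new ScoringBooleanQueryRewrite();

  private ScoringBooleanQueryRewrite() {}

  // Called by Java serialization on deserialize: swap in the singleton.
  protected Object readResolve() throws ObjectStreamException {
    return INSTANCE;
  }
}
{code}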

bq. The instance in the default Auto RewriteMethod is still modifiable. Is this 
correct?

I was thinking this was OK, ie, you could set the default cutoffs for
anything that used the AUTO_DEFAULT.  But it is static (global), so
that's not great.  I guess I'll make it an anonymous subclass of
ConstantScoreAutoRewrite that disallows changes.


 Enable MultiTermQuery's constant score mode to also use BooleanQuery under 
 the hood
 ---

 Key: LUCENE-1644
 URL: https://issues.apache.org/jira/browse/LUCENE-1644
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1644.patch, LUCENE-1644.patch, LUCENE-1644.patch


 When MultiTermQuery is used (via one of its subclasses, eg
 WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
 constant score mode, which pre-builds a filter and then wraps that
 filter as a ConstantScoreQuery.
 If you don't set that, it instead builds a [potentially massive]
 BooleanQuery with one SHOULD clause per term.
 There are some limitations of this approach:
   * The scores returned by the BooleanQuery are often quite
 meaningless to the app, so, one should be able to use a
 BooleanQuery yet get constant scores back.  (Though I vaguely
 remember at least one example someone raised where the scores were
 useful...).
   * The resulting BooleanQuery can easily have too many clauses,
 throwing an extremely confusing exception to newish users.
   * It'd be better to have the freedom to pick "build filter up front" 
 vs "build massive BooleanQuery", when constant scoring is enabled, 
 because they have different performance tradeoffs.
   * In constant score mode, an OpenBitSet is always used, yet for
 sparse bit sets this does not give good performance.
 I think we could address these issues by giving BooleanQuery a 
 constant score mode, then empower MultiTermQuery (when in constant 
 score mode) to pick & choose whether to use BooleanQuery vs an up-front 
 filter, and finally empower MultiTermQuery to pick the best (sparse vs 
 dense) bit set impl.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1756) contrib/memory: PatternAnalyzerTest is a very, very, VERY, bad unit test

2009-07-22 Thread Hoss Man (JIRA)
contrib/memory: PatternAnalyzerTest is a very, very, VERY, bad unit test


 Key: LUCENE-1756
 URL: https://issues.apache.org/jira/browse/LUCENE-1756
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Reporter: Hoss Man
Priority: Minor


while working on something else i started getting consistent 
IllegalStateExceptions from PatternAnalyzerTest -- but only when running the 
test from the top level.

Digging into the test, i've found numerous things that are very scary...
* instead of using assertions to test that token streams match, it throws an 
IllegalStateException when they don't, and then logs a bunch of info about the 
token streams to System.out -- having assertion messages that tell you 
*exactly* what doesn't match would make a lot more sense (see the sketch after 
this list).
* it builds up a list of files to analyze using paths that it evaluates 
relative to the current working directory -- which means you get different 
files depending on whether you run the tests from the contrib level, or from the 
top level build file
* the list of files it looks for includes: ../../*.txt, ../../*.html, 
../../*.xml ... so not only do you get different results when you run the 
tests in the contrib vs at the top level, but different people running the tests 
via the top level build file will get different results depending on what types 
of text, html, and xml files they happen to have two directories above where 
they checked out lucene.
* the test comments indicate that its purpose is to show that PatternAnalyzer 
produces the same tokens as other analyzers - but point out this will fail for 
WhitespaceAnalyzer because of the 255 character token limit WhitespaceTokenizer 
imposes -- the test then proceeds to compare PatternAnalyzer to 
WhitespaceTokenizer, guaranteeing a test failure for anyone who happens to have 
a text file containing more than 255 characters of non-whitespace in a row 
somewhere in ../../ (in my case: my bookmarks.html file, and the hex encoded 
favicon.gif images)
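
for reference, a sketch of the assertion-based comparison i mean (old-style 
TokenStream API for brevity; the helper name is made up):

{code:java}
import java.io.IOException;
import junit.framework.Assert;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Made-up helper: compare two token streams with real assertion messages
// instead of a bare IllegalStateException and System.out logging.
public static void assertTokenStreamsEqual(TokenStream expected, TokenStream actual)
    throws IOException {
  final Token reusableE = new Token(), reusableA = new Token();
  int pos = 0;
  for (Token e = expected.next(reusableE); e != null; e = expected.next(reusableE)) {
    Token a = actual.next(reusableA);
    Assert.assertNotNull("actual stream exhausted at position " + pos, a);
    Assert.assertEquals("term mismatch at position " + pos, e.term(), a.term());
    pos++;
  }
  Assert.assertNull("actual stream has extra tokens past position " + pos,
      actual.next(reusableA));
}
{code}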


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org