[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

2009-07-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734022#action_12734022
 ] 

Michael Busch commented on LUCENE-1448:
---

OK, I think I have this basically working with the old and new API (including 
the 1693 changes).

The approach I took is fairly simple; it doesn't require adding a new 
Attribute. I added the following method to TokenStream:

{code:java}
  /**
   * This method is called by the consumer after the last token has been
   * consumed, i.e. after {@link #incrementToken()} returned false (using
   * the new TokenStream API) or after {@link #next(Token)} or
   * {@link #next()} returned null (old TokenStream API).
   * 
   * This method can be used to perform any end-of-stream operations, such
   * as setting the final offset of a stream. The final offset of a stream
   * might differ from the offset of the last token, e.g. in case one or
   * more whitespace characters followed the last token but a
   * {@link WhitespaceTokenizer} was used.
   * 
   * @throws IOException
   */
  public void end() throws IOException {
    // do nothing by default
  }
{code}

Then I took Mike's patch and implemented end() in all classes where his patch 
added getFinalOffset(). 
E.g. in CharTokenizer the implementation looks like this:

{code:java}
  public void end() {
    // set final offset
    int finalOffset = input.correctOffset(offset);
    offsetAtt.setOffset(finalOffset, finalOffset);
  }
{code}

I changed DocInverterPerField to call end() after the stream is fully consumed 
and to use what offsetAttribute.endOffset() returns as the final offset.
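
For context, a minimal sketch of how a consumer drives a stream once end() 
exists (illustrative, not the actual DocInverterPerField code; analyzer and 
reader are assumed to be in scope):

{code:java}
TokenStream stream = analyzer.tokenStream("field", reader);
OffsetAttribute offsetAtt =
    (OffsetAttribute) stream.addAttribute(OffsetAttribute.class);
while (stream.incrementToken()) {
  // process the token; offsetAtt.startOffset()/endOffset() as usual
}
stream.end();                             // new: let the stream set its final state
int finalOffset = offsetAtt.endOffset();  // base for the next field instance's offsets
{code}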

I also added all new tests from Mike's latest patch. 
All unit tests, including the new ones, pass, as does test-tag.

I'm not posting a patch yet, because this depends on 1693.

Mike, Uwe, others: could you please review if this approach makes sense?

> add getFinalOffset() to TokenStream
> ---
>
> Key: LUCENE-1448
> URL: https://issues.apache.org/jira/browse/LUCENE-1448
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
> LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a 
> document, and you then index those fields with TermVectors storing offsets, 
> it's very likely the offsets for all but the first field instance will be 
> wrong.
> This is because IndexWriter under the hood adds a cumulative base to the 
> offsets of each field instance, where that base is 1 + the endOffset of the 
> last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
> is being used, and the text being analyzed ended in 3 whitespace characters, 
> then that information is lost and the next field's offsets are then all 3 
> too small.  Similarly, if a StopFilter appears in the chain, and the last N 
> tokens were stop words, then the base will be 1 + the endOffset of the last 
> non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
> thinking by default it returns -1, which means "I don't know so you figure it 
> out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1460) Change all contrib TokenStreams/Filters to use the new TokenStream API

2009-07-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734024#action_12734024
 ] 

Michael Busch commented on LUCENE-1460:
---

Cool! Thanks, Simon.

> Change all contrib TokenStreams/Filters to use the new TokenStream API
> --
>
> Key: LUCENE-1460
> URL: https://issues.apache.org/jira/browse/LUCENE-1460
> Project: Lucene - Java
>  Issue Type: Task
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1460_contrib_partial.txt, LUCENE-1460_core.txt, 
> LUCENE-1460_partial.txt
>
>
> Now that we have the new TokenStream API (LUCENE-1422) we should change all 
> contrib modules to use it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

2009-07-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734023#action_12734023
 ] 

Michael Busch commented on LUCENE-1448:
---

Hmm, one thing I haven't done yet is changing Tee/Sink and CachingTokenFilter.

But it should be simple: CachingTokenFilter.end() should call input.end() when 
it is called for the first time and store the captured state locally as 
finalState. 
Then whenever CachingTokenFilter.end() is called again, it just restores the
finalState.

For Tee/Sink it should work similarly: the tee just puts a finalState into the
sink(s) the first time end() is called, and when end() of a sink is called it
restores the finalState.

Should this work?
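
For reference, a rough sketch of the CachingTokenFilter part described above 
(illustrative only, not from any posted patch; it assumes the proposed end() 
exists on TokenStream):

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

public final class SketchCachingTokenFilter extends TokenFilter {
  private final List<AttributeSource.State> cache = new ArrayList<AttributeSource.State>();
  private Iterator<AttributeSource.State> iterator;
  private AttributeSource.State finalState;  // captured on the first end() call

  public SketchCachingTokenFilter(TokenStream input) {
    super(input);
  }

  public boolean incrementToken() throws IOException {
    if (iterator == null) {
      // first pass: fill the cache from the input stream
      while (input.incrementToken()) {
        cache.add(captureState());
      }
      iterator = cache.iterator();
    }
    if (!iterator.hasNext()) {
      return false;  // cache exhausted
    }
    restoreState(iterator.next());
    return true;
  }

  public void end() throws IOException {
    if (finalState == null) {
      input.end();                  // let the input set its final offset etc.
      finalState = captureState();  // remember it for later calls
    } else {
      restoreState(finalState);     // replay the cached end state
    }
  }

  public void reset() throws IOException {
    if (iterator != null) {
      iterator = cache.iterator();  // replay the cached tokens from the start
    }
  }
}
{code}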

> add getFinalOffset() to TokenStream
> ---
>
> Key: LUCENE-1448
> URL: https://issues.apache.org/jira/browse/LUCENE-1448
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
> LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a 
> document, and you then index those fields with TermVectors storing offsets, 
> it's very likely the offsets for all but the first field instance will be 
> wrong.
> This is because IndexWriter under the hood adds a cumulative base to the 
> offsets of each field instance, where that base is 1 + the endOffset of the 
> last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
> is being used, and the text being analyzed ended in 3 whitespace characters, 
> then that information is lost and the next field's offsets are then all 3 
> too small.  Similarly, if a StopFilter appears in the chain, and the last N 
> tokens were stop words, then the base will be 1 + the endOffset of the last 
> non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
> thinking by default it returns -1, which means "I don't know so you figure it 
> out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

2009-07-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734025#action_12734025
 ] 

Michael Busch commented on LUCENE-1448:
---

Hmm, another reason why I don't like two Tees feeding one Sink:

What is the finalOffset and finalState then?

> add getFinalOffset() to TokenStream
> ---
>
> Key: LUCENE-1448
> URL: https://issues.apache.org/jira/browse/LUCENE-1448
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
> LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a 
> document, and you then index those fields with TermVectors storing offsets, 
> it's very likely the offsets for all but the first field instance will be 
> wrong.
> This is because IndexWriter under the hood adds a cumulative base to the 
> offsets of each field instance, where that base is 1 + the endOffset of the 
> last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
> is being used, and the text being analyzed ended in 3 whitespace characters, 
> then that information is lost and the next field's offsets are then all 3 
> too small.  Similarly, if a StopFilter appears in the chain, and the last N 
> tokens were stop words, then the base will be 1 + the endOffset of the last 
> non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
> thinking by default it returns -1, which means "I don't know so you figure it 
> out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-22 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1693:
--

Description: 
This patch makes the following improvements to AttributeSource and
TokenStream/Filter:

- introduces interfaces for all Attributes. The corresponding
  implementations have the postfix 'Impl', e.g. TermAttribute and
  TermAttributeImpl. AttributeSource now has a factory for creating
  the Attribute instances; the default implementation looks for
  implementing classes with the postfix 'Impl'. Token now implements
  all 6 TokenAttribute interfaces.

- new method added to AttributeSource:
  addAttributeImpl(AttributeImpl). Using reflection it walks up the
  class hierarchy of the passed-in object and finds all interfaces
  that the class or superclasses implement and that extend the
  Attribute interface. It then adds the interface->instance mappings
  to the attribute map for each of the found interfaces.

- removes the set/getUseNewAPI() methods (including the standard
  ones). Instead it is now enough to implement only the new API; if an
  old TokenStream still implements the old API (next()/next(Token)), it
  is wrapped automatically. The delegation path is determined via
  reflection (the patch determines which of the three methods was
  overridden).

- Token is no longer deprecated; instead it implements all 6 standard
  token interfaces (see above). The wrapper for next() and next(Token)
  uses this to automatically map all attribute interfaces to one
  TokenWrapper instance (implementing all 6 interfaces) that contains a
  Token instance. next() and next(Token) exchange the inner Token
  instance as needed. For the new incrementToken(), only one
  TokenWrapper instance is visible, delegating to the correct reusable
  Token. This API also preserves custom Token subclasses that may be
  created by very special token streams (see example in Backwards-Test).

- AttributeImpl now has a default implementation of toString that uses
  reflection to print out the values of the attributes in a default
  formatting. This makes it a bit easier to implement AttributeImpl,
  because toString() was declared abstract before.

- Cloning is now done much more efficiently in
  captureState. The method figures out which unique AttributeImpl
  instances are contained as values in the attributes map, because
  those are the ones that need to be cloned. It creates a single
  linked list that supports deep cloning (in the inner class
  AttributeSource.State). AttributeSource keeps track of when this
  state changes, i.e. whenever new attributes are added to the
  AttributeSource. Only in that case will captureState recompute the
  state, otherwise it will simply clone the precomputed state and
  return the clone. restoreState(AttributeSource.State) walks the
  linked list and uses the copyTo() method of AttributeImpl to copy
  all values over into the attribute that the source stream
  (e.g. SinkTokenizer) uses. 

Cloning performance can be greatly improved if multiple AttributeImpl
instances are not used in one TokenStream. A user can e.g. simply add a
Token instance to the stream instead of the individual attributes, or
implement a subclass of AttributeImpl that implements exactly the
Attribute interfaces needed. I think this (addAttributeImpl) should be
considered an expert API, as this manual optimization is only needed if
cloning performance is crucial. I ran some quick performance tests using
Tee/Sink tokenizers (which do cloning) and the new API was roughly 20%
faster. I'll run some more performance tests and post more numbers then.
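
For illustration, a minimal sketch of that expert optimization, using the 
names this patch describes (addAttributeImpl(), Token implementing the six 
attribute interfaces); details may differ from the final patch:

{code:java}
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class SingleImplStream extends TokenStream {
  private final TermAttribute termAtt;
  private final OffsetAttribute offsetAtt;

  public SingleImplStream() {
    // Expert: register one Token instance up front so that all token
    // attribute interfaces map to the same AttributeImpl; captureState()
    // then has only a single instance to clone per token.
    addAttributeImpl(new Token());
    termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    offsetAtt = (OffsetAttribute) addAttribute(OffsetAttribute.class);
  }

  public boolean incrementToken() {
    // produce tokens via termAtt/offsetAtt as usual; empty stream here
    return false;
  }
}
{code}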

Note also that when we add serialization to the Attributes, e.g. for
supporting storing serialized TokenStreams in the index, then the
serialization should benefit even significantly more from the new API
than cloning. 

This issue contains one backwards-compatibility break:
TokenStreams/Filters/Tokenizers should normally be final (see LUCENE-1753 for 
the explanation). Some of these core classes are not final, so one could 
override the next() or next(Token) methods. In this case the backwards-wrapper 
would automatically use incrementToken(), because it is implemented, so the 
overridden method would never be called. To prevent users from errors that are 
not visible during compilation or testing (the streams just behave wrongly), 
this patch makes all implementation methods final (next(), next(Token), 
incrementToken()) whenever the class itself is not final. This is a BW break, 
but users will clearly see that they have done something unsupported and 
should instead create a custom TokenFilter with their additional 
implementation (rather than extending a core implementation).

For converting further contrib token streams the following procedure should be 
used:

*  rewrite and replace next(Token)/next() implementations with the new API
* if the cl

[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-22 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1693:
--

Description: 
This patch makes the following improvements to AttributeSource and
TokenStream/Filter:

- introduces interfaces for all Attributes. The corresponding
  implementations have the postfix 'Impl', e.g. TermAttribute and
  TermAttributeImpl. AttributeSource now has a factory for creating
  the Attribute instances; the default implementation looks for
  implementing classes with the postfix 'Impl'. Token now implements
  all 6 TokenAttribute interfaces.

- new method added to AttributeSource:
  addAttributeImpl(AttributeImpl). Using reflection it walks up the
  class hierarchy of the passed-in object and finds all interfaces
  that the class or superclasses implement and that extend the
  Attribute interface. It then adds the interface->instance mappings
  to the attribute map for each of the found interfaces.

- removes the set/getUseNewAPI() methods (including the standard
  ones). Instead it is now enough to implement only the new API;
  if an old TokenStream still implements the old API (next()/next(Token)),
  it is wrapped automatically. The delegation path is determined via
  reflection (the patch determines which of the three methods was
  overridden).

- Token is no longer deprecated; instead it implements all 6 standard
  token interfaces (see above). The wrapper for next() and next(Token)
  uses this to automatically map all attribute interfaces to one
  TokenWrapper instance (implementing all 6 interfaces) that contains
  a Token instance. next() and next(Token) exchange the inner Token
  instance as needed. For the new incrementToken(), only one
  TokenWrapper instance is visible, delegating to the correct reusable
  Token. This API also preserves custom Token subclasses that may be
  created by very special token streams (see example in Backwards-Test).

- AttributeImpl now has a default implementation of toString that uses
  reflection to print out the values of the attributes in a default
  formatting. This makes it a bit easier to implement AttributeImpl,
  because toString() was declared abstract before.

- Cloning is now done much more efficiently in
  captureState. The method figures out which unique AttributeImpl
  instances are contained as values in the attributes map, because
  those are the ones that need to be cloned. It creates a single
  linked list that supports deep cloning (in the inner class
  AttributeSource.State). AttributeSource keeps track of when this
  state changes, i.e. whenever new attributes are added to the
  AttributeSource. Only in that case will captureState recompute the
  state, otherwise it will simply clone the precomputed state and
  return the clone. restoreState(AttributeSource.State) walks the
  linked list and uses the copyTo() method of AttributeImpl to copy
  all values over into the attribute that the source stream
  (e.g. SinkTokenizer) uses. 

- Tee- and SinkTokenizer were deprecated, because they use
  Token instances for caching. This is not compatible with the new API
  using AttributeSource.State objects. You can still use the old
  deprecated ones, but new features provided by new Attribute types
  may get lost in the chain. The replacement is a new TeeSinkTokenFilter,
  which has a factory to create new Sink instances that have compatible
  attributes. Sink instances created by one Tee can also be added to
  another Tee, as long as the attribute implementations are compatible
  (it is not possible to add a sink from a tee using one Token instance
  to a tee using the six separate attribute impls); in that case an
  UnsupportedOperationException is thrown.
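
For illustration, a rough usage sketch of the replacement (the factory method 
names newSinkTokenStream()/addSinkTokenStream() are assumptions and may differ 
in the final patch; reader1/reader2 are arbitrary Readers):

{code:java}
TeeSinkTokenFilter tee1 = new TeeSinkTokenFilter(new WhitespaceTokenizer(reader1));
TeeSinkTokenFilter.SinkTokenStream sink = tee1.newSinkTokenStream();

// A sink created by one tee can be added to another tee, as long as the
// attribute implementations are compatible; otherwise UOE is thrown.
TeeSinkTokenFilter tee2 = new TeeSinkTokenFilter(new WhitespaceTokenizer(reader2));
tee2.addSinkTokenStream(sink);
{code}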

Cloning performance can be greatly improved if multiple AttributeImpl
instances are not used in one TokenStream. A user can e.g. simply add a
Token instance to the stream instead of the individual attributes, or
implement a subclass of AttributeImpl that implements exactly the
Attribute interfaces needed. I think this (addAttributeImpl) should be
considered an expert API, as this manual optimization is only needed if
cloning performance is crucial. I ran some quick performance tests using
Tee/Sink tokenizers (which do cloning) and the new API was roughly 20%
faster. I'll run some more performance tests and post more numbers then.

Note also that when we add serialization to the Attributes, e.g. for
supporting storing serialized TokenStreams in the index, then the
serialization should benefit even significantly more from the new API
than cloning. 

This issue contains one backwards-compatibility break:
TokenStreams/Filters/Tokenizers should normally be final
(see LUCENE-1753 for the explanation). Some of these core classes are
not final, so one could override the next() or next(Token) methods.
In this case the backwards-wrapper would automatically use
incrementToken(), because it i

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

2009-07-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734063#action_12734063
 ] 

Uwe Schindler commented on LUCENE-1448:
---

This is not the only problem with multiple Tees: the offsets are also
completely mixed together, especially if the two tees feed into the sink at
the same time (not after each other). In my opinion, the last call to end()
should be cached by the sink as the end state (so if two tees add an end
state to the tee, the second one overwrites the first one).
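
In code, the suggested semantics would amount to roughly this inside the sink 
(illustrative names only, not from a patch):

{code:java}
// Each tee pushes its end state into the sink; the last one wins.
private AttributeSource.State finalState;

void setFinalState(AttributeSource.State state) {
  this.finalState = state;  // a later tee's end state overwrites an earlier one
}

public void end() {
  if (finalState != null) {
    restoreState(finalState);  // replay whatever end state was stored last
  }
}
{code}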

> add getFinalOffset() to TokenStream
> ---
>
> Key: LUCENE-1448
> URL: https://issues.apache.org/jira/browse/LUCENE-1448
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
> LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a 
> document, and you then index those fields with TermVectors storing offsets, 
> it's very likely the offsets for all but the first field instance will be 
> wrong.
> This is because IndexWriter under the hood adds a cumulative base to the 
> offsets of each field instance, where that base is 1 + the endOffset of the 
> last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
> is being used, and the text being analyzed ended in 3 whitespace characters, 
> then that information is lost and the next field's offsets are then all 3 
> too small.  Similarly, if a StopFilter appears in the chain, and the last N 
> tokens were stop words, then the base will be 1 + the endOffset of the last 
> non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
> thinking by default it returns -1, which means "I don't know so you figure it 
> out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1448) add getFinalOffset() to TokenStream

2009-07-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734063#action_12734063
 ] 

Uwe Schindler edited comment on LUCENE-1448 at 7/22/09 3:25 AM:


This is not the only problem with multiple Tees: the offsets are also
completely mixed together, especially if the two tees feed into the sink at
the same time (not after each other). In my opinion, the last call to end()
should be cached by the sink as the end state (so if two tees add an end
state to the sink, the second one overwrites the first one).

  was (Author: thetaphi):
This is not the only problem with multiple Tees: The offsets are also 
completely mixed together, especially if the two tees feed into the sink at the 
same time (not after each other). In my opinion, the last call to end should be 
cached by the sink as end state (so if two tees add a end state to the tee, the 
second one overwrites the first one).
  
> add getFinalOffset() to TokenStream
> ---
>
> Key: LUCENE-1448
> URL: https://issues.apache.org/jira/browse/LUCENE-1448
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
> LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a 
> document, and you then index those fields with TermVectors storing offsets, 
> it's very likely the offsets for all but the first field instance will be 
> wrong.
> This is because IndexWriter under the hood adds a cumulative base to the 
> offsets of each field instance, where that base is 1 + the endOffset of the 
> last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
> is being used, and the text being analyzed ended in 3 whitespace characters, 
> then that information is lost and the next field's offsets are then all 3 
> too small.  Similarly, if a StopFilter appears in the chain, and the last N 
> tokens were stop words, then the base will be 1 + the endOffset of the last 
> non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
> thinking by default it returns -1, which means "I don't know so you figure it 
> out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-07-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734067#action_12734067
 ] 

Uwe Schindler commented on LUCENE-1644:
---

Sorry that I came back to this issue so late; I am on holiday at the moment.

In my opinion, the Parameter instead of boolean is a good idea. The latest
patch is also a good idea; I only have some small problems with it:
- Why did you make so many internal things public? The additional ctor in
MultiTermQueryWrapperFilter should be package-private or protected (the class
is not abstract, but should be used like an abstract class, so it must have
only protected ctors). Only the concrete public subclasses like
TermRangeFilter should have public ctors.
- getFilter()/getEnum() should stay protected.
- I do not like the weird caching of Terms. A cleaner API would be a new
class CachingFilteredTermEnum that can turn on caching for e.g. the first 20
terms and then reset. In this case the API would stay clean and the filter
code would not need to be changed at all (it just harvests the TermEnum,
whether it is cached or not). I would propose something like: new
CachingFilteredTermEnum(originalEnum), use it normally, then termEnum.reset()
to consume again, and termEnum.purgeCache() if caching is no longer needed
and should be switched off (after the first 25 terms or so). The problem with
MultiTermQueryWrapperFilter is that the filter is normally stateless (no
reader or term enum), so normally the method getDocIdSet() would have to get
the term enum or wrapper in addition to the IndexReader. This is not very
good (it took me some time to understand what you are doing).
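
For illustration, a rough sketch of how the proposed (hypothetical, not yet 
existing) CachingFilteredTermEnum could be used; all names here are only what 
the paragraph above suggests:

{code:java}
// Hypothetical API, following the proposal above:
CachingFilteredTermEnum termEnum = new CachingFilteredTermEnum(getEnum(reader));

int count = 0;
while (termEnum.term() != null) {
  // first pass: e.g. try to build a BooleanQuery from the first ~25 terms
  if (++count > 25) {
    termEnum.purgeCache();  // too many terms: stop caching, fall back to the filter
    break;
  }
  if (!termEnum.next()) break;
}

termEnum.reset();  // consume again from the start, replaying the cached terms
{code}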

> Enable MultiTermQuery's constant score mode to also use BooleanQuery under 
> the hood
> ---
>
> Key: LUCENE-1644
> URL: https://issues.apache.org/jira/browse/LUCENE-1644
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1644.patch, LUCENE-1644.patch
>
>
> When MultiTermQuery is used (via one of its subclasses, eg
> WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
> "constant score mode", which pre-builds a filter and then wraps that
> filter as a ConstantScoreQuery.
> If you don't set that, it instead builds a [potentially massive]
> BooleanQuery with one SHOULD clause per term.
> There are some limitations of this approach:
>   * The scores returned by the BooleanQuery are often quite
> meaningless to the app, so, one should be able to use a
> BooleanQuery yet get constant scores back.  (Though I vaguely
> remember at least one example someone raised where the scores were
> useful...).
>   * The resulting BooleanQuery can easily have too many clauses,
> throwing an extremely confusing exception to newish users.
>   * It'd be better to have the freedom to pick "build filter up front"
> vs "build massive BooleanQuery", when constant scoring is enabled,
> because they have different performance tradeoffs.
>   * In constant score mode, an OpenBitSet is always used, yet for
> sparse bit sets this does not give good performance.
> I think we could address these issues by giving BooleanQuery a
> constant score mode, then empower MultiTermQuery (when in constant
> score mode) to pick & choose whether to use BooleanQuery vs up-front
> filter, and finally empower MultiTermQuery to pick the best (sparse vs
> dense) bit set impl.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-07-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734070#action_12734070
 ] 

Uwe Schindler commented on LUCENE-1644:
---

The biggest problem is that this caching gets completely mixed up with
multi-segment indexes:
The rewriting is done on the top-level reader. In this case the boolean query
would be built and the terms cached. If there are too many terms, it creates
a filter instance with the cached terms.
The rewritten query is then executed against all sub-readers using the cached
terms and a fixed term enum. Normally this would create a DocIdSet for the
current index reader, but the rewrite did it for the top-level index reader
-> the wrong doc ids are returned, and so on. So you cannot reuse the
collected terms from the rewrite operation in the getDocIdSet calls.

So please turn off this caching entirely! As noted before, the important
thing is that the filter returned by rewrite is stateless and should not know
anything about index readers. The index reader is passed in getDocIdSet and
is different for non-optimized indexes.

You have seen no tests fail because all RangeQuery tests use optimized
indexes.
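
To make the failure mode concrete, a sketch (variable names assumed;
getSequentialSubReaders() as on current trunk):

{code:java}
// Rewrite enumerates terms against the top-level (multi-segment) reader:
Query rewritten = multiTermQuery.rewrite(topLevelReader);

// ...but the filter is later evaluated once per segment, where doc ids are
// segment-relative. Doc ids collected against topLevelReader are wrong here:
for (IndexReader segment : topLevelReader.getSequentialSubReaders()) {
  DocIdSet docs = filter.getDocIdSet(segment);  // must enumerate terms per segment
}
{code}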

> Enable MultiTermQuery's constant score mode to also use BooleanQuery under 
> the hood
> ---
>
> Key: LUCENE-1644
> URL: https://issues.apache.org/jira/browse/LUCENE-1644
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1644.patch, LUCENE-1644.patch
>
>
> When MultiTermQuery is used (via one of its subclasses, eg
> WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
> "constant score mode", which pre-builds a filter and then wraps that
> filter as a ConstantScoreQuery.
> If you don't set that, it instead builds a [potentially massive]
> BooleanQuery with one SHOULD clause per term.
> There are some limitations of this approach:
>   * The scores returned by the BooleanQuery are often quite
> meaningless to the app, so, one should be able to use a
> BooleanQuery yet get constant scores back.  (Though I vaguely
> remember at least one example someone raised where the scores were
> useful...).
>   * The resulting BooleanQuery can easily have too many clauses,
> throwing an extremely confusing exception to newish users.
>   * It'd be better to have the freedom to pick "build filter up front"
> vs "build massive BooleanQuery", when constant scoring is enabled,
> because they have different performance tradeoffs.
>   * In constant score mode, an OpenBitSet is always used, yet for
> sparse bit sets this does not give good performance.
> I think we could address these issues by giving BooleanQuery a
> constant score mode, then empower MultiTermQuery (when in constant
> score mode) to pick & choose whether to use BooleanQuery vs up-front
> filter, and finally empower MultiTermQuery to pick the best (sparse vs
> dense) bit set impl.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[ApacheCon US] Travel Assistance

2009-07-22 Thread Grant Ingersoll
The Travel Assistance Committee is taking applications from those wanting
to attend ApacheCon US 2009 (Oakland), which takes place between the 2nd
and 6th November 2009.

The Travel Assistance Committee is looking for people who would like to be
able to attend ApacheCon US 2009 and who may need some financial support in
order to get there. There are limited places available, and all
applications will be scored on their individual merit. Applications are
open to all open source developers who feel that their attendance would
benefit themselves, their project(s), the ASF and open source in general.

Financial assistance is available for flights, accommodation, subsistence
and conference fees, either in full or in part, depending on circumstances.

It is intended that all our ApacheCon events are covered, so it may be
prudent for those in Europe and/or Asia to wait until an event closer to
them comes up - you are all welcome to apply for ApacheCon US of course,
but there should be compelling reasons for you to attend an event further
away than your home location for your application to be considered above
those closer to the event location.

More information can be found on the main Apache website at
http://www.apache.org/travel/index.html - where you will also find a link
to the online application and details for submitting.

Applications for travel assistance will open on 27th July 2009 and close
on 17th August 2009.

Good luck to all those that will apply.

Regards,

The Travel Assistance Committee



[jira] Commented: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-07-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734096#action_12734096
 ] 

Michael McCandless commented on LUCENE-1644:


bq. The biggest problem is, that this caching gets completely wired with 
multi-segment indexes

Right, I caught this as well (there is one test that fails when I forcefully 
swap in constant-boolean-query as the constant score method), and I'm now 
turning off the caching.

I've fixed it locally -- will post a new rev soon.

> Enable MultiTermQuery's constant score mode to also use BooleanQuery under 
> the hood
> ---
>
> Key: LUCENE-1644
> URL: https://issues.apache.org/jira/browse/LUCENE-1644
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1644.patch, LUCENE-1644.patch
>
>
> When MultiTermQuery is used (via one of its subclasses, eg
> WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
> "constant score mode", which pre-builds a filter and then wraps that
> filter as a ConstantScoreQuery.
> If you don't set that, it instead builds a [potentially massive]
> BooleanQuery with one SHOULD clause per term.
> There are some limitations of this approach:
>   * The scores returned by the BooleanQuery are often quite
> meaningless to the app, so, one should be able to use a
> BooleanQuery yet get constant scores back.  (Though I vaguely
> remember at least one example someone raised where the scores were
> useful...).
>   * The resulting BooleanQuery can easily have too many clauses,
> throwing an extremely confusing exception to newish users.
>   * It'd be better to have the freedom to pick "build filter up front"
> vs "build massive BooleanQuery", when constant scoring is enabled,
> because they have different performance tradeoffs.
>   * In constant score mode, an OpenBitSet is always used, yet for
> sparse bit sets this does not give good performance.
> I think we could address these issues by giving BooleanQuery a
> constant score mode, then empower MultiTermQuery (when in constant
> score mode) to pick & choose whether to use BooleanQuery vs up-front
> filter, and finally empower MultiTermQuery to pick the best (sparse vs
> dense) bit set impl.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1460) Change all contrib TokenStreams/Filters to use the new TokenStream API

2009-07-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734097#action_12734097
 ] 

Robert Muir commented on LUCENE-1460:
-

Michael, after 1728 I can take another look at this. The reason is that I
added some tests to these analyzers and found a bug in the Thai offsets.

When I submitted this, I only duplicated the existing behavior, but I don't
want to reintroduce the bug into incrementToken().


> Change all contrib TokenStreams/Filters to use the new TokenStream API
> --
>
> Key: LUCENE-1460
> URL: https://issues.apache.org/jira/browse/LUCENE-1460
> Project: Lucene - Java
>  Issue Type: Task
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1460_contrib_partial.txt, LUCENE-1460_core.txt, 
> LUCENE-1460_partial.txt
>
>
> Now that we have the new TokenStream API (LUCENE-1422) we should change all 
> contrib modules to use it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Alves updated LUCENE-1486:
---

Attachment: junit_complex_phrase_qp_07_22_2009.patch

I added 2 test cases that return doc 3 but do not make much sense, just to
prove the point that we need more docs describing the use case for the
complex phrase QP, and to define the subset of the supported syntax we want
to support.

checkMatches("\"(goos~0.5 AND (mike OR smith) AND NOT ( percival AND john) ) vacation\"~3", "3"); // proximity with fuzzy, OR, AND, NOT
checkMatches("\"(goos~0.5 AND (mike OR smith) AND ( percival AND john) ) vacation\"~3", "3"); // proximity with fuzzy, OR, AND


> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734141#action_12734141
 ] 

Luis Alves edited comment on LUCENE-1486 at 7/22/09 7:55 AM:
-

I added 2 test cases that return doc 3.
These queries do not make much sense;
I added them just to prove the point that we need more information
describing the use case for the complex phrase QP.
We should also define a subset of the syntax we want to support inside
phrases, with well-defined behaviors.

checkMatches("\"(goos~0.5 AND (mike OR smith) AND NOT ( percival AND john) ) vacation\"~3", "3"); // proximity with fuzzy, OR, AND, NOT
checkMatches("\"(goos~0.5 AND (mike OR smith) AND ( percival AND john) ) vacation\"~3", "3"); // proximity with fuzzy, OR, AND


  was (Author: lafa):
I added 2 testcases that return doc 3, but do not make much sense just to 
prove the point that we need more docs describing the use case for complex 
phrase qp, and define what is the subset of the supported syntax we want to 
support.

checkMatches("\"(goos~0.5 AND (mike OR smith) AND NOT ( percival AND 
john) ) vacation\"~3","3"); // proximity with fuzzy, OR, AND, NOT
checkMatches("\"(goos~0.5 AND (mike OR smith) AND ( percival AND john) 
) vacation\"~3","3"); // proximity with fuzzy, OR, AND

  
> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734015#action_12734015
 ] 

Luis Alves edited comment on LUCENE-1486 at 7/22/09 7:57 AM:
-

I share the same opinion as Michael:
the implementation has a lot of undefined/undocumented behaviors, simply
because it reuses the query parser to parse the text inside a phrase.
All the Lucene syntax needs to be accounted for in this design, but that
does not seem to be the case.

Problems like Adriano described: phrase inside a phrase, position reporting
for errors.

I also have a lot of concerns about having the full Lucene syntax inside
phrases, and trying to restrict this by throwing exceptions for particular
cases does not seem like the best design.

Here is an example with OR, AND and parentheses combined with a proximity
search:
"(( jakarta OR green) AND (blue AND orange)  AND black~0.5) apache"~10

What should a user expect from this query without looking at the code? I'm
not sure.
Does it even make sense to support this complex syntax? In my opinion, no.

I think we should define the subset of the language we want to support
inside phrases, with a well-defined behavior.
If Mark describes all the syntax he wants to support inside phrases, I
actually don't mind implementing a new parser for this.

My view is that contrib is probably a better place for this code, until we
figure out an implementation that does not impose as many restrictions on
changes to the original query parser and describes a well-defined syntax to
be applied inside phrases.



  was (Author: lafa):
I share same opinion as Michael,
the implementation has a lot of undefined/undocumented behaviors,
simple because it reuses the queryparser to parse the text inside a phrase. 
All the lucene syntax needs to be accounted on this design, but it does not 
seem to be the case.

Problems like Adriano described, phrase inside a phrase, position reporting for 
errors.

I also have a lot of concerns about having the full lucene syntax inside 
phrases 
and trying to restrict this by throwing exceptions for particular cases does 
not seem the best design.

Here is a example of with OR, AND, PARENTESIS with a proximity search
"(( jakarta OR green) AND (blue AND orange)  AND black~2) apache"~10

What should a user expect from this query, without looking at the code. I'm not 
sure.
Does it even make sense to support this complex syntax? In my opinion. no

I think we should define what is the subset of the language we want to support 
inside the phrases with a well defined behavior.
If Mark describes all the syntax he wants to support inside phrases, I actually 
don't mind to implement a new parser.for this.

My view is, contrib is probably a better place to have this code, until we 
figure out a implementation that does not impose as many restrictions on 
changes to the original queryparser and describes a well defined syntax to be 
applied inside phrases.


  
> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-uns

[jira] Commented: (LUCENE-1076) Allow MergePolicy to select non-contiguous merges

2009-07-22 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734146#action_12734146
 ] 

Shai Erera commented on LUCENE-1076:


Thanks for the education everyone.

Mike - it feels to me, even though I can't pinpoint it at the moment
(FieldCache maybe?), that if maxDoc doesn't reflect the number of documents
in the index we'll run into trouble. Therefore I suggest you consider
introducing another numDocs() method which returns the actual number of
documents in the index.

> Allow MergePolicy to select non-contiguous merges
> -
>
> Key: LUCENE-1076
> URL: https://issues.apache.org/jira/browse/LUCENE-1076
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1076.patch
>
>
> I started work on this but with LUCENE-1044 I won't make much progress
> on it for a while, so I want to checkpoint my current state/patch.
> For backwards compatibility we must leave the default MergePolicy as
> selecting contiguous merges.  This is necessary because some
> applications rely on "temporal monotonicity" of doc IDs, which means
> even though merges can re-number documents, the renumbering will
> always reflect the order in which the documents were added to the
> index.
> Still, for those apps that do not rely on this, we should offer a
> MergePolicy that is free to select the best merges regardless of
> whether they are contiguous.  This requires fixing IndexWriter to
> accept such a merge, and fixing LogMergePolicy to optionally allow
> it the freedom to do so.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734148#action_12734148
 ] 

Mark Harwood commented on LUCENE-1486:
--

I'll try and catch up with some of the issues raised here:

bq. What do you mean on the last check by phrase inside phrase, I don't see any 
phrase inside a phrase

Correct, the "inner phrase" example was a term, not a phrase. This is perhaps
a better example:

checkBadQuery("\"jo* \"percival smith\" \""); //phrases inside 
phrases is bad

bq. I'm trying now to figure out what is supported 

The Junit test is currently the main form of documentation - unlike the
XMLQueryParser (which has a DTD), there is no grammar that formally captures
the logic.
Here is a basic summary of the syntax supported and how it differs from normal
non-phrase use of the same operators:

* Wildcard/fuzzy/range clauses can be used to define a phrase element (as 
opposed to simply single terms)
* Brackets are used to group/define the acceptable variations for a given 
phrase element  e.g. "(john OR jonathon) smith" 
* "AND" is irrelevant - there is effectively an implied "AND_NEXT_TO" binding 
all phrase elements 

To move this forward I would suggest we consider one of these options:

1) Keep in core and improve error reporting and documentation
2) Move into "contrib" as experimental 
3) Retain in core but simplify it to support only the simplest syntax (as in my 
Britney~ example)
4) Re-engineer the QueryParser.jj to support a formally defined syntax for 
acceptable "within phrase" operators e.g. *, ~, ( ) 

I think 1) is achievable if we carefully define where the existing parser
breaks (e.g. ANDs and nested brackets).
2) is unnecessary if we can achieve 1).
3) would be a shame if we lost useful features over some very convoluted edge
cases.
4) is beyond my JavaCC skills.

> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Java caching of low-level index data?

2009-07-22 Thread Nigel
In discussions of Lucene search performance, the importance of OS caching of
index data is frequently mentioned.  The typical recommendation is to keep
plenty of unallocated RAM available (e.g. don't gobble it all up with your
JVM heap) and try to avoid large I/O operations that would purge the OS
cache.

I'm curious if anyone has thought about (or even tried) caching the
low-level index data in Java, rather than in the OS.  For example, at the
IndexInput level there could be an LRU cache of byte[] blocks, similar to
how a RDBMS caches index pages.  (Conveniently, BufferedIndexInput already
reads in 1k chunks.) You would reverse the advice above and instead make
your JVM heap as large as possible (or at least large enough to achieve a
desired speed/space tradeoff).

This approach seems like it would have some advantages:

- Explicit control over how much you want cached (adjust your JVM heap and
cache settings as desired)
- Cached index data won't be purged by the OS doing other things
- Index warming might be faster, or at least more predictable

The obvious disadvantage for some situations is that more RAM would now be
tied up by the JVM, rather than managed dynamically by the OS.

Any thoughts?  It seems like this would be pretty easy to implement
(subclass FSDirectory, return subclass of FSIndexInput that checks the cache
before reading, cache keyed on filename + position), but maybe I'm
oversimplifying, and for that matter a similar implementation may already
exist somewhere for all I know.

Thanks,
Chris


[jira] Commented: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2

2009-07-22 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734149#action_12734149
 ] 

Shai Erera commented on LUCENE-1754:


Any thoughts on that?

If we keep null, then I'll fix IndexSearcher to check whether
filter.getDocIdSet() returned null, and if so, skip executing the query.

I'd like to move on with this, if we have some sort of consensus.
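
In code, the suggested IndexSearcher change would be roughly this
(illustrative, not a patch):

{code:java}
// Treat a null DocIdSet from the filter as "matches nothing".
DocIdSet docIdSet = filter.getDocIdSet(reader);
if (docIdSet == null) {
  return;  // no document can pass the filter; skip executing the query
}
DocIdSetIterator iterator = docIdSet.iterator();
{code}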

> Get rid of NonMatchingScorer from BooleanScorer2
> 
>
> Key: LUCENE-1754
> URL: https://issues.apache.org/jira/browse/LUCENE-1754
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1754.patch
>
>
> Over in LUCENE-1614 Mike has made a comment about removing NonMatchingScorer 
> from BS2, and returning null in BooleanWeight.scorer(). I've checked and this 
> can be easily done, so I'm going to post a patch shortly. For reference: 
> https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064.
> I've marked the issue as 2.9 just because it's small, and kind of related to 
> all the search enhancements done for 2.9.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734150#action_12734150
 ] 

Mark Miller commented on LUCENE-1486:
-

My first thought is, if we can address some of the issues brought up, there is 
no reason to keep this out of core IMHO.

My second thought is, I have a feeling a lot of this concern stems from the 
fact that these guys (or one of them) have to duplicate this thing with the 
QueryParser code in contrib. That could be reason enough to move it to contrib. 
But it doesn't solve the issue longer term when the old QueryParser is removed. 
It would need to be replaced then, or dropped from contrib.

With the new info from Mark H, how hard would it be to create a new impl for the 
new parser that did a lot of this, in a more defined way? It seems you 
basically just want to be able to use multiterm queries and group/or things, 
right? We could even relax a little if we have to. This hasn't been released, 
so there is still a lot of wiggle room I think. But there does have to be a 
resolution with this and the new parser at some point either way.

> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2

2009-07-22 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734154#action_12734154
 ] 

Tim Smith commented on LUCENE-1754:
---

Keeping null should be fine, as long as this is documented, all core query 
implementations handle this behavior, and all searcher code handles the null 
return properly.
At this point, NonMatchingScorer could be removed and null returned in its 
place (being package private, no one writing applications can make any 
assumptions about a NonMatchingScorer being returned).

However, this should also be documented for the rewrite() method (which 
currently looks to always expect a non-null return value); also, the searcher 
will throw a NullPointerException if a null query is passed to it. 
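
For illustration, the consumer-side handling would be just a null check. A
sketch against the 2.9-era Weight/Scorer API as I understand it (the
scorer(reader, scoreDocsInOrder, topScorer) signature is an assumption here;
this is not the actual IndexSearcher code):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;

class NullSafeSearch {
  // Sketch only: a null scorer means "no matches in this reader", so the
  // consumer simply skips scoring instead of hitting an NPE.
  static void search(Weight weight, IndexReader reader, Collector collector)
      throws IOException {
    Scorer scorer = weight.scorer(reader, true, false);
    if (scorer == null) {
      return; // nothing to score
    }
    scorer.score(collector);
  }
}
{code}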



> Get rid of NonMatchingScorer from BooleanScorer2
> 
>
> Key: LUCENE-1754
> URL: https://issues.apache.org/jira/browse/LUCENE-1754
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1754.patch
>
>
> Over in LUCENE-1614 Mike has made a comment about removing NonMatchingScorer 
> from BS2, and returning null in BooleanWeight.scorer(). I've checked and this 
> can be easily done, so I'm going to post a patch shortly. For reference: 
> https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064.
> I've marked the issue as 2.9 just because it's small, and kind of related to 
> all the search enhancements done for 2.9.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Java caching of low-level index data?

2009-07-22 Thread Shai Erera
That's an interesting idea.

I always wonder, however, how much exactly we would gain vs. the effort spent
to develop, debug and maintain it. Just some thoughts that we should
consider regarding this:

* For very large indices, which is where we think this will generally help, I
believe it's reasonable to assume that the search index will sit on its own
machine, with its own set of CPUs, RAM and HD. Therefore, given that very
little other than the search index will run on the OS, I assume the OS cache
will be enough (if not better)?

* In other cases, where the search app runs together w/ other apps, I'm not
sure how much we'll gain. I can assume such apps will use a smaller index,
or will not need to support high query load? If so, will they really care if
we cache their data, vs. the OS?

Like I said, these are just thoughts. I don't mean to cancel the idea w/
them, just to think about how much it will improve performance (vs. maybe even
hurt it?). Often I find that some optimizations that are done will
benefit very large indices. But these usually get their decent share of
resources, and the JVM itself is run w/ larger heap etc. So these
optimizations turn out to not affect such indices much after all. And for
smaller indices, performance is usually not a problem (well ... they might
just fit entirely in RAM).

Shai

On Wed, Jul 22, 2009 at 6:21 PM, Nigel  wrote:

> In discussions of Lucene search performance, the importance of OS caching
> of index data is frequently mentioned.  The typical recommendation is to
> keep plenty of unallocated RAM available (e.g. don't gobble it all up with
> your JVM heap) and try to avoid large I/O operations that would purge the OS
> cache.
>
> I'm curious if anyone has thought about (or even tried) caching the
> low-level index data in Java, rather than in the OS.  For example, at the
> IndexInput level there could be an LRU cache of byte[] blocks, similar to
> how an RDBMS caches index pages.  (Conveniently, BufferedIndexInput already
> reads in 1k chunks.) You would reverse the advice above and instead make
> your JVM heap as large as possible (or at least large enough to achieve a
> desired speed/space tradeoff).
>
> This approach seems like it would have some advantages:
>
> - Explicit control over how much you want cached (adjust your JVM heap and
> cache settings as desired)
> - Cached index data won't be purged by the OS doing other things
> - Index warming might be faster, or at least more predictable
>
> The obvious disadvantage for some situations is that more RAM would now be
> tied up by the JVM, rather than managed dynamically by the OS.
>
> Any thoughts?  It seems like this would be pretty easy to implement
> (subclass FSDirectory, return subclass of FSIndexInput that checks the cache
> before reading, cache keyed on filename + position), but maybe I'm
> oversimplifying, and for that matter a similar implementation may already
> exist somewhere for all I know.
>
> Thanks,
> Chris
>


[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-07-22 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734156#action_12734156
 ] 

Shai Erera commented on LUCENE-1720:


Hey Mark. Have you made any progress with that? We can tag team if you want.

> TimeLimitedIndexReader and associated utility class
> ---
>
> Key: LUCENE-1720
> URL: https://issues.apache.org/jira/browse/LUCENE-1720
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Attachments: ActivityTimedOutException.java, 
> ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
> TimeLimitedIndexReader.java
>
>
> An alternative to TimeLimitedCollector that has the following advantages:
> 1) Any reader activity can be time-limited rather than just single searches 
> e.g. the document retrieve phase.
> 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
> before last "collect" stage of query processing)
> Uses new utility timeout class that is independent of IndexReader.
> Initial contribution includes a performance test class but not had time as 
> yet to work up a formal Junit test.
> TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2

2009-07-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734164#action_12734164
 ] 

Michael McCandless commented on LUCENE-1754:


I think we continue to allow scorer() and getDocIdSet() to return null to mean 
"no matches", though they are not required to (ie, one cannot assume that a 
non-null return means there are matches).

And we should make this clear in the javadocs.

And remove NonMatchingScorer.

> Get rid of NonMatchingScorer from BooleanScorer2
> 
>
> Key: LUCENE-1754
> URL: https://issues.apache.org/jira/browse/LUCENE-1754
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1754.patch
>
>
> Over in LUCENE-1614 Mike has made a comment about removing NonMatchingScorer 
> from BS2, and returning null in BooleanWeight.scorer(). I've checked and this 
> can be easily done, so I'm going to post a patch shortly. For reference: 
> https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064.
> I've marked the issue as 2.9 just because it's small, and kind of related to 
> all the search enhancements done for 2.9.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2

2009-07-22 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734168#action_12734168
 ] 

Shai Erera commented on LUCENE-1754:


OK, then I'll add a test case to the patch which uses QWF w/ a query whose 
scorer returns null, then fix IndexSearcher accordingly and update the 
javadocs as needed.

> Get rid of NonMatchingScorer from BooleanScorer2
> 
>
> Key: LUCENE-1754
> URL: https://issues.apache.org/jira/browse/LUCENE-1754
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1754.patch
>
>
> Over in LUCENE-1614 Mike has made a comment about removing NonMatchingScorer 
> from BS2, and returning null in BooleanWeight.scorer(). I've checked and this 
> can be easily done, so I'm going to post a patch shortly. For reference: 
> https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064.
> I've marked the issue as 2.9 just because it's small, and kind of related to 
> all the search enhancements done for 2.9.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Java caching of low-level index data?

2009-07-22 Thread eks dev
IMO, it is too low-level to do better than the OS. I agree, the cache-unloading 
effect would be prevented with it, but I am not sure it brings a net-net 
benefit: you would get this problem fixed, but the OS would probably kill you 
anyhow (you took valuable memory from the OS) on queries that miss your internal 
cache...  

We could try to do better if we put more focus on higher levels and do the 
caching there... maybe even cache somehow some CPU work, e.g. keep dense 
Postings in a "faster, less compressed" format, load the TermDictionary into a 
RAMDirectory and keep the rest on disk... Ideas in that direction have a better 
chance of bringing us forward. Take for example FuzzyQuery: there you can do some 
LRU caching at the Term level and save huge amounts of IO and CPU... 
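
To make the FuzzyQuery idea concrete, here is a hedged sketch (a hypothetical
helper, not an existing Lucene API) that caches the result of the expensive
rewrite per term:

{code:java}
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;

// Sketch: cache FuzzyQuery.rewrite() output per term, so a repeated fuzzy
// term skips re-enumerating the term dictionary.  Hypothetical helper; a
// real one must be invalidated whenever the IndexReader changes.
class FuzzyRewriteCache {
  private final Map<Term, Query> cache =
      new LinkedHashMap<Term, Query>(16, 0.75f, true) {
        protected boolean removeEldestEntry(Map.Entry<Term, Query> eldest) {
          return size() > 512; // keep the 512 most recently used rewrites
        }
      };

  synchronized Query rewrite(IndexReader reader, Term term) throws IOException {
    Query q = cache.get(term);
    if (q == null) {
      q = new FuzzyQuery(term).rewrite(reader); // the expensive enumeration
      cache.put(term, q);
    }
    return q;
  }
}
{code}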





>
>From: Shai Erera 
>To: java-dev@lucene.apache.org
>Sent: Wednesday, 22 July, 2009 17:32:34
>Subject: Re: Java caching of low-level index data?
>
>
>That's an interesting idea.
>
>I always wonder however how much exactly would we gain, vs. the effort spent 
>to develop, debug and maintain it. Just some thoughts that we should consider 
>regarding this:
>
>* For very large indices, where we think this will generally be good for, I 
>believe it's reasonable to assume that the search index will sit on its own 
>machine, or set of CPUs, RAM and HD. Therefore given that very few will run on 
>the OS other than the search index, I assume the OS cache will be enough (if 
>not better)?
>
>* In other cases, where the search app runs together w/ other apps, I'm not 
>sure how much we'll gain. I can assume such apps will use a smaller index, or 
>will not need to support high query load? If so, will they really care if we 
>cache their data, vs. the OS?
>
>Like I said, these are just thoughts. I don't mean to cancel the idea w/ them, 
>just to think how much will it improve performance (vs. maybe even hurt it?). 
>Often I find it that some optimizations that are done will benefit very large 
>indices. But these usually get their decent share of resources, and the JVM 
>itself is run w/ larger heap etc. So these optimizations turn out to not 
>affect such indices much after all. And for smaller indices, performance is 
>usually not a problem (well ... they might just fit entirely in RAM).
>
>Shai
>
>
>On Wed, Jul 22, 2009 at 6:21 PM, Nigel  wrote:
>
>>>In discussions of Lucene search performance, the importance of OS caching of 
>>>index data is frequently mentioned.  The typical recommendation is to keep 
>>>plenty of unallocated RAM available (e.g. don't gobble it all up with your 
>>>JVM heap) and try to avoid large I/O operations that would purge the OS 
>>>cache.
>>
>>I'm curious if anyone has thought about (or even tried) caching the low-level 
>>index data in Java, rather than in the OS.  For example, at the IndexInput 
>>level there could be an LRU cache of byte[] blocks, similar to how an RDBMS 
>>caches index pages.  (Conveniently, BufferedIndexInput already reads in 1k 
>>chunks.) You would reverse the advice above and instead make your JVM heap as 
>>large as possible (or at least large enough to achieve a desired speed/space 
>>tradeoff). 
>>
>>This approach seems like it would have some advantages:
>>
>>- Explicit control over how much you want cached (adjust your JVM heap and 
>>cache settings as desired)
>>- Cached index data won't be purged by the OS doing other things

>>- Index warming might be faster, or at least more predictable
>>
>>The obvious disadvantage for some situations is that more RAM would now be 
>>tied up by the JVM, rather than managed dynamically by the OS.
>>
>>Any thoughts?  It seems like this would be pretty easy to implement (subclass 
>>FSDirectory, return subclass of FSIndexInput that checks the cache before 
>>reading, cache keyed on filename + position), but maybe I'm oversimplifying, 
>>and for that matter a similar implementation may already exist somewhere for 
>>all I know.
>>
>>Thanks,
>>Chris
>>
>


  

[jira] Commented: (LUCENE-1076) Allow MergePolicy to select non-contiguous merges

2009-07-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734169#action_12734169
 ] 

Michael McCandless commented on LUCENE-1076:


maxDoc() does reflect the number of docs in the index.  It's simply the sum of 
docCount for all segments.  Shuffling the order of the segments, or allowing 
non-contiguous segments to be merged, won't change how maxDoc() is computed.

New docIDs are allocated by incrementing an integer (starting with 0) for the 
buffered docs.  When a segment gets flushed, we reset that to 0.  I.e., docIDs 
are stored within one segment; they have no "context" from prior segments.
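
A tiny illustration of that numbering (the segment sizes are hypothetical):

{code:java}
// Per-segment docIDs always restart at 0; a multi-segment reader derives
// global IDs by adding the docCounts of the preceding segments.
public class DocIdBases {
  public static void main(String[] args) {
    int[] segmentDocCounts = { 3, 2, 4 };
    int base = 0;
    for (int i = 0; i < segmentDocCounts.length; i++) {
      // a doc with local ID d in segment i has global ID base + d
      System.out.println("segment " + i + ": base=" + base
          + ", local IDs 0.." + (segmentDocCounts[i] - 1));
      base += segmentDocCounts[i];
    }
    System.out.println("maxDoc() = " + base); // sum of docCounts = 9
  }
}
{code}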

> Allow MergePolicy to select non-contiguous merges
> -
>
> Key: LUCENE-1076
> URL: https://issues.apache.org/jira/browse/LUCENE-1076
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1076.patch
>
>
> I started work on this but with LUCENE-1044 I won't make much progress
> on it for a while, so I want to checkpoint my current state/patch.
> For backwards compatibility we must leave the default MergePolicy as
> selecting contiguous merges.  This is necessary because some
> applications rely on "temporal monotonicity" of doc IDs, which means
> even though merges can re-number documents, the renumbering will
> always reflect the order in which the documents were added to the
> index.
> Still, for those apps that do not rely on this, we should offer a
> MergePolicy that is free to select the best merges regardless of
> whether they are contiguous.  This requires fixing IndexWriter to 
> accept such a merge, and, fixing LogMergePolicy to optionally allow
> it the freedom to do so.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1076) Allow MergePolicy to select non-contiguous merges

2009-07-22 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734174#action_12734174
 ] 

Shai Erera commented on LUCENE-1076:


Oh. Thanks for correcting me. In that case, I take what I said back.

I think this together w/ LUCENE-1750 can really help speed up segment merges in 
certain scenarios. Will wait for you to come back to it :)

> Allow MergePolicy to select non-contiguous merges
> -
>
> Key: LUCENE-1076
> URL: https://issues.apache.org/jira/browse/LUCENE-1076
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1076.patch
>
>
> I started work on this but with LUCENE-1044 I won't make much progress
> on it for a while, so I want to checkpoint my current state/patch.
> For backwards compatibility we must leave the default MergePolicy as
> selecting contiguous merges.  This is necessary because some
> applications rely on "temporal monotonicity" of doc IDs, which means
> even though merges can re-number documents, the renumbering will
> always reflect the order in which the documents were added to the
> index.
> Still, for those apps that do not rely on this, we should offer a
> MergePolicy that is free to select the best merges regardless of
> whether they are contiguous.  This requires fixing IndexWriter to 
> accept such a merge, and, fixing LogMergePolicy to optionally allow
> it the freedom to do so.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-07-22 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734176#action_12734176
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. Hey Mark. Have you made any progress with that?

Apologies, recently the lure of developing apps for the new iPhone has put paid 
to that :)

I'm still happy that the pseudo-code we outlined in the last couple of comments 
is what is needed to finish this.

bq.We can tag team if you want 

Yep, happy to do that. Let me know if you start work to avoid me duplicating 
effort and I'll do the same.

Cheers
Mark



> TimeLimitedIndexReader and associated utility class
> ---
>
> Key: LUCENE-1720
> URL: https://issues.apache.org/jira/browse/LUCENE-1720
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Attachments: ActivityTimedOutException.java, 
> ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
> TimeLimitedIndexReader.java
>
>
> An alternative to TimeLimitedCollector that has the following advantages:
> 1) Any reader activity can be time-limited rather than just single searches 
> e.g. the document retrieve phase.
> 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
> before last "collect" stage of query processing)
> Uses new utility timeout class that is independent of IndexReader.
> Initial contribution includes a performance test class but not had time as 
> yet to work up a formal Junit test.
> TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Java caching of low-level index data?

2009-07-22 Thread Michael McCandless
I think it's a neat idea!

But you are in fact fighting the OS so I'm not sure how well this'll
work in practice.

EG the OS will happily swap out pages from your process if it thinks
you're not using them, so it'd easily swap out your cache in favor of
its own IO cache (this is the "swappiness" configuration on Linux),
which would then kill performance (take a page hit when you finally
did need to use your cache).  In C (possibly requiring root) you could
wire the pages, but we can't do that from javaland, so it's already
not a fair fight.

Mike

On Wed, Jul 22, 2009 at 11:56 AM, eks dev wrote:
> imo, it is too low level to do it better than OSs. I agree, cache unloading
> effect would be prevented with it, but I am not sure if it brings net-net
> benefit, you would get this problem fixed, but probably OS would kill you
> anyhow (you took valuable memory from OS) on queries that miss your internal
> cache...
>
> We could try to do better if we put more focus on higher levels and do the
> caching there... maybe even cache somehow some CPU work, e.g.  keep dense
> Postings in "faster, less compressed" format, load TermDictionary into
> RAMDirectory and keep the rest on disk.. Ideas in that direction have better
> chance to bring us forward. Take for example FuzzyQuery, there you can do
> some LRU caching at Term level and save huge amounts of IO and CPU...
>
>
>
>
> From: Shai Erera 
> To: java-dev@lucene.apache.org
> Sent: Wednesday, 22 July, 2009 17:32:34
> Subject: Re: Java caching of low-level index data?
>
> That's an interesting idea.
>
> I always wonder however how much exactly would we gain, vs. the effort spent
> to develop, debug and maintain it. Just some thoughts that we should
> consider regarding this:
>
> * For very large indices, where we think this will generally be good for, I
> believe it's reasonable to assume that the search index will sit on its own
> machine, or set of CPUs, RAM and HD. Therefore given that very few will run
> on the OS other than the search index, I assume the OS cache will be enough
> (if not better)?
>
> * In other cases, where the search app runs together w/ other apps, I'm not
> sure how much we'll gain. I can assume such apps will use a smaller index,
> or will not need to support high query load? If so, will they really care if
> we cache their data, vs. the OS?
>
> Like I said, these are just thoughts. I don't mean to cancel the idea w/
> them, just to think how much will it improve performance (vs. maybe even
> hurt it?). Often I find it that some optimizations that are done will
> benefit very large indices. But these usually get their decent share of
> resources, and the JVM itself is run w/ larger heap etc. So these
> optimizations turn out to not affect such indices much after all. And for
> smaller indices, performance is usually not a problem (well ... they might
> just fit entirely in RAM).
>
> Shai
>
> On Wed, Jul 22, 2009 at 6:21 PM, Nigel  wrote:
>>
>> In discussions of Lucene search performance, the importance of OS caching
>> of index data is frequently mentioned.  The typical recommendation is to
>> keep plenty of unallocated RAM available (e.g. don't gobble it all up with
>> your JVM heap) and try to avoid large I/O operations that would purge the OS
>> cache.
>>
>> I'm curious if anyone has thought about (or even tried) caching the
>> low-level index data in Java, rather than in the OS.  For example, at the
>> IndexInput level there could be an LRU cache of byte[] blocks, similar to
>> how an RDBMS caches index pages.  (Conveniently, BufferedIndexInput already
>> reads in 1k chunks.) You would reverse the advice above and instead make
>> your JVM heap as large as possible (or at least large enough to achieve a
>> desired speed/space tradeoff).
>>
>> This approach seems like it would have some advantages:
>>
>> - Explicit control over how much you want cached (adjust your JVM heap and
>> cache settings as desired)
>> - Cached index data won't be purged by the OS doing other things
>> - Index warming might be faster, or at least more predictable
>>
>> The obvious disadvantage for some situations is that more RAM would now be
>> tied up by the JVM, rather than managed dynamically by the OS.
>>
>> Any thoughts?  It seems like this would be pretty easy to implement
>> (subclass FSDirectory, return subclass of FSIndexInput that checks the cache
>> before reading, cache keyed on filename + position), but maybe I'm
>> oversimplifying, and for that matter a similar implementation may already
>> exist somewhere for all I know.
>>
>> Thanks,
>> Chris
>
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-07-22 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1644:
---

Attachment: LUCENE-1644.patch

Attached patch: fixed some bugs in the last rev, updated test cases,
javadocs, CHANGES.  I also optimized MultiTermQueryWrapperFilter to
use the bulk-read API from termDocs.
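
The bulk-read usage looks roughly like this. A sketch of the idea only, not
the actual patch; it assumes the caller sized the bit set to reader.maxDoc():

{code:java}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.util.OpenBitSet;

class BulkFill {
  // Sketch: fill a bit set from TermDocs in chunks via the bulk
  // read(int[], int[]) API instead of one next()/doc() call per document.
  static void addTermDocs(IndexReader reader, Term term, OpenBitSet bits)
      throws IOException {
    int[] docs = new int[32];
    int[] freqs = new int[32];
    TermDocs td = reader.termDocs(term);
    try {
      int count;
      while ((count = td.read(docs, freqs)) != 0) {
        for (int i = 0; i < count; i++) {
          bits.fastSet(docs[i]); // bits must be sized to reader.maxDoc()
        }
      }
    } finally {
      td.close();
    }
  }
}
{code}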

I confirmed all tests pass if I temporarily switch
CONSTANT_SCORE_FILTER_REWRITE to CONSTANT_SCORE_AUTO_REWRITE_DEFAULT.

I changed QueryParser to use CONSTANT_SCORE_AUTO for rewrite (it was
previously CONSTANT_FILTER).

I still need to run some perf tests to get a rough sense of decent
defaults for CONSTANT_SCORE_AUTO cutover thresholds.

bq. getFilter()/getEnum should stay protected.

OK I made getEnum protected again.

I had tentatively made it public so that one could create their own
[external] rewrite methods.  But I think (if we leave it protected),
one could still make an inner/nested class that can access getEnum().
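
For example (a sketch with hypothetical names), a subclass, or a nested class
inside one, can still reach the protected method:

{code:java}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FilteredTermEnum;
import org.apache.lucene.search.PrefixQuery;

// Sketch: with getEnum() left protected, an external rewrite implementation
// can still obtain the enum by going through a small subclass.
class MyPrefixQuery extends PrefixQuery {
  MyPrefixQuery(Term prefix) {
    super(prefix);
  }

  FilteredTermEnum enumForRewrite(IndexReader reader) throws IOException {
    return getEnum(reader); // protected access is fine from a subclass
  }
}
{code}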

Do we even need getFilter()?  I removed it in the patch.


> Enable MultiTermQuery's constant score mode to also use BooleanQuery under 
> the hood
> ---
>
> Key: LUCENE-1644
> URL: https://issues.apache.org/jira/browse/LUCENE-1644
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1644.patch, LUCENE-1644.patch, LUCENE-1644.patch
>
>
> When MultiTermQuery is used (via one of its subclasses, eg
> WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
> "constant score mode", which pre-builds a filter and then wraps that
> filter as a ConstantScoreQuery.
> If you don't set that, it instead builds a [potentially massive]
> BooleanQuery with one SHOULD clause per term.
> There are some limitations of this approach:
>   * The scores returned by the BooleanQuery are often quite
> meaningless to the app, so, one should be able to use a
> BooleanQuery yet get constant scores back.  (Though I vaguely
> remember at least one example someone raised where the scores were
> useful...).
>   * The resulting BooleanQuery can easily have too many clauses,
> throwing an extremely confusing exception to newish users.
>   * It'd be better to have the freedom to pick "build filter up front"
> vs "build massive BooleanQuery", when constant scoring is enabled,
> because they have different performance tradeoffs.
>   * In constant score mode, an OpenBitSet is always used, yet for
> sparse bit sets this does not give good performance.
> I think we could address these issues by giving BooleanQuery a
> constant score mode, then empower MultiTermQuery (when in constant
> score mode) to pick & choose whether to use BooleanQuery vs up-front
> filter, and finally empower MultiTermQuery to pick the best (sparse vs
> dense) bit set impl.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-1076) Allow MergePolicy to select non-contiguous merges

2009-07-22 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1076:
--

Assignee: (was: Michael McCandless)

Unassigning myself.

> Allow MergePolicy to select non-contiguous merges
> -
>
> Key: LUCENE-1076
> URL: https://issues.apache.org/jira/browse/LUCENE-1076
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1076.patch
>
>
> I started work on this but with LUCENE-1044 I won't make much progress
> on it for a while, so I want to checkpoint my current state/patch.
> For backwards compatibility we must leave the default MergePolicy as
> selecting contiguous merges.  This is necessary because some
> applications rely on "temporal monotonicity" of doc IDs, which means
> even though merges can re-number documents, the renumbering will
> always reflect the order in which the documents were added to the
> index.
> Still, for those apps that do not rely on this, we should offer a
> MergePolicy that is free to select the best merges regardless of
> whether they are contiguous.  This requires fixing IndexWriter to 
> accept such a merge, and, fixing LogMergePolicy to optionally allow
> it the freedom to do so.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1076) Allow MergePolicy to select non-contiguous merges

2009-07-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734190#action_12734190
 ] 

Michael McCandless commented on LUCENE-1076:


bq. Will wait for you to come back to it

Feel free to take it, too :)

I think LUCENE-1737 is also very important for speeding up merging, especially 
because it's so "unexpected" that just adding different fields to your docs, 
or adding the same fields in different orders, can so severely impact merge 
performance.

> Allow MergePolicy to select non-contiguous merges
> -
>
> Key: LUCENE-1076
> URL: https://issues.apache.org/jira/browse/LUCENE-1076
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1076.patch
>
>
> I started work on this but with LUCENE-1044 I won't make much progress
> on it for a while, so I want to checkpoint my current state/patch.
> For backwards compatibility we must leave the default MergePolicy as
> selecting contiguous merges.  This is necessary because some
> applications rely on "temporal monotonicity" of doc IDs, which means
> even though merges can re-number documents, the renumbering will
> always reflect the order in which the documents were added to the
> index.
> Still, for those apps that do not rely on this, we should offer a
> MergePolicy that is free to select the best merges regardless of
> whether they are contiguous.  This requires fixing IndexWriter to 
> accept such a merge, and, fixing LogMergePolicy to optionally allow
> it the freedom to do so.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1076) Allow MergePolicy to select non-contiguous merges

2009-07-22 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734194#action_12734194
 ] 

Shai Erera commented on LUCENE-1076:


bq. Feel free to take it, too

I don't mind taking a stab at it. But this doesn't mean you can unassign 
yourself. I'll need someone to commit it :).

> Allow MergePolicy to select non-contiguous merges
> -
>
> Key: LUCENE-1076
> URL: https://issues.apache.org/jira/browse/LUCENE-1076
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1076.patch
>
>
> I started work on this but with LUCENE-1044 I won't make much progress
> on it for a while, so I want to checkpoint my current state/patch.
> For backwards compatibility we must leave the default MergePolicy as
> selecting contiguous merges.  This is necessary because some
> applications rely on "temporal monotonicity" of doc IDs, which means
> even though merges can re-number documents, the renumbering will
> always reflect the order in which the documents were added to the
> index.
> Still, for those apps that do not rely on this, we should offer a
> MergePolicy that is free to select the best merges regardless of
> whether they are contiguous.  This requires fixing IndexWriter to 
> accept such a merge, and, fixing LogMergePolicy to optionally allow
> it the freedom to do so.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2

2009-07-22 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1754:
---

Attachment: LUCENE-1754.patch

* Added a test case to TestDocIdSet in which a null DocIdSet is returned; 
IndexSearcher indeed failed.
* Fixed IndexSearcher and all other places in the code which call scorer() or 
getDocIdSet() and could potentially hit an NPE.
* Added EmptyDocIdSetIterator for use by classes (such as ChainFilter) that 
need a DISI but got a null DocIdSet.
* Updated CHANGES.

All search tests pass.
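
For reference, such an iterator can be as small as this. A sketch against the
post-LUCENE-1614 DocIdSetIterator API (docID()/nextDoc()/advance()); the
attached patch may differ:

{code:java}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Sketch: an iterator that never matches, for consumers (e.g. ChainFilter)
// that require a DISI but received a null DocIdSet.
class EmptyDocIdSetIterator extends DocIdSetIterator {
  private int doc = -1;

  public int docID() {
    return doc;
  }

  public int nextDoc() throws IOException {
    return doc = NO_MORE_DOCS; // exhausted immediately
  }

  public int advance(int target) throws IOException {
    return doc = NO_MORE_DOCS;
  }
}
{code}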

> Get rid of NonMatchingScorer from BooleanScorer2
> 
>
> Key: LUCENE-1754
> URL: https://issues.apache.org/jira/browse/LUCENE-1754
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1754.patch, LUCENE-1754.patch
>
>
> Over in LUCENE-1614 Mike has made a comment about removing NonMatchingScorer 
> from BS2, and returning null in BooleanWeight.scorer(). I've checked and this 
> can be easily done, so I'm going to post a patch shortly. For reference: 
> https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064.
> I've marked the issue as 2.9 just because it's small, and kind of related to 
> all the search enhancements done for 2.9.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Java caching of low-level index data?

2009-07-22 Thread eks dev

This should not be all that difficult to try. I accept it makes sense in some 
cases ... but which ones?
Background: all my attempts to fight the OS went badly :( 

Let us think again about what Mike's example means.

You are explicitly deciding that Lucene should get a bigger share of RAM. The OS 
will unload these pages if it needs Lucene's RAM for "something else" and you 
are not using them. Right?

If "something else" should get fewer resources, we are on target, but that is 
the end result. For any shared setup where many things run, this decision has 
its consequences: "something else" is going to be starved. 

In the other case, where only Lucene runs, what is the difference whether we 
evict unused pages or the OS does it (better control is just what we gain)? 
This is the case where you are anyhow in a "not really comfortable for real 
caching" situation; otherwise even greedy OSs wouldn't swap (at least in my 
experience with reasonably configured OSs)... 

After thinking about it again, I would say: yes, there are for sure some cases 
where it helps, but not many, and even in those cases the benefit will be 
small.

I guess :)






- Original Message 
> From: Michael McCandless 
> To: java-dev@lucene.apache.org
> Sent: Wednesday, 22 July, 2009 18:37:19
> Subject: Re: Java caching of low-level index data?
> 
> I think it's a neat idea!
> 
> But you are in fact fighting the OS so I'm not sure how well this'll
> work in practice.
> 
> EG the OS will happily swap out pages from your process if it thinks
> you're not using them, so it'd easily swap out your cache in favor of
> its own IO cache (this is the "swappiness" configuration on Linux),
> which would then kill performance (take a page hit when you finally
> did need to use your cache).  In C (possibly requiring root) you could
> wire the pages, but we can't do that from javaland, so it's already
> not a fair fight.
> 
> Mike
> 
> On Wed, Jul 22, 2009 at 11:56 AM, eks devwrote:
> > imo, it is too low level to do it better than OSs. I agree, cache unloading
> > effect would be prevented with it, but I am not sure if it brings net-net
> > benefit, you would get this problem fixed, but probably OS would kill you
> > anyhow (you took valuable memory from OS) on queries that miss your internal
> > cache...
> >
> > We could try to do better if we put more focus on higher levels and do the
> > caching there... maybe even cache somehow some CPU work, e.g.  keep dense
> > Postings in "faster, less compressed" format, load TermDictionary into
> > RAMDirectory and keep the rest on disk.. Ideas in that direction have better
> > chance to bring us forward. Take for example FuzzyQuery, there you can do
> > some LRU caching at Term level and save huge amounts of IO and CPU...
> >
> >
> >
> >
> > From: Shai Erera 
> > To: java-dev@lucene.apache.org
> > Sent: Wednesday, 22 July, 2009 17:32:34
> > Subject: Re: Java caching of low-level index data?
> >
> > That's an interesting idea.
> >
> > I always wonder however how much exactly would we gain, vs. the effort spent
> > to develop, debug and maintain it. Just some thoughts that we should
> > consider regarding this:
> >
> > * For very large indices, where we think this will generally be good for, I
> > believe it's reasonable to assume that the search index will sit on its own
> > machine, or set of CPUs, RAM and HD. Therefore given that very few will run
> > on the OS other than the search index, I assume the OS cache will be enough
> > (if not better)?
> >
> > * In other cases, where the search app runs together w/ other apps, I'm not
> > sure how much we'll gain. I can assume such apps will use a smaller index,
> > or will not need to support high query load? If so, will they really care if
> > we cache their data, vs. the OS?
> >
> > Like I said, these are just thoughts. I don't mean to cancel the idea w/
> > them, just to think how much will it improve performance (vs. maybe even
> > hurt it?). Often I find it that some optimizations that are done will
> > benefit very large indices. But these usually get their decent share of
> > resources, and the JVM itself is run w/ larger heap etc. So these
> > optimizations turn out to not affect such indices much after all. And for
> > smaller indices, performance is usually not a problem (well ... they might
> > just fit entirely in RAM).
> >
> > Shai
> >
> > On Wed, Jul 22, 2009 at 6:21 PM, Nigel wrote:
> >>
> >> In discussions of Lucene search performance, the importance of OS caching
> >> of index data is frequently mentioned.  The typical recommendation is to
> >> keep plenty of unallocated RAM available (e.g. don't gobble it all up with
> >> your JVM heap) and try to avoid large I/O operations that would purge the 
> >> OS
> >> cache.
> >>
> >> I'm curious if anyone has thought about (or even tried) caching the
> >> low-level index data in Java, rather than in the OS.  For example, at the
> >> IndexInput level there could be an LRU cach

[jira] Commented: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2

2009-07-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734202#action_12734202
 ] 

Michael McCandless commented on LUCENE-1754:


For some reason I can't apply the patch -- I get this:
{code}
$ patch -p0 < /x/tmp/LUCENE-1754.patch.txt 
patching file CHANGES.txt
patch:  malformed patch at line 20: @@ -629,6 +638,11 @@
{code}

> Get rid of NonMatchingScorer from BooleanScorer2
> 
>
> Key: LUCENE-1754
> URL: https://issues.apache.org/jira/browse/LUCENE-1754
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1754.patch, LUCENE-1754.patch
>
>
> Over in LUCENE-1614 Mike has made a comment about removing NonMatchingScorer 
> from BS2, and returning null in BooleanWeight.scorer(). I've checked and this 
> can be easily done, so I'm going to post a patch shortly. For reference: 
> https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064.
> I've marked the issue as 2.9 just because it's small, and kind of related to 
> all the search enhancements done for 2.9.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2

2009-07-22 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1754:
---

Attachment: LUCENE-1754.patch

My fault. After I created it, I manually edited the CHANGES section, which 
messed up the line counts.
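
For reference, the hunk header that patch choked on encodes line counts that 
must match the hunk body exactly; editing the text without updating them is 
what makes the patch "malformed":

{code}
@@ -629,6 +638,11 @@
    ^  ^    ^  ^
    |  |    |  +--- hunk length in the new file (context + added lines)
    |  |    +------ hunk start line in the new file
    |  +----------- hunk length in the old file (context + removed lines)
    +-------------- hunk start line in the old file
{code}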

> Get rid of NonMatchingScorer from BooleanScorer2
> 
>
> Key: LUCENE-1754
> URL: https://issues.apache.org/jira/browse/LUCENE-1754
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1754.patch, LUCENE-1754.patch, LUCENE-1754.patch
>
>
> Over in LUCENE-1614 Mike has made a comment about removing NonMatchingScorer 
> from BS2, and returning null in BooleanWeight.scorer(). I've checked and this 
> can be easily done, so I'm going to post a patch shortly. For reference: 
> https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064.
> I've marked the issue as 2.9 just because it's small, and kind of related to 
> all the search enhancements done for 2.9.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Adriano Crestani (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734241#action_12734241
 ] 

Adriano Crestani commented on LUCENE-1486:
--

Hi Mark H.,

Thanks for the response, some comments inline:

{quote}
Correct, the "inner phrase" example was a term not a phrase. This is perhaps a 
better example:

checkBadQuery("\"jo* \"percival smith\" \""); //phrases inside phrases is bad
{quote}

I think you did not get what I meant: even with your new example, there is no 
inner phrase. It is a phrase <"jo* ">, followed by a term <percival>, followed 
by another term <smith>, and an empty phrase <" ">. So, with your change, the 
junit passes, but for the wrong reason: it gets an exception complaining about 
the empty phrase, not because there is an inner phrase (I still don't see 
how you can type an inner phrase with the current syntax). I think it's not a 
big deal, but I'm just trying to understand and to flag a probably wrong test. I 
expect you understood what I mean; let me know if I did not make it clear.

{quote}
The Junit is currently the main form of documentation
{quote}

But that is not ideal, because the source code (junit code) is not included in 
the binary release. So, the ideal place would be the javadocs.

{quote}

* Wildcard/fuzzy/range clauses can be used to define a phrase element (as 
opposed to simply single terms)
* Brackets are used to group/define the acceptable variations for a given 
phrase element e.g. "(john OR jonathon) smith"
* "AND" is irrelevant - there is effectively an implied "AND_NEXT_TO" 
binding all phrase elements

{quote}

Thanks, now it's clearer for me what is supported or not. I have some questions:

I understand this AND_NEXT_TO implicit operator between the queries inside the 
phrase. However, what happens if the user does not type any explicit boolean 
operator between two terms inside parentheses: "(query parser) lucene"? Is the 
operator between 'query' and 'parser' the implicit AND_NEXT_TO or the default 
boolean operator (usually OR)?

What happens if I type "(query AND parser) lucene"? From my point of view it is: 
"(query AND parser) AND_NEXT_TO lucene". Which means for me: find any document 
that contains the term 'query' and the term 'parser' at position x, and the 
term 'lucene' at position x+1. Is this the expected behaviour?

{quote}
1) Keep in core and improve error reporting and documentation
2) Move into "contrib" as experimental
3) Retain in core but simplify it to support only the simplest syntax (as in my 
Britney~ example)
4) Re-engineer the QueryParser.jj to support a formally defined syntax for 
acceptable "within phrase" operators e.g. *, ~, ( )
{quote}

1 is good, but I would prefer 4 too. Documentation and throwing the right 
exceptions are necessary. I just don't feel comfortable with the complex phrase 
query parser relying on the main query parser syntax; any change to the main 
one could easily break the complex phrase QP. Anyway, 4 may be done in the 
future :)

Mark M.:

{quote}
With the new info from Mark H, how hard would it be to create a new imp for the 
new parser that did a lot of this, in a more defined way? It seems you 
basically just want to be able to use multiterm queries and group/or things, 
right? We could even relax a little if we have to. This hasn't been released, 
so there is still a lot of wiggle room I think. But there does have to be a 
resolution with this and the new parser at some point either way.
{quote}

Yes, I am working on the new query parser code. I recently started to read and 
understand how the ComplexPhraseQP works, so I could reproduce the behaviour 
using the new QP framework. I first tried to look at this QP as a user and 
could not figure out what exactly I can or cannot do with it. I think we are now 
hitting a big problem, which is related to documentation. That is why I started 
raising these questions, because others could also have the same issues in the 
future.

So, yes, I can start coding an equivalent QP using the new QP framework; I'm 
just questioning and trying to understand everything before I start any coding. 
I don't want to code anything that will throw ConcurrentModificationExceptions; 
that's why I'm raising these issues now, before I start moving it to the new QP.

Best Regards,
Adriano Crestani Campos


> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phras

[jira] Commented: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2

2009-07-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734245#action_12734245
 ] 

Michael McCandless commented on LUCENE-1754:


OK patch looks good, thanks Shai!

I plan to commit in a day or two.

> Get rid of NonMatchingScorer from BooleanScorer2
> 
>
> Key: LUCENE-1754
> URL: https://issues.apache.org/jira/browse/LUCENE-1754
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1754.patch, LUCENE-1754.patch, LUCENE-1754.patch
>
>
> Over in LUCENE-1614 Mike has made a comment about removing NonMatchingScorer 
> from BS2, and returning null in BooleanWeight.scorer(). I've checked and this 
> can be easily done, so I'm going to post a patch shortly. For reference: 
> https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064.
> I've marked the issue as 2.9 just because it's small, and kind of related to 
> all the search enhancements done for 2.9.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Java caching of low-level index data?

2009-07-22 Thread Michael McCandless
Part of the challenge here is what metric is really important.

Eg, as extreme example, imagine a machine that does searching but also
does other things.  The search is not heavily used; in fact people
only run searches from 9 to 5.  So overnight, the OS notices the
search isn't using the RAM at all, and it happily swaps it out and gives
it to other processes, uses it for IO cache, etc.

So then, the first few searches at 9 AM the next day are really slow
as everything gets swapped back in.

From the OS's standpoint, with the goal of maximizing overall
efficient utilization of the resources, swapping the pages out made
sense, because all these processes overnight ran much more
efficiently.  A few sluggish morning searches was a small price to
pay.

But if consistency of search latency is important, you don't want the
OS to ever do that, and you really need to tune swappiness down, wire
the pages, pre-load your own caches, etc.

Mike

On Wed, Jul 22, 2009 at 1:19 PM, eks dev wrote:
>
> this should not be all that difficult to try. I accept it makes sense in some 
> cases ... but which ones?
> Background: all my attempts to fight the OS went badly :(
>
> Let us think again what does it mean what Mike gave as an example?
>
> You are explicitly deciding that Lucene should get bigger share of RAM. OS 
> will unload these pages
>  if OS needs Lucene  RAM for "something else" and you are not using them. 
> Right?
>
> If "something else" should get less resources, we are on target, but this is 
> end result. For any shared setup where you have many things that run, this 
> decision has its consequences, "something else" is going to be starved.
>
> The other case, where only lucene runs, well what is the difference if we 
> evict unused pages or OS does it (better control is just what we get on 
> benefit)? This is the case where you are anyhow in "not really comfortable 
> for real caching" situation, otherwise even greedy OSs wouldn't swap (at 
> least my experience with reasonably configured OSs)...
>
> after thinking about it again, I would say, yes, there are for sure some 
> cases where it helps, but not many cases and even in these cases benefit will 
> be small.
>
> I guess :)
>
>
>
>
>
>
> - Original Message 
>> From: Michael McCandless 
>> To: java-dev@lucene.apache.org
>> Sent: Wednesday, 22 July, 2009 18:37:19
>> Subject: Re: Java caching of low-level index data?
>>
>> I think it's a neat idea!
>>
>> But you are in fact fighting the OS so I'm not sure how well this'll
>> work in practice.
>>
>> EG the OS will happily swap out pages from your process if it thinks
>> you're not using them, so it'd easily swap out your cache in favor of
>> its own IO cache (this is the "swappiness" configuration on Linux),
>> which would then kill performance (take a page hit when you finally
>> did need to use your cache).  In C (possibly requiring root) you could
>> wire the pages, but we can't do that from javaland, so it's already
>> not a fair fight.
>>
>> Mike
>>
>> On Wed, Jul 22, 2009 at 11:56 AM, eks dev wrote:
>> > imo, it is too low level to do it better than OSs. I agree, cache unloading
>> > effect would be prevented with it, but I am not sure if it brings net-net
>> > benefit, you would get this problem fixed, but probably OS would kill you
>> > anyhow (you took valuable memory from OS) on queries that miss your 
>> > internal
>> > cache...
>> >
>> > We could try to do better if we put more focus on higher levels and do the
>> > caching there... maybe even cache somehow some CPU work, e.g. keep dense
>> > Postings in "faster, less compressed" format, load TermDictionary into
>> > RAMDirectory and keep the rest on disk.. Ideas in that direction have 
>> > better
>> > chance to bring us forward. Take for example FuzzyQuery, there you can do
>> > some LRU caching at Term level and save huge amounts of IO and CPU...
>> >
>> >
>> >
>> >
>> > From: Shai Erera
>> > To: java-dev@lucene.apache.org
>> > Sent: Wednesday, 22 July, 2009 17:32:34
>> > Subject: Re: Java caching of low-level index data?
>> >
>> > That's an interesting idea.
>> >
>> > I always wonder, however, how much exactly we would gain vs. the effort spent
>> > to develop, debug and maintain it. Just some thoughts that we should
>> > consider regarding this:
>> >
>> > * For very large indices, which we think this will generally be good for, I
>> > believe it's reasonable to assume that the search index will sit on its own
>> > machine, with its own set of CPUs, RAM and HD. Therefore, given that very
>> > little will run on the OS other than the search index, I assume the OS cache
>> > will be enough (if not better)?
>> >
>> > * In other cases, where the search app runs together w/ other apps, I'm not
>> > sure how much we'll gain. I assume such apps will use a smaller index,
>> > or will not need to support a high query load? If so, will they really care
>> > whether we cache their data, vs. the OS?
>> >
>> > Like I said, these are just thoughts. I don't
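
Picking up eks dev's "load TermDictionary into RAMDirectory and keep the rest
on disk" idea from the quote above: Lucene 2.9's FileSwitchDirectory routes
files by extension, so a rough, untested sketch could copy just the
term-dictionary files (.tis/.tii) into RAM and leave everything else on disk.
FileSwitchDirectory does no copying itself, hence the manual loop; the index
path is a placeholder:

  import java.io.File;
  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.Set;
  import org.apache.lucene.store.*;

  class TermDictInRam {
    static Directory open(File indexPath) throws Exception {
      Directory fsDir = FSDirectory.open(indexPath);
      RAMDirectory ramDir = new RAMDirectory();
      // Copy only the term-dictionary files into RAM.
      for (String name : fsDir.listAll()) {
        if (name.endsWith(".tis") || name.endsWith(".tii")) {
          IndexInput in = fsDir.openInput(name);
          IndexOutput out = ramDir.createOutput(name);
          byte[] buf = new byte[4096];
          for (long left = in.length(); left > 0; ) {
            int chunk = (int) Math.min(buf.length, left);
            in.readBytes(buf, 0, chunk);
            out.writeBytes(buf, 0, chunk);
            left -= chunk;
          }
          out.close();
          in.close();
        }
      }
      // Route tis/tii reads to the RAM copy, everything else to disk.
      Set ramExts = new HashSet(Arrays.asList(new String[] { "tis", "tii" }));
      return new FileSwitchDirectory(ramExts, ramDir, fsDir, true);
    }
  }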

Re: [jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Mark Miller

Another point: I originally didn't think the new parser was going to make 2.9.  
Now that it looks like it might, we prob don't want to release a class that 
extends the deprecated parser. Getting something similar with the new parser 
would be much preferable - even if it's a bit different. 

- Mark

http://www.lucidimagination.com (mobile)

On Jul 22, 2009, at 2:39 PM, "Adriano Crestani (JIRA)"  wrote:


   [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734241#action_12734241
 ] 

Adriano Crestani commented on LUCENE-1486:
--

Hi Mark H.,

Thanks for the response, some comments inline:

{quote}
Correct, the "inner phrase" example was a term not a phrase. This is perhaps a 
better example:

checkBadQuery("\"jo* \"percival smith\" \""); //phrases inside phrases is bad
{quote}

I think you did not get what I meant: even with your new example, there is no 
inner phrase. It is a phrase <"jo* ">, followed by a term <percival>, followed 
by another term <smith>, and an empty phrase <" ">. So, with your change, the 
JUnit test passes, but for the wrong reason: it gets an exception complaining 
about the empty phrase, not because there is an inner phrase (I still don't see 
how you can type an inner phrase with the current syntax). I think it's not a 
big deal, but I'm just trying to understand and to flag a probably wrong test. I 
hope you understood what I meant; let me know if I did not make it clear.

{quote}
The Junit is currently the main form of documentation
{quote}

But not ideal, because the source code (JUnit code) is not shipped in the 
binary release. So the ideal place would be the javadocs.

{quote}

   * Wildcard/fuzzy/range clauses can be used to define a phrase element (as 
opposed to simply single terms)
   * Brackets are used to group/define the acceptable variations for a given 
phrase element e.g. "(john OR jonathon) smith"
   * "AND" is irrelevant - there is effectively an implied "AND_NEXT_TO" 
binding all phrase elements

{quote}

Thanks, now it's clearer for me what is supported or not. I have some questions:

I understand this AND_NEXT_TO implicit operator between the queries inside the 
phrase. However, what happens if the user does not type any explicit boolean 
operator between two terms inside parentheses: "(query parser) lucene". Is the 
operator between 'query' and 'parser' the implicit AND_NEXT_TO or the default 
boolean operator (usually OR)?

What happens if I type "(query AND parser) lucene"? From my point of view it is: 
"(query AND parser) AND_NEXT_TO lucene", which means: find any document that 
contains the term 'query' and the term 'parser' at position x, and the term 
'lucene' at position x+1. Is this the expected behaviour?

{quote}
1) Keep in core and improve error reporting and documentation
2) Move into "contrib" as experimental
3) Retain in core but simplify it to support only the simplest syntax (as in my 
Britney~ example)
4) Re-engineer the QueryParser.jj to support a formally defined syntax for 
acceptable "within phrase" operators e.g. *, ~, ( )
{quote}

Option 1 is good, but I would prefer 4 too. Documentation and throwing the right 
exceptions are necessary. I just don't feel comfortable with the complex phrase 
query parser relying on the main query parser syntax; any change to the main 
one could easily break the complex phrase QP. Anyway, 4 may be done in the future 
:)

Mark M.:

{quote}
With the new info from Mark H, how hard would it be to create a new impl for the 
new parser that did a lot of this, in a more defined way? It seems you 
basically just want to be able to use multiterm queries and group/or things, 
right? We could even relax a little if we have to. This hasn't been released, 
so there is still a lot of wiggle room I think. But there does have to be a 
resolution with this and the new parser at some point either way.
{quote}

Yes, I am working on the new query parser code. I started recently to read and 
understand how the ComplexPhraseQP works, so I could reproduce the behaviour 
using the new QP framework. I first tried to look at this QP as a user and 
could not figure out what exactly I can or cannot do with it. I think we are now 
hitting a big problem, which is documentation. That is why I started raising 
these questions, because others could have the same issues in the future.

So, yes, I can start coding an equivalent QP using the new QP framework; I'm 
just questioning and trying to understand everything before I start any coding. 
I don't want to code anything that will throw ConcurrentModificationExceptions; 
that's why I'm raising these issues now, before I start moving it to the new QP.

Best Regards,
Adriano Crestani Campos
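
To make Mark H's rules quoted above concrete, a small hedged sketch of driving 
the parser attached to this issue; it assumes ComplexPhraseQueryParser keeps 
the usual QueryParser-style (field, Analyzer) constructor:

{code:java}
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;

public class ComplexPhraseDemo {
  public static void main(String[] args) throws ParseException {
    // ComplexPhraseQueryParser is the class attached to this issue.
    ComplexPhraseQueryParser qp =
        new ComplexPhraseQueryParser("name", new WhitespaceAnalyzer());
    Query grouped  = qp.parse("\"(john OR jonathon) smith\""); // grouped variations for one phrase element
    Query wildcard = qp.parse("\"jo* smith\"~2");              // wildcard element plus slop
    System.out.println(grouped + " / " + wildcard);
  }
}
{code}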


Wildcards, ORs etc inside Phrase queries


   Key: LUCENE-1486
   URL: https://issues.apache.org/jira/browse/L

Re: Java caching of low-level index data?

2009-07-22 Thread eks dev

>Part of the challenge here is what metric is really important.
Sure, it depends who you ask :) Lucene is so popular that you can find almost 
every pattern we could come up with. 

Funny, I had to deal with a similar situation. The simplest solution was to set up 
warm-up with constructed queries (from hi-freq terms) well before users start 
shooting... everybody was happy, both user-request latency and the OS... even 
funnier, we do it even today with a RAM disk, not to fight the OS for RAM, but to 
pre-populate our own app-specific caches after updates/restarts... Good warm-up 
tackles a lot of these problems and is not difficult to do.
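
A bare-bones version of that warm-up, assuming the 2.9-era search API; the
index path, field name and term list are placeholders:

  import java.io.File;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.store.FSDirectory;

  class QueryWarmup {
    public static void main(String[] args) throws Exception {
      IndexSearcher searcher =
          new IndexSearcher(FSDirectory.open(new File(args[0])), true);
      String[] hiFreqTerms = { "lucene", "index", "search" }; // placeholder hi-freq terms
      for (String t : hiFreqTerms) {
        // results are discarded; we only want the caches populated
        searcher.search(new TermQuery(new Term("body", t)), 10);
      }
      searcher.close();
    }
  }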





- Original Message 
> From: Michael McCandless 
> To: java-dev@lucene.apache.org
> Sent: Wednesday, 22 July, 2009 21:03:00
> Subject: Re: Java caching of low-level index data?
> 
> Part of the challenge here is what metric is really important.
> 
> Eg, as an extreme example, imagine a machine that does searching but also
> does other things.  The search is not heavily used; in fact people
> only run searches from 9 to 5.  So overnight, the OS notices the
> search isn't using the RAM at all, and it happily swaps it out and gives
> it to other processes, uses it for IO cache, etc.
> 
> So then, the first few searches at 9 AM the next day are really slow
> as everything gets swapped back in.
> 
> From the OS's standpoint, with the goal of maximizing overall
> efficient utilization of the resources, swapping the pages out made
> sense, because all these processes overnight ran much more
> efficiently.  A few sluggish morning searches was a small price to
> pay.
> 
> But if consistency of search latency is important, you don't want the
> OS to ever do that, and you really need to tune swappiness down, wire
> the pages, pre-load your own caches, etc.
> 
> Mike
> 
> On Wed, Jul 22, 2009 at 1:19 PM, eks dev wrote:
> >
> > this should not be all that difficult to try. I accept it makes sense in some
> > cases ... but which ones?
> > Background: all my attempts to fight the OS went bad :(
> >
> > Let us think again about what Mike's example really means.
> >
> > You are explicitly deciding that Lucene should get a bigger share of RAM. The OS
> > will unload these pages if it needs Lucene's RAM for "something else" and you
> > are not using them. Right?
> >
> > If "something else" should get less resources, we are on target, but this 
> > is 
> end result. For any shared setup where you have many things that run, this 
> decision has its consequences, "something else" is going to be starved.
> >
> > The other case, where only Lucene runs: what is the difference between us
> > evicting unused pages and the OS doing it (better control is the only benefit
> > we get)? This is the case where you are anyhow in a "not really comfortable for
> > real caching" situation; otherwise even greedy OSs wouldn't swap (at least in my
> > experience with reasonably configured OSs)...
> >
> > after thinking about it again, I would say: yes, there are for sure some cases
> > where it helps, but not many, and even in those cases the benefit will be
> > small.
> >
> > I guess :)
> >
> >
> >
> >
> >
> >
> > - Original Message 
> >> From: Michael McCandless 
> >> To: java-dev@lucene.apache.org
> >> Sent: Wednesday, 22 July, 2009 18:37:19
> >> Subject: Re: Java caching of low-level index data?
> >>
> >> I think it's a neat idea!
> >>
> >> But you are in fact fighting the OS so I'm not sure how well this'll
> >> work in practice.
> >>
> >> EG the OS will happily swap out pages from your process if it thinks
> >> you're not using them, so it'd easily swap out your cache in favor of
> >> its own IO cache (this is the "swappiness" configuration on Linux),
> >> which would then kill performance (take a page hit when you finally
> >> did need to use your cache).  In C (possibly requiring root) you could
> >> wire the pages, but we can't do that from javaland, so it's already
> >> not a fair fight.
> >>
> >> Mike
> >>
> >> On Wed, Jul 22, 2009 at 11:56 AM, eks dev wrote:
> >> > imo, it is too low level to do it better than OSs. I agree, cache 
> >> > unloading
> >> > effect would be prevented with it, but I am not sure if it brings net-net
> >> > benefit, you would get this problem fixed, but probably OS would kill you
> >> > anyhow (you took valuable memory from OS) on queries that miss your 
> internal
> >> > cache...
> >> >
> >> > We could try to do better if we put more focus on higher levels and do 
> >> > the
> >> > caching there... maybe even cache somehow some CPU work, e.g. keep dense
> >> > Postings in "faster, less compressed" format, load TermDictionary into
> >> > RAMDirectory and keep the rest on disk.. Ideas in that direction have 
> better
> >> > chance to bring us forward. Take for example FuzzyQuery, there you can do
> >> > some LRU caching at Term level and save huge amounts of IO and CPU...
> >> >
> >> >
> >> >
> >> >
> >> > From: Shai Erera
> >> > To: java-dev@lucene.apache.org
> >> > S

[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-22 Thread Dave Been (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734274#action_12734274
 ] 

Dave Been commented on LUCENE-1693:
---

My first post to the list; it appears I should comment here in JIRA, not 
reply to email. Apologies if I did this wrong.

I've been following this AttributeSource/TokenStream patch thread and reviewing 
the changes and the backwards-compatibility issues.  
Extremely interesting problem/solution.

While looking at Uwe's PerfTest3 I noticed an unused allocation in the last run, 
for "reused stream (new API only)":

  for (int i = 0; i < c; i++) {
    if (i == 1000) t = System.currentTimeMillis();
    tz.reset(new StringReader(text));
    // Token reusableToken = new Token();    << This one
    int num = 0;
    while (tok.incrementToken()) {
      num++;
    }
  }


Just a small cost, but removing it makes the new reusable API slightly faster.

With extra alloc:

Time for 10 runs with new instances (old API): 12.75s
Time for 10 runs with reused stream (old API): 9.969s
Time for 10 runs with new instances (new API only): 13.969s
Time for 10 runs with reused stream (new API only): 11.735s   <<


Without extra alloc (changes only the last line's time):

Time for 10 runs with new instances (old API): 12.593s
Time for 10 runs with reused stream (old API): 9.578s
Time for 10 runs with new instances (new API only): 13.75s
Time for 10 runs with reused stream (new API only): 11.453s   <<



dave

> AttributeSource/TokenStream API improvements
> 
>
> Key: LUCENE-1693
> URL: https://issues.apache.org/jira/browse/LUCENE-1693
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, lucene-1693.patch, 
> LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, 
> PerfTest3.java, TestAPIBackwardsCompatibility.java, TestCompatibility.java, 
> TestCompatibility.java, TestCompatibility.java, TestCompatibility.java
>
>
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
> - introduces interfaces for all Attributes. The corresponding
>   implementations have the postfix 'Impl', e.g. TermAttribute and
>   TermAttributeImpl. AttributeSource now has a factory for creating
>   the Attribute instances; the default implementation looks for
>   implementing classes with the postfix 'Impl'. Token now implements
>   all 6 TokenAttribute interfaces.
> - new method added to AttributeSource:
>   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
>   class hierarchy of the passed in object and finds all interfaces
>   that the class or superclasses implement and that extend the
>   Attribute interface. It then adds the interface->instance mappings
>   to the attribute map for each of the found interfaces.
> - removes the set/getUseNewAPI() methods (including the standard
>   ones). Instead it is now enough to only implement the new API,
>   if one old TokenStream implements still the old API (next()/next(Token)),
>   it is wrapped automatically. The delegation path is determined via
>   reflection (the patch determines, which of the three methods was
>   overridden).
> - Token is no longer deprecated, instead it implements all 6 standard
>   token interfaces (see above). The wrapper for next() and next(Token)
>   uses this, to automatically map all attribute interfaces to one
>   TokenWrapper instance (implementing all 6 interfaces), that contains
>   a Token instance. next() and next(Token) exchange the inner Token
>   instance as needed. For the new incrementToken(), only one
>   TokenWrapper instance is visible, delegating to the currect reusable
>   Token. This API also preserves custom Token subclasses, that maybe
>   created by very special token streams (see example in Backwards-Test).
> - AttributeImpl now has a default implementation of toString that uses
>   reflection to print out the values of the attributes in a default
>   formatting. This makes it a bit easier to implement AttributeImpl,
>   because toString() was declared abstract before.
> - Cloning is now done much more efficiently in
>   captureState. The method figures out which unique AttributeImpl
>   instances are contained as values in the attributes map, because
>   those are the ones that need to

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

2009-07-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734280#action_12734280
 ] 

Michael McCandless commented on LUCENE-1448:


This approach (adding end()) sounds good!

> add getFinalOffset() to TokenStream
> ---
>
> Key: LUCENE-1448
> URL: https://issues.apache.org/jira/browse/LUCENE-1448
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
> LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a 
> document, and you then index those fields with TermVectors storing offsets, 
> it's very likely the offsets for all but the first field instance will be 
> wrong.
> This is because IndexWriter under the hood adds a cumulative base to the 
> offsets of each field instance, where that base is 1 + the endOffset of the 
> last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
> is being used, and the text being analyzed ended in 3 whitespace characters, 
> then that information is lost and then next field's offsets are then all 3 
> too small.  Similarly, if a StopFilter appears in the chain, and the last N 
> tokens were stop words, then the base will be 1 + the endOffset of the last 
> non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
> thinking by default it returns -1, which means "I don't know so you figure it 
> out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734289#action_12734289
 ] 

Michael Busch commented on LUCENE-1693:
---

Thanks, Dave... I'll remove that unused allocation before committing.

> AttributeSource/TokenStream API improvements
> 
>
> Key: LUCENE-1693
> URL: https://issues.apache.org/jira/browse/LUCENE-1693
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, lucene-1693.patch, 
> LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, 
> PerfTest3.java, TestAPIBackwardsCompatibility.java, TestCompatibility.java, 
> TestCompatibility.java, TestCompatibility.java, TestCompatibility.java
>
>
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
> - introduces interfaces for all Attributes. The corresponding
>   implementations have the postfix 'Impl', e.g. TermAttribute and
>   TermAttributeImpl. AttributeSource now has a factory for creating
>   the Attribute instances; the default implementation looks for
>   implementing classes with the postfix 'Impl'. Token now implements
>   all 6 TokenAttribute interfaces.
> - new method added to AttributeSource:
>   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
>   class hierarchy of the passed in object and finds all interfaces
>   that the class or superclasses implement and that extend the
>   Attribute interface. It then adds the interface->instance mappings
>   to the attribute map for each of the found interfaces.
> - removes the set/getUseNewAPI() methods (including the standard
>   ones). Instead it is now enough to only implement the new API,
>   if one old TokenStream implements still the old API (next()/next(Token)),
>   it is wrapped automatically. The delegation path is determined via
>   reflection (the patch determines, which of the three methods was
>   overridden).
> - Token is no longer deprecated, instead it implements all 6 standard
>   token interfaces (see above). The wrapper for next() and next(Token)
>   uses this, to automatically map all attribute interfaces to one
>   TokenWrapper instance (implementing all 6 interfaces), that contains
>   a Token instance. next() and next(Token) exchange the inner Token
>   instance as needed. For the new incrementToken(), only one
>   TokenWrapper instance is visible, delegating to the currect reusable
>   Token. This API also preserves custom Token subclasses, that maybe
>   created by very special token streams (see example in Backwards-Test).
> - AttributeImpl now has a default implementation of toString that uses
>   reflection to print out the values of the attributes in a default
>   formatting. This makes it a bit easier to implement AttributeImpl,
>   because toString() was declared abstract before.
> - Cloning is now done much more efficiently in
>   captureState. The method figures out which unique AttributeImpl
>   instances are contained as values in the attributes map, because
>   those are the ones that need to be cloned. It creates a single
>   linked list that supports deep cloning (in the inner class
>   AttributeSource.State). AttributeSource keeps track of when this
>   state changes, i.e. whenever new attributes are added to the
>   AttributeSource. Only in that case will captureState recompute the
>   state, otherwise it will simply clone the precomputed state and
>   return the clone. restoreState(AttributeSource.State) walks the
>   linked list and uses the copyTo() method of AttributeImpl to copy
>   all values over into the attribute that the source stream
>   (e.g. SinkTokenizer) uses. 
> - Tee- and SinkTokenizer were deprecated, because they use
> Token instances for caching. This is not compatible to the new API
> using AttributeSource.State objects. You can still use the old
> deprecated ones, but new features provided by new Attribute types
> may get lost in the chain. A replacement is a new TeeSinkTokenFilter,
> which has a factory to create new Sink instances, that have compatible
> attributes. Sink instances created by one Tee can also be added to
> another Tee, as long as the attribute implementations are compatible
> (it is not possible to add a sink from a tee using one Token instance
> to a tee using the six separate attribute impls). In this case UOE is thrown.
> The cloning performance can be greatl

[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734291#action_12734291
 ] 

Michael Busch commented on LUCENE-1693:
---

OK, I think we're finally ready to commit here!

I'll wait until Friday - if nobody objects by then, I will commit the latest 
patch.

> AttributeSource/TokenStream API improvements
> 
>
> Key: LUCENE-1693
> URL: https://issues.apache.org/jira/browse/LUCENE-1693
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, lucene-1693.patch, 
> LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, 
> PerfTest3.java, TestAPIBackwardsCompatibility.java, TestCompatibility.java, 
> TestCompatibility.java, TestCompatibility.java, TestCompatibility.java
>
>
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
> - introduces interfaces for all Attributes. The corresponding
>   implementations have the postfix 'Impl', e.g. TermAttribute and
>   TermAttributeImpl. AttributeSource now has a factory for creating
>   the Attribute instances; the default implementation looks for
>   implementing classes with the postfix 'Impl'. Token now implements
>   all 6 TokenAttribute interfaces.
> - new method added to AttributeSource:
>   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
>   class hierarchy of the passed in object and finds all interfaces
>   that the class or superclasses implement and that extend the
>   Attribute interface. It then adds the interface->instance mappings
>   to the attribute map for each of the found interfaces.
> - removes the set/getUseNewAPI() methods (including the standard
>   ones). Instead it is now enough to only implement the new API,
>   if one old TokenStream implements still the old API (next()/next(Token)),
>   it is wrapped automatically. The delegation path is determined via
>   reflection (the patch determines, which of the three methods was
>   overridden).
> - Token is no longer deprecated, instead it implements all 6 standard
>   token interfaces (see above). The wrapper for next() and next(Token)
>   uses this, to automatically map all attribute interfaces to one
>   TokenWrapper instance (implementing all 6 interfaces), that contains
>   a Token instance. next() and next(Token) exchange the inner Token
>   instance as needed. For the new incrementToken(), only one
>   TokenWrapper instance is visible, delegating to the currect reusable
>   Token. This API also preserves custom Token subclasses, that maybe
>   created by very special token streams (see example in Backwards-Test).
> - AttributeImpl now has a default implementation of toString that uses
>   reflection to print out the values of the attributes in a default
>   formatting. This makes it a bit easier to implement AttributeImpl,
>   because toString() was declared abstract before.
> - Cloning is now done much more efficiently in
>   captureState. The method figures out which unique AttributeImpl
>   instances are contained as values in the attributes map, because
>   those are the ones that need to be cloned. It creates a single
>   linked list that supports deep cloning (in the inner class
>   AttributeSource.State). AttributeSource keeps track of when this
>   state changes, i.e. whenever new attributes are added to the
>   AttributeSource. Only in that case will captureState recompute the
>   state, otherwise it will simply clone the precomputed state and
>   return the clone. restoreState(AttributeSource.State) walks the
>   linked list and uses the copyTo() method of AttributeImpl to copy
>   all values over into the attribute that the source stream
>   (e.g. SinkTokenizer) uses. 
> - Tee- and SinkTokenizer were deprecated, because they use
> Token instances for caching. This is not compatible to the new API
> using AttributeSource.State objects. You can still use the old
> deprecated ones, but new features provided by new Attribute types
> may get lost in the chain. A replacement is a new TeeSinkTokenFilter,
> which has a factory to create new Sink instances, that have compatible
> attributes. Sink instances created by one Tee can also be added to
> another Tee, as long as the attribute implementations are compatible
> (it is not possible to add a sink from a tee using one Token instance
> to a tee using the six separate attribute impls). 

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

2009-07-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734292#action_12734292
 ] 

Michael Busch commented on LUCENE-1448:
---

Cool, I will take this approach and submit a patch as soon as LUCENE-1693 is 
committed.

> add getFinalOffset() to TokenStream
> ---
>
> Key: LUCENE-1448
> URL: https://issues.apache.org/jira/browse/LUCENE-1448
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
> LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a 
> document, and you then index those fields with TermVectors storing offsets, 
> it's very likely the offsets for all but the first field instance will be 
> wrong.
> This is because IndexWriter under the hood adds a cumulative base to the 
> offsets of each field instance, where that base is 1 + the endOffset of the 
> last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
> is being used, and the text being analyzed ended in 3 whitespace characters, 
> then that information is lost and then next field's offsets are then all 3 
> too small.  Similarly, if a StopFilter appears in the chain, and the last N 
> tokens were stop words, then the base will be 1 + the endOffset of the last 
> non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
> thinking by default it returns -1, which means "I don't know so you figure it 
> out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734296#action_12734296
 ] 

Michael Busch commented on LUCENE-1486:
---

I think the best thing to do here is to define exactly what syntax is supposed 
to be supported (which Mark H. did in his latest comment), and then implement 
the new syntax with the new queryparser. It will enforce correct syntax and 
give meaningful exceptions if a query is entered that is not supported.

I think we can still reuse big portions of Mark's patch: we should be able to 
write a new QueryBuilder that produces the new ComplexPhraseQuery.

Adriano/Luis: how long would it take to implement? Can we still fit it into 2.9?

This would mean that these new features would go into contrib in 2.9 as part of 
the new query parser framework, and then be moved to core in 3.0. Also, from 3.0 
on, these new features would be part of Lucene's main query syntax. Would this 
make sense?

> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Reopened: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch reopened LUCENE-1486:
---


Reopening this issue; we haven't made a final decision on how we want to go 
forward yet, but in any case there's remaining work here.

> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734300#action_12734300
 ] 

Luis Alves commented on LUCENE-1486:


Hi Mark H,

I would like to propose 5:
5) Re-engineer the QueryParser.jj to support a formally defined syntax for 
acceptable "within phrase" operators e.g. *, ~, ( ).
I propose doing this using the new QP implementation (I can write 
the new javacc QP for this).
(This implies that the code will be in contrib in 2.9 and be part of core 
in 3.0.)

I also want to propose changing the complex phrase syntax to use single quotes; 
this way we can have both implementations for phrases.

Here is a summary:
- the complex query parser would support all Lucene syntax, even for phrases
- and we could add single-quoted text to identify complex phrases
1) Wildcard/fuzzy/range clauses can be used to define a phrase element (as 
opposed to simply single terms)
2) Brackets are used to group/define the acceptable variations for a given 
phrase element e.g. "(john OR jonathon) smith"
3) supported operators: OR, *, ~, ( ), ?
4) disallow fields, proximity, boosting and operators on single-quoted 
phrases (I'm making an assumption here, Mark H please comment)
5) single quotes need to be escaped; double quotes will be treated as 
regular punctuation characters inside single-quoted strings


Mark H, can you please elaborate more on these other operators: "+" "-" "^" 
"AND" "&&" "||" "NOT" "!" ":" "[" "]" "{" "}".

Example:
A query with a single-quoted (complex) phrase followed by a term and a normal 
phrase:

query: '(john OR jonathon) smith~0.3 order*' order:sell  "stock market"  



> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-07-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734301#action_12734301
 ] 

Uwe Schindler commented on LUCENE-1644:
---

Hi Mike,

patch looks good. I was a little bit confused about the high term-number cutoff, 
but it is using Math.max to limit it to the current BooleanQuery max 
clause count.

Some small things:

bq. OK I made getEnum protected again.

...but only in MultiTermQuery itself. Everywhere else (even in the backwards 
compatibility override test [JustCompile]) it is public.

Also, the current singletons are not really singletons, because queries that are 
deserialized will contain instances that are not the "singleton" instances :) - 
and will therefore fail hashcode/equals tests. The problem behind this: the 
singletons are serializable but do not return themselves from readResolve() 
(not implemented). All singletons that are serializable must implement 
readResolve and return the singleton instance (see the Parameter base 
class or the parser singletons in FieldCache).
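
A minimal sketch of the readResolve() idiom described here (the class name is 
made up; only the pattern matters):

{code:java}
import java.io.Serializable;

public final class SingletonRewriteMethod implements Serializable {
  public static final SingletonRewriteMethod INSTANCE = new SingletonRewriteMethod();

  private SingletonRewriteMethod() {}

  // Without this, deserialization creates a second instance, so identity-based
  // hashCode/equals checks against the "singleton" start failing.
  private Object readResolve() {
    return INSTANCE;
  }
}
{code}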

> Enable MultiTermQuery's constant score mode to also use BooleanQuery under 
> the hood
> ---
>
> Key: LUCENE-1644
> URL: https://issues.apache.org/jira/browse/LUCENE-1644
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1644.patch, LUCENE-1644.patch, LUCENE-1644.patch
>
>
> When MultiTermQuery is used (via one of its subclasses, eg
> WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
> "constant score mode", which pre-builds a filter and then wraps that
> filter as a ConstantScoreQuery.
> If you don't set that, it instead builds a [potentially massive]
> BooleanQuery with one SHOULD clause per term.
> There are some limitations of this approach:
>   * The scores returned by the BooleanQuery are often quite
> meaningless to the app, so, one should be able to use a
> BooleanQuery yet get constant scores back.  (Though I vaguely
> remember at least one example someone raised where the scores were
> useful...).
>   * The resulting BooleanQuery can easily have too many clauses,
> throwing an extremely confusing exception to newish users.
>   * It'd be better to have the freedom to pick "build filter up front"
> vs "build massive BooleanQuery", when constant scoring is enabled,
> because they have different performance tradeoffs.
>   * In constant score mode, an OpenBitSet is always used, yet for
> sparse bit sets this does not give good performance.
> I think we could address these issues by giving BooleanQuery a
> constant score mode, then empower MultiTermQuery (when in constant
> score mode) to pick & choose whether to use BooleanQuery vs up-front
> filter, and finally empower MultiTermQuery to pick the best (sparse vs
> dense) bit set impl.
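
As a sketch of how this surfaces in the API under discussion (the 
rewrite-method constants come from the LUCENE-1644 patch itself, so names may 
still shift before commit):

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PrefixQuery;

public class RewriteModeDemo {
  public static void main(String[] args) {
    PrefixQuery pq = new PrefixQuery(new Term("body", "luc"));
    // Build a filter up front and wrap it with constant scores:
    pq.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);
    // Or expand into a (possibly huge) BooleanQuery instead:
    // pq.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
    // Or let Lucene pick based on term/doc counts:
    // pq.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT);
  }
}
{code}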

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-07-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734301#action_12734301
 ] 

Uwe Schindler edited comment on LUCENE-1644 at 7/22/09 1:38 PM:


Hi Mike,

patch looks good. I was a little bit confused about the high term-number cutoff, 
but it is using Math.max to limit it to the current BooleanQuery max 
clause count.

Some small things:

bq. OK I made getEnum protected again.

...but only in MultiTermQuery itself. Everywhere else (even in the backwards 
compatibility override test [JustCompile]) it is public.

Also, the current singletons are not really singletons, because queries that are 
deserialized will contain instances that are not the "singleton" instances :) - 
and will therefore fail hashcode/equals tests. The problem behind this: the 
singletons are serializable but do not return themselves from readResolve() 
(not implemented). All singletons that are serializable must implement 
readResolve and return the singleton instance (see the Parameter base 
class or the parser singletons in FieldCache).

The instance in the default Auto RewriteMethod is still modifiable, so one could 
change the defaults by setting properties on this instance. Is this intended?

  was (Author: thetaphi):
Hi Mike,

patch looks good. I was a little bit confused about the high term-number cutoff, 
but it is using Math.max to limit it to the current BooleanQuery max 
clause count.

Some small things:

bq. OK I made getEnum protected again.

...but only in MultiTermQuery itself. Everywhere else (even in the backwards 
compatibility override test [JustCompile]) it is public.

Also, the current singletons are not really singletons, because queries that are 
deserialized will contain instances that are not the "singleton" instances :) - 
and will therefore fail hashcode/equals tests. The problem behind this: the 
singletons are serializable but do not return themselves from readResolve() 
(not implemented). All singletons that are serializable must implement 
readResolve and return the singleton instance (see the Parameter base 
class or the parser singletons in FieldCache).
  
> Enable MultiTermQuery's constant score mode to also use BooleanQuery under 
> the hood
> ---
>
> Key: LUCENE-1644
> URL: https://issues.apache.org/jira/browse/LUCENE-1644
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1644.patch, LUCENE-1644.patch, LUCENE-1644.patch
>
>
> When MultiTermQuery is used (via one of its subclasses, eg
> WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
> "constant score mode", which pre-builds a filter and then wraps that
> filter as a ConstantScoreQuery.
> If you don't set that, it instead builds a [potentially massive]
> BooleanQuery with one SHOULD clause per term.
> There are some limitations of this approach:
>   * The scores returned by the BooleanQuery are often quite
> meaningless to the app, so, one should be able to use a
> BooleanQuery yet get constant scores back.  (Though I vaguely
> remember at least one example someone raised where the scores were
> useful...).
>   * The resulting BooleanQuery can easily have too many clauses,
> throwing an extremely confusing exception to newish users.
>   * It'd be better to have the freedom to pick "build filter up front"
> vs "build massive BooleanQuery", when constant scoring is enabled,
> because they have different performance tradeoffs.
>   * In constant score mode, an OpenBitSet is always used, yet for
> sparse bit sets this does not give good performance.
> I think we could address these issues by giving BooleanQuery a
> constant score mode, then empower MultiTermQuery (when in constant
> score mode) to pick & choose whether to use BooleanQuery vs up-front
> filter, and finally empower MultiTermQuery to pick the best (sparse vs
> dense) bit set impl.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-07-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734301#action_12734301
 ] 

Uwe Schindler edited comment on LUCENE-1644 at 7/22/09 1:50 PM:


Hi Mike,

patch looks good. I was a little bit confused about the high term-number cutoff, 
but it is using Math.max to limit it to the current BooleanQuery max 
clause count.

Some small things:

bq. OK I made getEnum protected again.

...but only in MultiTermQuery itself. Everywhere else (even in the backwards 
compatibility override test [JustCompile]) it is public. And the same should go 
for incNumberOfTerms (also protected). I think the rewrite method is internal 
to MultiTermQuery and always implemented in a subclass of MTQ as an inner 
class.

Also, the current singletons are not really singletons, because queries that are 
deserialized will contain instances that are not the "singleton" instances :) - 
and will therefore fail hashcode/equals tests. The problem behind this: the 
singletons are serializable but do not return themselves from readResolve() 
(not implemented). All singletons that are serializable must implement 
readResolve and return the singleton instance (see the Parameter base 
class or the parser singletons in FieldCache).

The instance in the default Auto RewriteMethod is still modifiable, so one could 
change the defaults by setting properties on this instance. Is this intended?

  was (Author: thetaphi):
Hi Mike,

patch looks good. I was a little bit confused about the high term-number cutoff, 
but it is using Math.max to limit it to the current BooleanQuery max 
clause count.

Some small things:

bq. OK I made getEnum protected again.

...but only in MultiTermQuery itself. Everywhere else (even in the backwards 
compatibility override test [JustCompile]) it is public.

Also, the current singletons are not really singletons, because queries that are 
deserialized will contain instances that are not the "singleton" instances :) - 
and will therefore fail hashcode/equals tests. The problem behind this: the 
singletons are serializable but do not return themselves from readResolve() 
(not implemented). All singletons that are serializable must implement 
readResolve and return the singleton instance (see the Parameter base 
class or the parser singletons in FieldCache).

The instance in the default Auto RewriteMethod is still modifiable, so one could 
change the defaults by setting properties on this instance. Is this intended?
  
> Enable MultiTermQuery's constant score mode to also use BooleanQuery under 
> the hood
> ---
>
> Key: LUCENE-1644
> URL: https://issues.apache.org/jira/browse/LUCENE-1644
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1644.patch, LUCENE-1644.patch, LUCENE-1644.patch
>
>
> When MultiTermQuery is used (via one of its subclasses, eg
> WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
> "constant score mode", which pre-builds a filter and then wraps that
> filter as a ConstantScoreQuery.
> If you don't set that, it instead builds a [potentially massive]
> BooleanQuery with one SHOULD clause per term.
> There are some limitations of this approach:
>   * The scores returned by the BooleanQuery are often quite
> meaningless to the app, so, one should be able to use a
> BooleanQuery yet get constant scores back.  (Though I vaguely
> remember at least one example someone raised where the scores were
> useful...).
>   * The resulting BooleanQuery can easily have too many clauses,
> throwing an extremely confusing exception to newish users.
>   * It'd be better to have the freedom to pick "build filter up front"
> vs "build massive BooleanQuery", when constant scoring is enabled,
> because they have different performance tradeoffs.
>   * In constant score mode, an OpenBitSet is always used, yet for
> sparse bit sets this does not give good performance.
> I think we could address these issues by giving BooleanQuery a
> constant score mode, then empower MultiTermQuery (when in constant
> score mode) to pick & choose whether to use BooleanQuery vs up-front
> filter, and finally empower MultiTermQuery to pick the best (sparse vs
> dense) bit set impl.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-ma

[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734323#action_12734323
 ] 

Luis Alves commented on LUCENE-1486:


Mark H - 

Question 1)

I also have a question about position. I added a doc 5 and 6

  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };

for test 
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned, or just doc 2?
I'm assuming position is always important and doc 5 is supposed to be returned, 
correct?

Question 2)
Should these 2 queries behave the same when we fix the problem?
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this does not seem to be working.

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

returns 1, 2 and 5 and not 6, but I was only expecting 6 to be returned.
Can you describe the behavior here?
It looks like the AND is converted into an OR; is that the case?
What is the behavior you want to implement?




> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734323#action_12734323
 ] 

Luis Alves edited comment on LUCENE-1486 at 7/22/09 2:13 PM:
-

Mark H - 

Question 1)

I also have a question about position. I added a doc 5 and 6
{code}
  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };
{code}

for test 
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned or just doc 2 should be returned,
I'm assuming position is always important and doc 5 is supposed to be returned, 
correct?

Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

returns 1,2,5 and not 6, but I was only expecting 6 to be returned,
Can you describe what is the behavior here.
Looks like the and is converted into a OR.
What is the behavior you want to implement?




  was (Author: lafa):
Mark H - 

Question 1)

I also have a question about position. I added a doc 5 and 6

  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };

for test 
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be return by or just doc 2 should be returned,
I'm assuming position is always important and doc 5 is supposed to be returned, 
correct?

Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this does not seem to be working

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

returns 1,2,5 and not 6, but I was only expecting 6 to be returned,
can you describe what is the behavior here.
Look like the and is convert into a OR, that the case.
What is the behavior you want to implement.



  
> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-

[jira] Issue Comment Edited: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734323#action_12734323
 ] 

Luis Alves edited comment on LUCENE-1486 at 7/22/09 2:19 PM:
-

Mark H - 

Question 1)

I also have a question about position. I added a doc 5 and 6
{monospaced}
  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };
{monospaced}

for test 
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned or just doc 2 should be returned,
I'm assuming position is always important and doc 5 is supposed to be returned, 
correct?

Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

returns 1,2,5 and not 6, but I was only expecting 6 to be returned,
Can you describe what is the behavior here.
Looks like the and is converted into a OR.
What is the behavior you want to implement?




  was (Author: lafa):
Mark H - 

Question 1)

I also have a question about position. I added a doc 5 and 6
{{monospaced}}
  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };
{{monospaced}}

for test 
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned or just doc 2 should be returned,
I'm assuming position is always important and doc 5 is supposed to be returned, 
correct?

Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

returns 1,2,5 and not 6, but I was only expecting 6 to be returned,
Can you describe what is the behavior here.
Looks like the and is converted into a OR.
What is the behavior you want to implement?



  
> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734323#action_12734323
 ] 

Luis Alves edited comment on LUCENE-1486 at 7/22/09 2:21 PM:
-

Mark H - 

Question 1)

I also have a question about position. I added a doc 5 and 6
{code:title=TestComplexPhraseQuery.java|borderStyle=solid}
...
  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };
...
{code}

for test 
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned or just doc 2 should be returned,
I'm assuming position is always important and doc 5 is supposed to be returned, 
correct?

Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

returns 1,2,5 and not 6, but I was only expecting 6 to be returned,
Can you describe what is the behavior here.
Looks like the and is converted into a OR.
What is the behavior you want to implement?




  was (Author: lafa):
Mark H - 

Question 1)

I also have a question about position. I added a doc 5 and 6
{monospaced}
  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };
{monospaced}

for test 
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned or just doc 2 should be returned,
I'm assuming position is always important and doc 5 is supposed to be returned, 
correct?

Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

returns 1,2,5 and not 6, but I was only expecting 6 to be returned,
Can you describe what is the behavior here.
Looks like the and is converted into a OR.
What is the behavior you want to implement?



  
> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment

[jira] Issue Comment Edited: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734323#action_12734323
 ] 

Luis Alves edited comment on LUCENE-1486 at 7/22/09 2:22 PM:
-

Mark H - 

Question 1)

I added a doc 5 and 6
{code:title=TestComplexPhraseQuery.java|borderStyle=solid}
...
  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };
...
{code}

for test 
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned or just doc 2 should be returned,
I'm assuming position is always important and doc 5 is supposed to be returned, 
correct?

Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

returns 1,2,5 and not 6, but I was only expecting 6 to be returned,
Can you describe what is the behavior here.
Looks like the and is converted into a OR.
What is the behavior you want to implement?




  was (Author: lafa):
Mark H - 

Question 1)

I also have a question about position. I added a doc 5 and 6
{code:title=TestComplexPhraseQuery.java|borderStyle=solid}
...
  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };
...
{code}

for test 
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned or just doc 2 should be returned,
I'm assuming position is always important and doc 5 is supposed to be returned, 
correct?

Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

returns 1,2,5 and not 6, but I was only expecting 6 to be returned,
Can you describe what is the behavior here.
Looks like the and is converted into a OR.
What is the behavior you want to implement?



  
> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add 

[jira] Issue Comment Edited: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734323#action_12734323
 ] 

Luis Alves edited comment on LUCENE-1486 at 7/22/09 2:24 PM:
-

Mark H - 

Question 1)

I added docs 5 and 6
{code:title=TestComplexPhraseQuery.java|borderStyle=solid}
...
  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };
...
{code}

For the test
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned, or should just doc 2 be returned?
I'm assuming position is always important and doc 5 is supposed to be returned.
Is this the correct behavior?

Question 2)
Should these 2 queries behave the same when we fix the problem?
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
For the query:
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
Doc 6 is also returned, so this feature does not seem to be working.

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me.
The query
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

returns 1, 2, 5 and not 6, but I was expecting only 6 to be returned;
it seems the AND is converted into an OR.
What is the behavior you want to implement?
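
(For context: checkMatches comes from the attached TestComplexPhraseQuery. A helper with these semantics presumably looks something like the sketch below; the parser constructor, the "name"/"id" fields, and the searcher/analyzer fixtures are assumptions, not the actual test code.)

{code:java}
import java.util.*;
import org.apache.lucene.search.*;

// Hypothetical sketch of a checkMatches-style helper: parse the phrase
// syntax under test, run the search, and compare the matched "id" values
// against the comma-separated list of expected ids.
private void checkMatches(String qString, String expectedIds) throws Exception {
  Query q = new ComplexPhraseQueryParser("name", analyzer).parse(qString); // assumed ctor
  TopDocs td = searcher.search(q, null, 100);
  Set<String> expected = new HashSet<String>(Arrays.asList(expectedIds.split(",")));
  Set<String> actual = new HashSet<String>();
  for (int i = 0; i < td.scoreDocs.length; i++) {
    actual.add(searcher.doc(td.scoreDocs[i].doc).get("id"));
  }
  assertEquals("query: " + qString, expected, actual);
}
{code}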




  was (Author: lafa):
Mark H - 

Question 1)

I added docs 5 and 6
{code:title=TestComplexPhraseQuery.java|borderStyle=solid}
...
  DocData docsContent[] = { new DocData("john smith", "1"),
  new DocData("johathon smith", "2"),  
  new DocData("john percival smith goes on  a b c vacation", "3"),
  new DocData("jackson waits tom", "4"),
  new DocData("johathon smith john", "5"),
  new DocData("johathon mary gomes smith", "6"),
  };
...
{code}

For the test
checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned, or should just doc 2 be returned?
I'm assuming position is always important and doc 5 is supposed to be returned;
correct?

Question 2)
Should these 2 queries behave the same when we fix the problem?
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3)
checkMatches("\"jo*  smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.

Question 4)
The usage of AND and AND_NEXT_TO is confusing to me.
The query
checkMatches("\"(jo* AND mary)  smith\"", "1,2,5"); // boolean logic with

returns 1, 2, 5 and not 6, but I was expecting only 6 to be returned.
Can you describe the behavior here?
It looks like the AND is converted into an OR.
What is the behavior you want to implement?



  
> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


--

[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734333#action_12734333
 ] 

Luis Alves commented on LUCENE-1486:


Sorry for all the emails.
I'm still new to JIRA, and only now did I realize that an email is sent for
every edit I make.

But now that I've found the preview button, it won't happen again. :)


> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734337#action_12734337
 ] 

Mark Harwood commented on LUCENE-1486:
--

bq. I think it's not a big deal, but I'm just trying to understand and raise a 
probably wrong test.

Granted, the test fails for a reason other than the one for which I wanted it 
to fail. 
We can probably strike the test and leave a note saying phrase-within-a-phrase 
just does not make sense and is not supported.

bq.  Is the operator between 'query' and 'parser' the implicit AND_NEXT_TO or 
the default boolean operator (usually OR)?

In brackets it's an OR - the brackets are used to suggest that the current 
phrase element at position X is composed of some choices that are evaluated as 
a subclause in the same way that in normal query logic sub-clauses are defined 
in brackets e.g. +a +(b OR c). There seems to be a reasonable logic to this.

Ideally the ComplexPhraseQueryParser should explicitly turn this setting on 
while evaluating the bracketed innards of phrases just in case the base class 
has AND as the default.

bq. Mark H, can you please elaborate more on these other operators "+" "-" 
"^" "AND" "&&" "||" "NOT" "!" ":" "[" "]" "{" "}".

OK, I'll try to deal with them one by one, but these are not necessarily 
definitive answers or guarantees of correctly implemented support:

OR, ||, +, AND, && : ignored. The implicit operator is AND_NEXT_TO, apart from 
in bracketed sections, where all elements at this level are ORed.
^ : boosts are carried through from TermQuerys to SpanTermQuerys.
NOT, ! : creates SpanNotQueries.
[] {} : range queries are supported, as are wildcards (*, ?) and fuzzies (~).
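
To make that rundown concrete, "(john OR jonathon) smith" would translate to roughly the following span query (an informal sketch based on the rules above, with an assumed field name "name"; not actual parser output):

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.*;

// Brackets OR their alternatives; the implicit AND_NEXT_TO between phrase
// elements becomes an in-order SpanNear with slop 0.
SpanQuery q = new SpanNearQuery(new SpanQuery[] {
    new SpanOrQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("name", "john")),
        new SpanTermQuery(new Term("name", "jonathon")) }),
    new SpanTermQuery(new Term("name", "smith")) },
  0, true);
{code}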

bq. query: '(john OR jonathon) smith~0.3 order*' order:sell "stock market"


I'll post the XML query syntax equivalent of what should be parsed here shortly 
(just seen your next comment come in) 





> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734349#action_12734349
 ] 

Mark Harwood commented on LUCENE-1486:
--

{quote}For the test checkMatches("\"(jo* -john) smyth\"", "2"); 
would document 5 be returned, or should just doc 2 be returned?
{quote}

I presume you mean smith, not smyth, here; otherwise nothing would match. If so, 
doc 5 should match, and position is relevant (subject to slop factors).

{quote}
Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work
{quote}

I suppose there's an open question as to whether the second example is legal (the 
brackets are unnecessary).



{quote}
Question 3)
checkMatches("\"jo* smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.
{quote}

That looks like a bug related to the slop factor?

{quote}
Question 4)
The usage of AND and AND_NEXT_TO is confusing to me
the query 
checkMatches("\"(jo* AND mary) smith\"", "1,2,5"); // boolean logic with
{quote}
ANDs are ignored and turned into ORs (see earlier comments) but maybe a query 
parse error should be thrown to emphasise this.





> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



getTermInfosIndexDivisor deprecated?

2009-07-22 Thread Jason Rutherglen
It's a get method but the UnsupportedOperationException says "Please
pass termInfosIndexDivisor up-front when opening IndexReader"?  I did
pass it in.  Writing a test case for Solr that checks it.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734355#action_12734355
 ] 

Mark Harwood commented on LUCENE-1486:
--

{quote}
query: '(john OR jonathon) smith~0.3 order*' order:sell "stock market"
{quote}
Would be parsed as follows (shown as equivalent XMLQueryParser syntax)
{code:xml}
<!-- NOTE: the XML element tags in the original mail were stripped by the
     archiver; this is an approximate reconstruction from the surviving
     terms and indentation. -->
<BooleanQuery>
  <Clause occurs="must">
    <SpanNear slop="0" inOrder="true">
      <SpanOr>
        <SpanTerm>john</SpanTerm>
        <SpanTerm>jonathon</SpanTerm>
      </SpanOr>
      <SpanOr>
        <SpanTerm>smith</SpanTerm>
        <SpanTerm>smyth</SpanTerm>
      </SpanOr>
      <SpanOr>
        <SpanTerm>order</SpanTerm>
        <SpanTerm>orders</SpanTerm>
      </SpanOr>
    </SpanNear>
  </Clause>
  <Clause occurs="must">
    <TermQuery fieldName="order">sell</TermQuery>
  </Clause>
  <Clause occurs="must">
    <UserQuery>"stock market"</UserQuery>
  </Clause>
</BooleanQuery>
{code}


> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: getTermInfosIndexDivisor deprecated?

2009-07-22 Thread Michael McCandless
Yeah this was deprecated in LUCENE-1609; I guess we could keep the
getter alive?  I'll reopen it.

Mike

On Wed, Jul 22, 2009 at 6:07 PM, Jason
Rutherglen wrote:
> It's a get method but the UnsupportedOperationException says "Please
> pass termInfosIndexDivisor up-front when opening IndexReader"?  I did
> pass it in.  Writing a test case for Solr that checks it.
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Reopened: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead

2009-07-22 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened LUCENE-1609:



Reopening to un-deprecate getTermInfosIndexDivisor.

> Eliminate synchronization contention on initial index reading in 
> TermInfosReader ensureIndexIsRead 
> ---
>
> Key: LUCENE-1609
> URL: https://issues.apache.org/jira/browse/LUCENE-1609
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.9
> Environment: Solr 
> Tomcat 5.5
> Ubuntu 2.6.20-17-generic
> Intel(R) Pentium(R) 4 CPU 2.80GHz, 2Gb RAM
>Reporter: Dan Rosher
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1609.patch, LUCENE-1609.patch, LUCENE-1609.patch, 
> LUCENE-1609.patch
>
>
> The synchronized method ensureIndexIsRead in TermInfosReader causes contention 
> under heavy load.
> Simple to reproduce: e.g. under Solr, with all caches turned off, run a simple 
> range search, e.g. id:[0 TO 99], on even a small index (in my case 28K 
> docs) under a load/stress test application; examining the thread dump (kill -3) 
> afterwards, many threads are blocked 'waiting for monitor entry' on this method.
> Rather than using double-checked locking, which is known to have issues, this 
> implementation uses a state pattern, where only one thread can move the 
> object from the IndexNotRead state to IndexRead, and in doing so alters the 
> object's behavior; i.e. once the index is loaded, the index no longer needs a 
> synchronized method. 
> In my particular test, this increased throughput at least 30 times.
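
For readers unfamiliar with the pattern, a minimal sketch of the idea (names invented for illustration; this is not the actual TermInfosReader/patch code):

{code:java}
import java.io.IOException;

class LazyIndex {
  private interface State { long[] get() throws IOException; }

  // volatile: the swap to the loaded state must be visible to all threads
  private volatile State state = new NotRead();

  private final class NotRead implements State {
    // synchronized, but only ever reached before the index is loaded; one
    // thread wins, loads the index, and swaps in the unsynchronized state
    public synchronized long[] get() throws IOException {
      if (state == this) {
        state = new Read(readIndexFromDisk());
      }
      return state.get();
    }
  }

  private static final class Read implements State {
    private final long[] data;
    Read(long[] data) { this.data = data; }
    // no lock once loaded, so the monitor contention disappears
    public long[] get() { return data; }
  }

  long[] getIndex() throws IOException {
    return state.get(); // unsynchronized fast path after the first load
  }

  private long[] readIndexFromDisk() throws IOException {
    return new long[] { 42L }; // stand-in for the real term index load
  }
}
{code}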

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [ApacheCon US] Travel Assistance

2009-07-22 Thread Luis Alves

Hi Grant,

I'm located in the Bay Area, so I don't need a hotel or car, and I could 
drive to Oakland, but I was not able to get my company to pay for an 
ApacheCon conference pass. I also wouldn't mind doing a presentation on the 
new QueryParser (JIRA 1567), if I'm able to get to ApacheCon.

As you said below:
"Applications are open to all open source developers who feel that
their attendance would benefit themselves, their project(s), the ASF
and open source in general. ... Conference fees either in full or in part"

Is the assistance restricted to people presenting and committers?

Regards,
Luis Alves

Grant Ingersoll wrote:
The Travel Assistance Committee is taking in applications for those wanting
to attend ApacheCon US 2009 (Oakland), which takes place between the 2nd and
6th November 2009.

The Travel Assistance Committee is looking for people who would like to be
able to attend ApacheCon US 2009 who may need some financial support in
order to get there. There are limited places available, and all applications
will be scored on their individual merit. Applications are open to all open
source developers who feel that their attendance would benefit themselves,
their project(s), the ASF and open source in general.

Financial assistance is available for flights, accommodation, subsistence
and Conference fees either in full or in part, depending on circumstances.
It is intended that all our ApacheCon events are covered, so it may be
prudent for those in Europe and/or Asia to wait until an event closer to
them comes up - you are all welcome to apply for ApacheCon US of course, but
there should be compelling reasons for you to attend an event further away
than your home location for your application to be considered above those
closer to the event location.

More information can be found on the main Apache website at
http://www.apache.org/travel/index.html - where you will also find a link to
the online application and details for submitting.

Applications for applying for travel assistance will open on 27th July 2009
and close on the 17th August 2009.

Good luck to all those that will apply.

Regards,

The Travel Assistance Committee




--
-Lafa



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [ApacheCon US] Travel Assistance

2009-07-22 Thread Chris Hostetter

: Is the assistance restricted to people presenting and committers?

nope...

http://www.apache.org/travel/index.html


-Hoss


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: getTermInfosIndexDivisor deprecated?

2009-07-22 Thread Michael McCandless
OK done.

Mike

On Wed, Jul 22, 2009 at 7:37 PM, Michael
McCandless wrote:
> Yeah this was deprecated in LUCENE-1609; I guess we could keep the
> getter alive?  I'll reopen it.
>
> Mike
>
> On Wed, Jul 22, 2009 at 6:07 PM, Jason
> Rutherglen wrote:
>> It's a get method but the UnsupportedOperationException says "Please
>> pass termInfosIndexDivisor up-front when opening IndexReader"?  I did
>> pass it in.  Writing a test case for Solr that checks it.
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead

2009-07-22 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1609.


Resolution: Fixed

> Eliminate synchronization contention on initial index reading in 
> TermInfosReader ensureIndexIsRead 
> ---
>
> Key: LUCENE-1609
> URL: https://issues.apache.org/jira/browse/LUCENE-1609
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.9
> Environment: Solr 
> Tomcat 5.5
> Ubuntu 2.6.20-17-generic
> Intel(R) Pentium(R) 4 CPU 2.80GHz, 2Gb RAM
>Reporter: Dan Rosher
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1609.patch, LUCENE-1609.patch, LUCENE-1609.patch, 
> LUCENE-1609.patch
>
>
> The synchronized method ensureIndexIsRead in TermInfosReader causes contention 
> under heavy load.
> Simple to reproduce: e.g. under Solr, with all caches turned off, run a simple 
> range search, e.g. id:[0 TO 99], on even a small index (in my case 28K 
> docs) under a load/stress test application; examining the thread dump (kill -3) 
> afterwards, many threads are blocked 'waiting for monitor entry' on this method.
> Rather than using double-checked locking, which is known to have issues, this 
> implementation uses a state pattern, where only one thread can move the 
> object from the IndexNotRead state to IndexRead, and in doing so alters the 
> object's behavior; i.e. once the index is loaded, the index no longer needs a 
> synchronized method. 
> In my particular test, this increased throughput at least 30 times.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-07-22 Thread Adriano Crestani (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734398#action_12734398
 ] 

Adriano Crestani commented on LUCENE-1486:
--

{quote}
I propose doing this using the new QP implementation. (I can write the 
new javacc QP for this.)
(This implies that the code will be in contrib in 2.9 and be part of core in 
3.0.)
{quote}

That would be good!

{quote}
Granted, the test fails for a reason other than the one for which I wanted it 
to fail.
We can probably strike the test and leave a note saying phrase-within-a-phrase 
just does not make sense and is not supported.
{quote}

Cool, I agree to remove it. But I still don't see how a user can type a phrase 
inside a phrase with the current syntax definition. Can you give me an example?

{quote}
In brackets it's an OR - the brackets are used to suggest that the current 
phrase element at position X is composed of some choices that are evaluated as 
a subclause in the same way that in normal query logic sub-clauses are defined 
in brackets e.g. +a +(b OR c). There seems to be a reasonable logic to this.

Ideally the ComplexPhraseQueryParser should explicitly turn this setting on 
while evaluating the bracketed innards of phrases just in case the base class 
has AND as the default.
{quote}

If we use the JavaCC code Luis suggested, we would already have a 
query parser that throws a ParseException whenever the user types an AND inside 
a phrase.

{quote}
OR,||,+, AND, && . ignored
{quote}

So we should throw an exception if any of these is found inside a phrase. It 
could confuse the user if we just ignore it.

{quote}
Question 2)
Should these 2 queries behave the same when we fix the problem
// checkMatches("\"john -percival\"", "1"); // not logic doesn't work
// checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

I suppose there's an open question as to whether the second example is legal (the 
brackets are unnecessary)
{quote}

Yes, the second is unnecessary, but I don't think it's illegal. The user could 
type <(smith)> outside the phrase, so it makes sense to support it inside as well.

{quote}
Question 3)
checkMatches("\"jo* smith\"~2", "1,2,3,5"); // position logic works.
doc 6 is also returned, so this feature does not seem to be working.

That looks like a bug related to the slop factor?
{quote}

I have not checked yet, but I think it's working fine. The slop is how many 
position switches between the terms inside the phrase are allowed for a 
document to match the query. It matches doc 6 because the term 'smith' switches 
twice to the right and matched "johathon mary gomes smith". Twice = slop 2 :)

{quote}
ANDs are ignored and turned into ORs (see earlier comments) but maybe a query 
parse error should be thrown to emphasise this.
{quote}

I think we could support AND also. I agree there are few cases where the user 
would use that. It would work as I explained before:

{quote}
What happens if I type "(query AND parser) lucene". In my point of view it is: 
"(query AND parser) AND_NEXT_TO lucene". Which means for me: find any document 
that contains the term 'query' and the term 'parser' in the position x, and the 
term 'lucene' in the position x+1. Is this the expected behaviour?
{quote}


> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside ph

Re: Lucene 2.9 Again

2009-07-22 Thread Chris Hostetter

: LUCENE-1749 FieldCache introspection API Unassigned 16/Jul/09
: 
:   You have time to work on this Hoss?

i'd have more time if there weren't so many darn solr-user questions that 
no one else answers.

The meat of the patch (adding an API to inspect the cache) could be 
committed as is today -- i just don't know if the API makes sense (needs 
more eyeballs), and the real value-add will be getting the sanity testing 
utilities in place ... those are only about half done.

i'll try to work on it more this week(end) but if there isn't any progress 
from me, someone else (ahem: Miller?) should probably prune it down to 
the core function, add whatever javadocs are missing, and commit.

(better to release with a simple inspection API than to delay 
releasing while a fancy inspection method gets hashed out)



-Hoss


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-07-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734411#action_12734411
 ] 

Michael McCandless commented on LUCENE-1644:


bq. I was a little bit confused about the high term number cut off,

Sorry I still need to do some perf testing to pick an appropriate
default here.

bq.  Everywhere else (even in the backwards compatibility override test 
[JustCompile] it is public).  And the same should be for the incNumberOfTerms 
(also protected).

Woops -- I'll fix.  Thanks for catching even though you're on
"vacation" ;)

bq. Also the current singletons are not really singletons, because queries that 
are deserialized will contain instances that are not the "singleton" instances

Sigh.  I'll do what FieldCache's parser singletons do.

bq. The instance in the default Auto RewriteMethod is still modifiable. Is this 
correct?

I was thinking this was OK, ie, you could set the default cutoffs for
anything that used the AUTO_DEFAULT.  But it is static (global), so
that's not great.  I guess I'll make it an anonymous subclass of
ConstantScoreAutoRewrite that disallows changes.
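
For reference, the usual Java idiom here (and presumably what the FieldCache parser singletons do) is a readResolve() that swaps a freshly deserialized copy back to the canonical instance; a sketch with invented names:

{code:java}
import java.io.ObjectStreamException;
import java.io.Serializable;

class RewriteMethod implements Serializable {
  // the canonical instance handed out to callers
  static final RewriteMethod CONSTANT_SCORE = new RewriteMethod();

  private RewriteMethod() {}

  // invoked by the serialization machinery after deserialization; returning
  // the singleton discards the duplicate instance that was just read
  private Object readResolve() throws ObjectStreamException {
    return CONSTANT_SCORE;
  }
}
{code}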


> Enable MultiTermQuery's constant score mode to also use BooleanQuery under 
> the hood
> ---
>
> Key: LUCENE-1644
> URL: https://issues.apache.org/jira/browse/LUCENE-1644
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1644.patch, LUCENE-1644.patch, LUCENE-1644.patch
>
>
> When MultiTermQuery is used (via one of its subclasses, eg
> WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
> "constant score mode", which pre-builds a filter and then wraps that
> filter as a ConstantScoreQuery.
> If you don't set that, it instead builds a [potentially massive]
> BooleanQuery with one SHOULD clause per term.
> There are some limitations of this approach:
>   * The scores returned by the BooleanQuery are often quite
> meaningless to the app, so, one should be able to use a
> BooleanQuery yet get constant scores back.  (Though I vaguely
> remember at least one example someone raised where the scores were
> useful...).
>   * The resulting BooleanQuery can easily have too many clauses,
> throwing an extremely confusing exception to newish users.
>   * It'd be better to have the freedom to pick "build filter up front"
> vs "build massive BooleanQuery", when constant scoring is enabled,
> because they have different performance tradeoffs.
>   * In constant score mode, an OpenBitSet is always used, yet for
> sparse bit sets this does not give good performance.
> I think we could address these issues by giving BooleanQuery a
> constant score mode, then empower MultiTermQuery (when in constant
> score mode) to pick & choose whether to use BooleanQuery vs up-front
> filter, and finally empower MultiTermQuery to pick the best (sparse vs
> dense) bit set impl.
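
To illustrate the non-constant-score path described above, the rewrite essentially amounts to the following (a schematic sketch with invented field and terms, not the actual MultiTermQuery code):

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// One SHOULD clause per matching term: fine for a handful of terms, but a
// prefix like "jo*" over a large index can blow past maxClauseCount
// (default 1024) and throw BooleanQuery.TooManyClauses at newish users.
BooleanQuery bq = new BooleanQuery(true); // true = disable coord scoring
for (String text : new String[] { "john", "johathon", "jones" }) {
  bq.add(new TermQuery(new Term("name", text)), BooleanClause.Occur.SHOULD);
}
{code}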

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-07-22 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1644:
---

Attachment: LUCENE-1644.patch

New patch attached w/ above fixes plus some javadoc fixes.  It has
some nocommits which I'll clean up before committing.


> Enable MultiTermQuery's constant score mode to also use BooleanQuery under 
> the hood
> ---
>
> Key: LUCENE-1644
> URL: https://issues.apache.org/jira/browse/LUCENE-1644
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1644.patch, LUCENE-1644.patch, LUCENE-1644.patch, 
> LUCENE-1644.patch
>
>
> When MultiTermQuery is used (via one of its subclasses, eg
> WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
> "constant score mode", which pre-builds a filter and then wraps that
> filter as a ConstantScoreQuery.
> If you don't set that, it instead builds a [potentially massive]
> BooleanQuery with one SHOULD clause per term.
> There are some limitations of this approach:
>   * The scores returned by the BooleanQuery are often quite
> meaningless to the app, so, one should be able to use a
> BooleanQuery yet get constant scores back.  (Though I vaguely
> remember at least one example someone raised where the scores were
> useful...).
>   * The resulting BooleanQuery can easily have too many clauses,
> throwing an extremely confusing exception to newish users.
>   * It'd be better to have the freedom to pick "build filter up front"
> vs "build massive BooleanQuery", when constant scoring is enabled,
> because they have different performance tradeoffs.
>   * In constant score mode, an OpenBitSet is always used, yet for
> sparse bit sets this does not give good performance.
> I think we could address these issues by giving BooleanQuery a
> constant score mode, then empower MultiTermQuery (when in constant
> score mode) to pick & choose whether to use BooleanQuery vs up-front
> filter, and finally empower MultiTermQuery to pick the best (sparse vs
> dense) bit set impl.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1756) contrib/memory: PatternAnalyzerTest is a very, very, VERY, bad unit test

2009-07-22 Thread Hoss Man (JIRA)
contrib/memory: PatternAnalyzerTest is a very, very, VERY, bad unit test


 Key: LUCENE-1756
 URL: https://issues.apache.org/jira/browse/LUCENE-1756
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Reporter: Hoss Man
Priority: Minor


while working on something else i was started getting consistent 
IllegalStateExceptions from PatternAnalyzerTest -- but only when running the 
test from the top level.

Digging into the test, i've found numerous things that are very scary...
* instead of using assertions to test that tokens streams match, it throws an 
IllegalStateExceptions when they don't, and then logs a bunch of info about the 
token streams to System.out -- having assertion messages that tell you 
*exactly* what doens't match would make a lot more sense.
* it builds up a list of files to analyze using patsh thta it evaluates 
relative to the current working directory -- which means you get different 
files depending on wether you run the tests fro mthe contrib level, or from the 
top level build file
* the list of files it looks for include: "../../*.txt", "../../*.html", 
"../../*.xml" ... so not only do you get different results when you run the 
tests in the contrib vs at the top level, but different people runing the tests 
via the top level build file will get different results depending on what types 
of text, html, and xml files they happen to have two directories above where 
they checked out lucene.
* the test comments indicate that its purpose is to show that PatternAnalyzer 
produces the same tokens as other analyzers -- but they point out this will fail 
for WhitespaceAnalyzer because of the 255 character token limit 
WhitespaceTokenizer imposes -- the test then proceeds to compare PatternAnalyzer 
to WhitespaceTokenizer anyway, guaranteeing a test failure for anyone who 
happens to have a text file containing more than 255 characters of 
non-whitespace in a row somewhere in "../../" (in my case: my bookmarks.html 
file, and the hex encoded favicon.gif images)
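
As a sketch of the first point (a hypothetical helper written against the old 
TokenStream API -- not code from the test): an assertion-based comparison fails 
with a message naming the exact mismatch, instead of throwing 
IllegalStateException and dumping state to System.out:

{code:java}
import java.io.IOException;
import junit.framework.Assert;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class TokenStreamAsserts {
  static void assertTokenStreamsEqual(TokenStream expected, TokenStream actual)
      throws IOException {
    final Token reusableE = new Token();
    final Token reusableA = new Token();
    int pos = 0;
    while (true) {
      Token e = expected.next(reusableE);
      Token a = actual.next(reusableA);
      if (e == null || a == null) {
        // both streams must end at the same token position
        Assert.assertTrue("streams have different lengths; first missing token: " + pos,
            e == null && a == null);
        return;
      }
      Assert.assertEquals("term differs at token " + pos, e.term(), a.term());
      Assert.assertEquals("startOffset differs at token " + pos,
          e.startOffset(), a.startOffset());
      Assert.assertEquals("endOffset differs at token " + pos,
          e.endOffset(), a.endOffset());
      pos++;
    }
  }
}
{code}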


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Updated: (LUCENE-1749) FieldCache introspection API

2009-07-22 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-1749:
-

Attachment: LUCENE-1749.patch

Minor checkpoint: improved assert messages, and massaged TestRemoteSort so that 
it appears more sane.

The problem with PatternAnalyzerTest was unrelated (see LUCENE-1756).

> FieldCache introspection API
> 
>
> Key: LUCENE-1749
> URL: https://issues.apache.org/jira/browse/LUCENE-1749
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Hoss Man
>Priority: Minor
> Fix For: 2.9
>
> Attachments: fieldcache-introspection.patch, LUCENE-1749.patch, 
> LUCENE-1749.patch, LUCENE-1749.patch
>
>
> FieldCache should expose an Expert level API for runtime introspection of the 
> FieldCache to provide info about what is in the FieldCache at any given 
> moment.  We should also provide utility methods for sanity checking that the 
> FieldCache doesn't contain anything "odd"...
>* entries for the same reader/field with different types/parsers
>    * entries for the same field/type/parser in a reader and its subreader(s)
>* etc...
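
To make the idea concrete, usage might look roughly like this (just a guess at 
the shape -- the method and entry names below are assumptions, not necessarily 
the patch's actual API):

{code:java}
// Hypothetical shape of the introspection API:
FieldCache.CacheEntry[] entries = FieldCache.DEFAULT.getCacheEntries();
for (int i = 0; i < entries.length; i++) {
  FieldCache.CacheEntry e = entries[i];
  System.out.println(e.getReaderKey() + " => field=" + e.getFieldName()
      + ", type=" + e.getCacheType() + ", custom=" + e.getCustom());
}
// A sanity checker could then walk these entries and flag, e.g., two
// entries sharing a reader/field but differing in type/parser, or the
// same field/type/parser cached for both a reader and its subreaders.
{code}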

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1749) FieldCache introspection API

2009-07-22 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734433#action_12734433
 ] 

Hoss Man commented on LUCENE-1749:
--

Hmmm... somehow I overlooked the fact that even after I "fixed" TestRemoteSort 
in the last patch, it's still failing. Here's the assertion failure...
{code}
junit.framework.AssertionFailedError: testRemoteCustomSort Comparator: multi 
FieldCaches for same reader/fieldname with diff types
   at 
org.apache.lucene.util.LuceneTestCase.assertSaneFieldCaches(LuceneTestCase.java:110)
   at 
org.apache.lucene.search.TestRemoteSort.testRemoteCustomSort(TestRemoteSort.java:261)
   at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:265)
{code}

...and here's the debugging dump of the FieldCache...

{code}
*** BEGIN testRemoteCustomSort Comparator: FieldCache Losers ***
'org.apache.lucene.index.directoryrea...@1108727'=>'custom',interface 
java.lang.Comparable,9,org.apache.lucene.search.samplecomparabl...@651e95,null=>[Ljava.lang.Comparable;#22056753
 size guess:2 KB
'org.apache.lucene.index.directoryrea...@1108727'=>'custom',interface 
java.lang.Comparable,9,org.apache.lucene.search.samplecomparabl...@5b78cf,null=>[Ljava.lang.Comparable;#32045680
 size guess:2 KB
*** END testRemoteCustomSort Comparator: FieldCache Losers ***
*** BEGIN org.apache.lucene.search.TestRemoteSort.testRemoteCustomSort: 
FieldCache Losers ***
'org.apache.lucene.index.directoryrea...@1108727'=>'custom',interface 
java.lang.Comparable,9,org.apache.lucene.search.samplecomparabl...@651e95,null=>[Ljava.lang.Comparable;#22056753
 size guess:2 KB
'org.apache.lucene.index.directoryrea...@1108727'=>'custom',interface 
java.lang.Comparable,9,org.apache.lucene.search.samplecomparabl...@5b78cf,null=>[Ljava.lang.Comparable;#32045680
 size guess:2 KB
*** END org.apache.lucene.search.TestRemoteSort.testRemoteCustomSort: 
FieldCache Losers ***
{code}

What really confuses me about this is that the same SampleComparable instance 
is being used with two different queries -- once with reverse=true and once 
with reverse=false -- yet two different SampleComparable instances are showing up 
in the cache keys. This probably only happens when SampleComparable is used to get 
a SortComparator, not when it uses a ComparatorSource earlier in the test.

Is this a real bug in remote sorting?
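
One possible explanation (only a guess): the RMI round trip deserializes the 
SortComparator into a fresh instance, and if the FieldCache key holds the 
comparator with identity-based equals()/hashCode(), each copy becomes a distinct 
key. A self-contained toy sketch of that mechanism (hypothetical classes, 
deliberately simplified -- not FieldCacheImpl's real code):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class CacheKeySketch {
  // Stands in for a comparator that never overrides equals()/hashCode().
  static class Comparator implements java.io.Serializable {}

  // Stands in for the cache key: reader + field + comparator.
  static class Key {
    final Object reader;
    final String field;
    final Object comparator;
    Key(Object reader, String field, Object comparator) {
      this.reader = reader; this.field = field; this.comparator = comparator;
    }
    public boolean equals(Object o) {
      if (!(o instanceof Key)) return false;
      Key k = (Key) o;
      // falls back to identity comparison, since Comparator has no equals()
      return reader == k.reader && field.equals(k.field)
          && comparator.equals(k.comparator);
    }
    public int hashCode() {
      return reader.hashCode() ^ field.hashCode() ^ comparator.hashCode();
    }
  }

  public static void main(String[] args) {
    Object reader = new Object();
    Comparator original = new Comparator();
    Comparator rmiCopy = new Comparator(); // stands in for a deserialized copy

    Map cache = new HashMap();
    cache.put(new Key(reader, "custom", original), "entry 1");
    cache.put(new Key(reader, "custom", rmiCopy), "entry 2");
    // prints 2: one logical comparator, two cache entries
    System.out.println(cache.size());
  }
}
{code}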


> FieldCache introspection API
> 
>
> Key: LUCENE-1749
> URL: https://issues.apache.org/jira/browse/LUCENE-1749
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Hoss Man
>Priority: Minor
> Fix For: 2.9
>
> Attachments: fieldcache-introspection.patch, LUCENE-1749.patch, 
> LUCENE-1749.patch, LUCENE-1749.patch
>
>
> FieldCache should expose an Expert level API for runtime introspection of the 
> FieldCache to provide info about what is in the FieldCache at any given 
> moment.  We should also provide utility methods for sanity checking that the 
> FieldCache doesn't contain anything "odd"...
>* entries for the same reader/field with different types/parsers
>    * entries for the same field/type/parser in a reader and its subreader(s)
>* etc...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

