[jira] Updated: (LUCENE-1791) Enhance QueryUtils and CheckHIts to wrap everything they check in MultiReader/MultiSearcher

2009-08-12 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-1791:
-

Attachment: LUCENE-1791.patch

I figured out the problem with TestComplexExplanations ... the test uses a
searcher with a custom Similarity, and the new code wasn't setting that same
Similarity on the new Searcher & MultiSearcher being created.
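
For reference, a minimal sketch of the kind of fix described (illustrative
only, not the attached patch; "original" and "reader" are placeholder
variables):

{code}
// Hypothetical sketch: wrappers must inherit the wrapped searcher's Similarity.
IndexSearcher wrapped = new IndexSearcher(reader);
wrapped.setSimilarity(original.getSimilarity());   // propagate custom Similarity

MultiSearcher multi = new MultiSearcher(new Searchable[] { wrapped });
multi.setSimilarity(original.getSimilarity());     // ...and again on the wrapper
{code}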

> Enhance QueryUtils and CheckHIts to wrap everything they check in 
> MultiReader/MultiSearcher
> ---
>
> Key: LUCENE-1791
> URL: https://issues.apache.org/jira/browse/LUCENE-1791
> Project: Lucene - Java
>  Issue Type: Test
>Reporter: Hoss Man
> Fix For: 2.9
>
> Attachments: LUCENE-1791.patch, LUCENE-1791.patch, LUCENE-1791.patch
>
>
> Methods in CheckHits & QueryUtils are in a good position to take any Searcher
> they are given and not only test it, but also test MultiReader &
> MultiSearcher constructs built around them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: SpanQuery and Spans optimizations

2009-08-12 Thread Paul Cowan

Michael McCandless wrote:

I think eventually span queries should be absorbed into the normal
lucene queries.  EG, if TermQuery creates a scorer that's able to
optionally enumerate matching spans, such that there's no performance
loss if you don't actually request the spans, then we don't need
SpanTermQuery.


+1 from me. In fact, more than +1. I think if it were possible for all queries
to provide spans in a consistent way, that would be absolutely brilliant. I'm
not sure how best to do this, but it would make Spans a lot more useful.
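
To make the proposal concrete, here is one hypothetical shape it could take
(the interface and method names are invented for illustration; nothing like
this exists in Lucene today):

{code}
// Invented sketch of a scorer that can optionally expose its match
// positions: plain scoring never calls spans(), so it pays nothing extra.
public interface PositionAwareScorer {
  float score() throws IOException;  // normal scoring path, no spans overhead
  Spans spans() throws IOException;  // positions materialized only on demand
}
{code}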


For example, I was going to have to implement SpanRangeQuery recently when I
needed one (though I found a third-party implementation which works fine) --
it would be nice if TermRangeQuery just supported this out of the box; that
would be a lot more flexible.


Happy to help with this effort; this would be really useful for us.

Cheers,

Paul

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

2009-08-12 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742683#action_12742683
 ] 

Michael Busch commented on LUCENE-1801:
---

Patch looks good, Uwe!

When I change Token.clear() to also set the offsets to 0 and the type to
DEFAULT_TYPE, then 'test-core', 'test-contrib' and 'test-tag' still all pass.
I think we could make that change and add a note to the backwards-compatibility
section of CHANGES.txt. I think it is the right behavior to reset everything
in Tokenizer.

Also, the comment in Token.clear() suggests that the only reason offset and
type are not cleared is that tokenizers usually overwrite them anyway; so we're
not changing the suggested behavior, and I doubt that people are really relying
on the fact that offsets and type are currently not cleared.
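
A rough sketch of that proposed Token.clear() change (assuming 2.9-era field
names; this is not the committed code):

{code}
// Hypothetical: reset *all* members, not just payload/positionIncrement/flags.
public void clear() {
  payload = null;
  positionIncrement = 1;
  flags = 0;
  termLength = 0;
  startOffset = endOffset = 0;  // newly cleared under this proposal
  type = DEFAULT_TYPE;          // newly cleared under this proposal
}
{code}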

So in summary, if we:
- change all tokenizers to call clearAttributes() first in incrementToken(),
- remove clear() from Attribute and leave it in AttributeImpl,
- change Token.clear() to reset all members and add a comment about that in 
CHANGES.txt,

then everything seems good. Or is there still a problem that I'm missing?

> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> ---
>
> Key: LUCENE-1801
> URL: https://issues.apache.org/jira/browse/LUCENE-1801
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1801.patch, LUCENE-1801.patch
>
>
> This is a followup to LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched
> to the producer here: LUCENE-1101
> I don't know if all of the Tokenizers in lucene were ever changed, but in any
> case it looks like at least some of these bugs were introduced with the
> switch to the attribute API - for example StandardTokenizer did clear its
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call
> clearAttributes first, we could do this in the indexer, but that would add
> overhead for old token streams that already clear their reusable token. This
> issue should also update the Javadocs to clearly state, inside
> Tokenizer.java, that the source TokenStream (normally the Tokenizer) should
> clear *all* Attributes. If it does not, and e.g. the positionIncrement is
> changed to 0 by some TokenFilter but the filter does not change it back to 1,
> the TokenStream would stay at 0. If the TokenFilter instead called
> PositionIncrementAttribute.clear() (because it is responsible), that could
> also break the TokenStream, because clear() is a general method for the whole
> attribute instance: if e.g. Token is used as the AttributeImpl, a call to
> clear() would also clear offsets and termLength, which is not wanted. So the
> source of the tokenization should reset the attributes to default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes should run
> fast, but it is an additional cost during tokenization, as it was not done
> consistently before; this causes a small speed degradation, but it has
> nothing to do with the new TokenStream API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1794) implement reusableTokenStream for all contrib analyzers

2009-08-12 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1794:


Attachment: LUCENE-1794.patch

Adds reusable/reset impls for shingles, snowball, and memory/synonym.
memory/synonym had no previous tests, afaik.
Tests are still needed for compound, ngram, and shingles reset().
memory/PatternAnalyzer still does not use reusableTokenStream, and there are
two wrappers, shingle/ShingleAnalyzerWrapper and
query/QueryAutoStopWordAnalyzer, that should be fixed and tested.

Unfortunately something came up at work, so I may be slow on this; if you want
to jump in, please help! Let me know what you are tackling, and I will do my
best to work this issue late at night to get it resolved.
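
For anyone jumping in, the implementations follow the usual Analyzer reuse
idiom, roughly like the sketch below (illustrative only; the actual
tokenizer/filter chain differs per analyzer, and "name" stands in for the
analyzer's configuration):

{code}
// Common reusableTokenStream pattern (sketch, not the attached patch).
private class SavedStreams {
  Tokenizer source;
  TokenStream result;
}

public TokenStream reusableTokenStream(String fieldName, Reader reader)
    throws IOException {
  SavedStreams streams = (SavedStreams) getPreviousTokenStream();
  if (streams == null) {
    // first use on this thread: build the chain and remember it
    streams = new SavedStreams();
    streams.source = new StandardTokenizer(reader);
    streams.result = new SnowballFilter(streams.source, name);
    setPreviousTokenStream(streams);
  } else {
    // subsequent use: just point the existing chain at the new Reader
    streams.source.reset(reader);
  }
  return streams.result;
}
{code}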


> implement reusableTokenStream for all contrib analyzers
> ---
>
> Key: LUCENE-1794
> URL: https://issues.apache.org/jira/browse/LUCENE-1794
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1794.patch, LUCENE-1794.patch, LUCENE-1794.patch, 
> LUCENE-1794.patch
>
>
> Most contrib analyzers do not have an impl for reusableTokenStream.
> Regardless of how expensive the back-compat reflection is for indexing speed,
> I think we should do this to mitigate any performance costs. Hey, overall it
> might even be an improvement!
> The back-compat code for non-final analyzers is already in place, so this is
> easy money in my opinion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

2009-08-12 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742649#action_12742649
 ] 

Yonik Seeley commented on LUCENE-1801:
--

bq. As they are the source of tokens, they must call clearAttributes()

Note: I had assumed that restoreState() would be enough if there was saved 
state being restored... but after checking the docs, it's not.

Makes me wonder if there could be a more efficient clearAndRestoreState(State)
that clears only the attributes not covered by the state?
A patch for another day, I suppose...
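
A naive composition of that idea might look like the sketch below
(hypothetical method; the more efficient version described above would skip
clearing attributes the saved state is about to overwrite):

{code}
// Hypothetical AttributeSource method, sketched as the simple composition.
public void clearAndRestoreState(State state) {
  clearAttributes();    // every registered attribute back to its defaults
  restoreState(state);  // then overlay the values captured in the state
}
{code}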

> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> ---
>
> Key: LUCENE-1801
> URL: https://issues.apache.org/jira/browse/LUCENE-1801
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1801.patch, LUCENE-1801.patch
>
>
> This is a followup to LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched
> to the producer here: LUCENE-1101
> I don't know if all of the Tokenizers in lucene were ever changed, but in any
> case it looks like at least some of these bugs were introduced with the
> switch to the attribute API - for example StandardTokenizer did clear its
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call
> clearAttributes first, we could do this in the indexer, but that would add
> overhead for old token streams that already clear their reusable token. This
> issue should also update the Javadocs to clearly state, inside
> Tokenizer.java, that the source TokenStream (normally the Tokenizer) should
> clear *all* Attributes. If it does not, and e.g. the positionIncrement is
> changed to 0 by some TokenFilter but the filter does not change it back to 1,
> the TokenStream would stay at 0. If the TokenFilter instead called
> PositionIncrementAttribute.clear() (because it is responsible), that could
> also break the TokenStream, because clear() is a general method for the whole
> attribute instance: if e.g. Token is used as the AttributeImpl, a call to
> clear() would also clear offsets and termLength, which is not wanted. So the
> source of the tokenization should reset the attributes to default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes should run
> fast, but it is an additional cost during tokenization, as it was not done
> consistently before; this causes a small speed degradation, but it has
> nothing to do with the new TokenStream API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1796) Speed up repeated TokenStream init

2009-08-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742644#action_12742644
 ] 

Uwe Schindler commented on LUCENE-1796:
---

I opened LUCENE-1801 for that. A patch is available and will be committed soon.

> Speed up repeated TokenStream init
> --
>
> Key: LUCENE-1796
> URL: https://issues.apache.org/jira/browse/LUCENE-1796
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Mark Miller
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: after.png, afterAndLucene1796.png, before.png, 
> LUCENE-1796.patch, LUCENE-1796.patch, LUCENE-1796.patch, LUCENE-1796.patch, 
> LUCENE-1796.patch
>
>
>  by caching isMethodOverridden results

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

2009-08-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742646#action_12742646
 ] 

Robert Muir commented on LUCENE-1801:
-

Uwe, get some rest.

I will double-check later and see. Personally, I do not like things that
behave like a Tokenizer but are TokenStreams, not Tokenizers... but that is
another issue for another day!

> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> ---
>
> Key: LUCENE-1801
> URL: https://issues.apache.org/jira/browse/LUCENE-1801
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1801.patch, LUCENE-1801.patch
>
>
> This is a followup to LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched
> to the producer here: LUCENE-1101
> I don't know if all of the Tokenizers in lucene were ever changed, but in any
> case it looks like at least some of these bugs were introduced with the
> switch to the attribute API - for example StandardTokenizer did clear its
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call
> clearAttributes first, we could do this in the indexer, but that would add
> overhead for old token streams that already clear their reusable token. This
> issue should also update the Javadocs to clearly state, inside
> Tokenizer.java, that the source TokenStream (normally the Tokenizer) should
> clear *all* Attributes. If it does not, and e.g. the positionIncrement is
> changed to 0 by some TokenFilter but the filter does not change it back to 1,
> the TokenStream would stay at 0. If the TokenFilter instead called
> PositionIncrementAttribute.clear() (because it is responsible), that could
> also break the TokenStream, because clear() is a general method for the whole
> attribute instance: if e.g. Token is used as the AttributeImpl, a call to
> clear() would also clear offsets and termLength, which is not wanted. So the
> source of the tokenization should reset the attributes to default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes should run
> fast, but it is an additional cost during tokenization, as it was not done
> consistently before; this causes a small speed degradation, but it has
> nothing to do with the new TokenStream API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

2009-08-12 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1801:
--

Attachment: (was: LUCENE-1801.patch)

> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> ---
>
> Key: LUCENE-1801
> URL: https://issues.apache.org/jira/browse/LUCENE-1801
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1801.patch, LUCENE-1801.patch
>
>
> This is a followup to LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched
> to the producer here: LUCENE-1101
> I don't know if all of the Tokenizers in lucene were ever changed, but in any
> case it looks like at least some of these bugs were introduced with the
> switch to the attribute API - for example StandardTokenizer did clear its
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call
> clearAttributes first, we could do this in the indexer, but that would add
> overhead for old token streams that already clear their reusable token. This
> issue should also update the Javadocs to clearly state, inside
> Tokenizer.java, that the source TokenStream (normally the Tokenizer) should
> clear *all* Attributes. If it does not, and e.g. the positionIncrement is
> changed to 0 by some TokenFilter but the filter does not change it back to 1,
> the TokenStream would stay at 0. If the TokenFilter instead called
> PositionIncrementAttribute.clear() (because it is responsible), that could
> also break the TokenStream, because clear() is a general method for the whole
> attribute instance: if e.g. Token is used as the AttributeImpl, a call to
> clear() would also clear offsets and termLength, which is not wanted. So the
> source of the tokenization should reset the attributes to default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes should run
> fast, but it is an additional cost during tokenization, as it was not done
> consistently before; this causes a small speed degradation, but it has
> nothing to do with the new TokenStream API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

2009-08-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742643#action_12742643
 ] 

Uwe Schindler commented on LUCENE-1801:
---

No problem, thanks for the patch. I was not aware that there were two inner
classes. But they are not Tokenizers, they are TokenStreams. As they are the
source of tokens, they must call clearAttributes(); you are right, thanks!

If you find another one, please post a patch again; maybe I forgot more of
them. I will go to bed now.

Uwe

> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> ---
>
> Key: LUCENE-1801
> URL: https://issues.apache.org/jira/browse/LUCENE-1801
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1801.patch, LUCENE-1801.patch
>
>
> This is a followup to LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched
> to the producer here: LUCENE-1101
> I don't know if all of the Tokenizers in lucene were ever changed, but in any
> case it looks like at least some of these bugs were introduced with the
> switch to the attribute API - for example StandardTokenizer did clear its
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call
> clearAttributes first, we could do this in the indexer, but that would add
> overhead for old token streams that already clear their reusable token. This
> issue should also update the Javadocs to clearly state, inside
> Tokenizer.java, that the source TokenStream (normally the Tokenizer) should
> clear *all* Attributes. If it does not, and e.g. the positionIncrement is
> changed to 0 by some TokenFilter but the filter does not change it back to 1,
> the TokenStream would stay at 0. If the TokenFilter instead called
> PositionIncrementAttribute.clear() (because it is responsible), that could
> also break the TokenStream, because clear() is a general method for the whole
> attribute instance: if e.g. Token is used as the AttributeImpl, a call to
> clear() would also clear offsets and termLength, which is not wanted. So the
> source of the tokenization should reset the attributes to default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes should run
> fast, but it is an additional cost during tokenization, as it was not done
> consistently before; this causes a small speed degradation, but it has
> nothing to do with the new TokenStream API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

2009-08-12 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1801:


Attachment: LUCENE-1801.patch

Sorry for the bad encoding issue!

> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> ---
>
> Key: LUCENE-1801
> URL: https://issues.apache.org/jira/browse/LUCENE-1801
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1801.patch, LUCENE-1801.patch, LUCENE-1801.patch
>
>
> This is a followup to LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched
> to the producer here: LUCENE-1101
> I don't know if all of the Tokenizers in lucene were ever changed, but in any
> case it looks like at least some of these bugs were introduced with the
> switch to the attribute API - for example StandardTokenizer did clear its
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call
> clearAttributes first, we could do this in the indexer, but that would add
> overhead for old token streams that already clear their reusable token. This
> issue should also update the Javadocs to clearly state, inside
> Tokenizer.java, that the source TokenStream (normally the Tokenizer) should
> clear *all* Attributes. If it does not, and e.g. the positionIncrement is
> changed to 0 by some TokenFilter but the filter does not change it back to 1,
> the TokenStream would stay at 0. If the TokenFilter instead called
> PositionIncrementAttribute.clear() (because it is responsible), that could
> also break the TokenStream, because clear() is a general method for the whole
> attribute instance: if e.g. Token is used as the AttributeImpl, a call to
> clear() would also clear offsets and termLength, which is not wanted. So the
> source of the tokenization should reset the attributes to default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes should run
> fast, but it is an additional cost during tokenization, as it was not done
> consistently before; this causes a small speed degradation, but it has
> nothing to do with the new TokenStream API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

2009-08-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742638#action_12742638
 ] 

Robert Muir commented on LUCENE-1801:
-

Uwe, sorry, I see there is an encoding problem with my patch file... I will
supply another.

> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> ---
>
> Key: LUCENE-1801
> URL: https://issues.apache.org/jira/browse/LUCENE-1801
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1801.patch, LUCENE-1801.patch
>
>
> This is a followup to LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched
> to the producer here: LUCENE-1101
> I don't know if all of the Tokenizers in lucene were ever changed, but in any
> case it looks like at least some of these bugs were introduced with the
> switch to the attribute API - for example StandardTokenizer did clear its
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call
> clearAttributes first, we could do this in the indexer, but that would add
> overhead for old token streams that already clear their reusable token. This
> issue should also update the Javadocs to clearly state, inside
> Tokenizer.java, that the source TokenStream (normally the Tokenizer) should
> clear *all* Attributes. If it does not, and e.g. the positionIncrement is
> changed to 0 by some TokenFilter but the filter does not change it back to 1,
> the TokenStream would stay at 0. If the TokenFilter instead called
> PositionIncrementAttribute.clear() (because it is responsible), that could
> also break the TokenStream, because clear() is a general method for the whole
> attribute instance: if e.g. Token is used as the AttributeImpl, a call to
> clear() would also clear offsets and termLength, which is not wanted. So the
> source of the tokenization should reset the attributes to default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes should run
> fast, but it is an additional cost during tokenization, as it was not done
> consistently before; this causes a small speed degradation, but it has
> nothing to do with the new TokenStream API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

2009-08-12 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1801:


Attachment: LUCENE-1801.patch

With clearAttributes() for the secret and super-secret tokenizers inside
memory/PatternAnalyzer.

> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> ---
>
> Key: LUCENE-1801
> URL: https://issues.apache.org/jira/browse/LUCENE-1801
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1801.patch, LUCENE-1801.patch
>
>
> This is a followup to LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched
> to the producer here: LUCENE-1101
> I don't know if all of the Tokenizers in lucene were ever changed, but in any
> case it looks like at least some of these bugs were introduced with the
> switch to the attribute API - for example StandardTokenizer did clear its
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call
> clearAttributes first, we could do this in the indexer, but that would add
> overhead for old token streams that already clear their reusable token. This
> issue should also update the Javadocs to clearly state, inside
> Tokenizer.java, that the source TokenStream (normally the Tokenizer) should
> clear *all* Attributes. If it does not, and e.g. the positionIncrement is
> changed to 0 by some TokenFilter but the filter does not change it back to 1,
> the TokenStream would stay at 0. If the TokenFilter instead called
> PositionIncrementAttribute.clear() (because it is responsible), that could
> also break the TokenStream, because clear() is a general method for the whole
> attribute instance: if e.g. Token is used as the AttributeImpl, a call to
> clear() would also clear offsets and termLength, which is not wanted. So the
> source of the tokenization should reset the attributes to default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes should run
> fast, but it is an additional cost during tokenization, as it was not done
> consistently before; this causes a small speed degradation, but it has
> nothing to do with the new TokenStream API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1791) Enhance QueryUtils and CheckHIts to wrap everything they check in MultiReader/MultiSearcher

2009-08-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742634#action_12742634
 ] 

Mark Miller commented on LUCENE-1791:
-

I only get the NaN issue showing up now.

I don't know why it's returning NaN as the score at the moment, but as far as
the test is concerned, that particular failure appears to be a false positive.
It's just checking that two calls to score() return the same thing - but if it
returns NaN, they are not considered equal.

So I guess the question is - does it make sense that the score is NaN?
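
For context, the equality check trips on NaN because of IEEE 754 semantics in
Java:

{code}
// NaN is never equal to itself, so "score == score" is false for NaN.
float score = Float.NaN;
System.out.println(score == score);      // prints: false
System.out.println(Float.isNaN(score));  // prints: true (the robust check)
{code}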

> Enhance QueryUtils and CheckHIts to wrap everything they check in 
> MultiReader/MultiSearcher
> ---
>
> Key: LUCENE-1791
> URL: https://issues.apache.org/jira/browse/LUCENE-1791
> Project: Lucene - Java
>  Issue Type: Test
>Reporter: Hoss Man
> Fix For: 2.9
>
> Attachments: LUCENE-1791.patch, LUCENE-1791.patch
>
>
> Methods in CheckHits & QueryUtils are in a good position to take any Searcher
> they are given and not only test it, but also test MultiReader &
> MultiSearcher constructs built around them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

2009-08-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742626#action_12742626
 ] 

Robert Muir commented on LUCENE-1801:
-

Uwe, I can supply an updated patch on top of yours if you want, since I am
already staring at it!

In order to support reuse, it will need to be changed a little bit, but for
now we can simply resolve the clearAttributes issue.


> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> ---
>
> Key: LUCENE-1801
> URL: https://issues.apache.org/jira/browse/LUCENE-1801
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1801.patch
>
>
> This is a followup to LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched
> to the producer here: LUCENE-1101
> I don't know if all of the Tokenizers in lucene were ever changed, but in any
> case it looks like at least some of these bugs were introduced with the
> switch to the attribute API - for example StandardTokenizer did clear its
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call
> clearAttributes first, we could do this in the indexer, but that would add
> overhead for old token streams that already clear their reusable token. This
> issue should also update the Javadocs to clearly state, inside
> Tokenizer.java, that the source TokenStream (normally the Tokenizer) should
> clear *all* Attributes. If it does not, and e.g. the positionIncrement is
> changed to 0 by some TokenFilter but the filter does not change it back to 1,
> the TokenStream would stay at 0. If the TokenFilter instead called
> PositionIncrementAttribute.clear() (because it is responsible), that could
> also break the TokenStream, because clear() is a general method for the whole
> attribute instance: if e.g. Token is used as the AttributeImpl, a call to
> clear() would also clear offsets and termLength, which is not wanted. So the
> source of the tokenization should reset the attributes to default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes should run
> fast, but it is an additional cost during tokenization, as it was not done
> consistently before; this causes a small speed degradation, but it has
> nothing to do with the new TokenStream API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

2009-08-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742620#action_12742620
 ] 

Uwe Schindler commented on LUCENE-1801:
---

Thanks Robert!

Can you look into this special "Tokenizer" for correct "initialization" 
according to Yonik's comments?

> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> ---
>
> Key: LUCENE-1801
> URL: https://issues.apache.org/jira/browse/LUCENE-1801
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1801.patch
>
>
> This is a followup to LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched
> to the producer here: LUCENE-1101
> I don't know if all of the Tokenizers in lucene were ever changed, but in any
> case it looks like at least some of these bugs were introduced with the
> switch to the attribute API - for example StandardTokenizer did clear its
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call
> clearAttributes first, we could do this in the indexer, but that would add
> overhead for old token streams that already clear their reusable token. This
> issue should also update the Javadocs to clearly state, inside
> Tokenizer.java, that the source TokenStream (normally the Tokenizer) should
> clear *all* Attributes. If it does not, and e.g. the positionIncrement is
> changed to 0 by some TokenFilter but the filter does not change it back to 1,
> the TokenStream would stay at 0. If the TokenFilter instead called
> PositionIncrementAttribute.clear() (because it is responsible), that could
> also break the TokenStream, because clear() is a general method for the whole
> attribute instance: if e.g. Token is used as the AttributeImpl, a call to
> clear() would also clear offsets and termLength, which is not wanted. So the
> source of the tokenization should reset the attributes to default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes should run
> fast, but it is an additional cost during tokenization, as it was not done
> consistently before; this causes a small speed degradation, but it has
> nothing to do with the new TokenStream API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

2009-08-12 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742618#action_12742618
 ] 

Michael Busch commented on LUCENE-1801:
---

Sorry, Uwe. I'm in meetings.

I'll look into this tonight!

> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> ---
>
> Key: LUCENE-1801
> URL: https://issues.apache.org/jira/browse/LUCENE-1801
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1801.patch
>
>
> This is a followup to LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched
> to the producer here: LUCENE-1101
> I don't know if all of the Tokenizers in lucene were ever changed, but in any
> case it looks like at least some of these bugs were introduced with the
> switch to the attribute API - for example StandardTokenizer did clear its
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call
> clearAttributes first, we could do this in the indexer, but that would add
> overhead for old token streams that already clear their reusable token. This
> issue should also update the Javadocs to clearly state, inside
> Tokenizer.java, that the source TokenStream (normally the Tokenizer) should
> clear *all* Attributes. If it does not, and e.g. the positionIncrement is
> changed to 0 by some TokenFilter but the filter does not change it back to 1,
> the TokenStream would stay at 0. If the TokenFilter instead called
> PositionIncrementAttribute.clear() (because it is responsible), that could
> also break the TokenStream, because clear() is a general method for the whole
> attribute instance: if e.g. Token is used as the AttributeImpl, a call to
> clear() would also clear offsets and termLength, which is not wanted. So the
> source of the tokenization should reset the attributes to default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes should run
> fast, but it is an additional cost during tokenization, as it was not done
> consistently before; this causes a small speed degradation, but it has
> nothing to do with the new TokenStream API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

2009-08-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742616#action_12742616
 ] 

Robert Muir commented on LUCENE-1801:
-

Uwe, there is also a tokenizer in contrib/memory inside PatternAnalyzer.

I only mention this because I am trying to hunt down all
analyzers/tokenstreams to check reuse/reset at the moment :)

> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> ---
>
> Key: LUCENE-1801
> URL: https://issues.apache.org/jira/browse/LUCENE-1801
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1801.patch
>
>
> This is a followup to LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched
> to the producer here: LUCENE-1101
> I don't know if all of the Tokenizers in lucene were ever changed, but in any
> case it looks like at least some of these bugs were introduced with the
> switch to the attribute API - for example StandardTokenizer did clear its
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call
> clearAttributes first, we could do this in the indexer, but that would add
> overhead for old token streams that already clear their reusable token. This
> issue should also update the Javadocs to clearly state, inside
> Tokenizer.java, that the source TokenStream (normally the Tokenizer) should
> clear *all* Attributes. If it does not, and e.g. the positionIncrement is
> changed to 0 by some TokenFilter but the filter does not change it back to 1,
> the TokenStream would stay at 0. If the TokenFilter instead called
> PositionIncrementAttribute.clear() (because it is responsible), that could
> also break the TokenStream, because clear() is a general method for the whole
> attribute instance: if e.g. Token is used as the AttributeImpl, a call to
> clear() would also clear offsets and termLength, which is not wanted. So the
> source of the tokenization should reset the attributes to default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes should run
> fast, but it is an additional cost during tokenization, as it was not done
> consistently before; this causes a small speed degradation, but it has
> nothing to do with the new TokenStream API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1791) Enhance QueryUtils and CheckHIts to wrap everything they check in MultiReader/MultiSearcher

2009-08-12 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742614#action_12742614
 ] 

Hoss Man commented on LUCENE-1791:
--

FYI: with Mark's updated patch, we're back to just the NaN failures from
TestComplexExplanations (and possibly TestBoolean2.testRandomQueries, but I
can't confirm that).

> Enhance QueryUtils and CheckHIts to wrap everything they check in 
> MultiReader/MultiSearcher
> ---
>
> Key: LUCENE-1791
> URL: https://issues.apache.org/jira/browse/LUCENE-1791
> Project: Lucene - Java
>  Issue Type: Test
>Reporter: Hoss Man
> Fix For: 2.9
>
> Attachments: LUCENE-1791.patch, LUCENE-1791.patch
>
>
> Methods in CheckHits & QueryUtils are in a good position to take any Searcher
> they are given and not only test it, but also test MultiReader &
> MultiSearcher constructs built around them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

2009-08-12 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1801:
--

Attachment: LUCENE-1801.patch

Attached is a patch that implements clearAttributes() in all Tokenizers and
other sources of Tokens. It also removes clear() from the basic Attribute
interface.

I will commit in a day or two if nobody objects.
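
The pattern the patch establishes looks roughly like this (illustrative
sketch; the attribute fields and the readNextToken() helper are placeholders,
not code from the patch):

{code}
public boolean incrementToken() throws IOException {
  clearAttributes();  // first: reset ALL attributes to their defaults
  if (!readNextToken()) {
    return false;     // end of stream
  }
  termAtt.setTermBuffer(buffer, 0, length);  // then set only what applies
  offsetAtt.setOffset(start, end);
  return true;
}
{code}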

> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> ---
>
> Key: LUCENE-1801
> URL: https://issues.apache.org/jira/browse/LUCENE-1801
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1801.patch
>
>
> This is a followup to LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched
> to the producer here: LUCENE-1101
> I don't know if all of the Tokenizers in lucene were ever changed, but in any
> case it looks like at least some of these bugs were introduced with the
> switch to the attribute API - for example StandardTokenizer did clear its
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call
> clearAttributes first, we could do this in the indexer, but that would add
> overhead for old token streams that already clear their reusable token. This
> issue should also update the Javadocs to clearly state, inside
> Tokenizer.java, that the source TokenStream (normally the Tokenizer) should
> clear *all* Attributes. If it does not, and e.g. the positionIncrement is
> changed to 0 by some TokenFilter but the filter does not change it back to 1,
> the TokenStream would stay at 0. If the TokenFilter instead called
> PositionIncrementAttribute.clear() (because it is responsible), that could
> also break the TokenStream, because clear() is a general method for the whole
> attribute instance: if e.g. Token is used as the AttributeImpl, a call to
> clear() would also clear offsets and termLength, which is not wanted. So the
> source of the tokenization should reset the attributes to default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes should run
> fast, but it is an additional cost during tokenization, as it was not done
> consistently before; this causes a small speed degradation, but it has
> nothing to do with the new TokenStream API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1791) Enhance QueryUtils and CheckHIts to wrap everything they check in MultiReader/MultiSearcher

2009-08-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742608#action_12742608
 ] 

Mark Miller commented on LUCENE-1791:
-

Okay - so first, the original Parser issue:

The output looked odd - it showed multiple entries with the same type, field,
and reader, but with both null and the default parser.

When I quickly switched the code to use the default parser instead of null,
those extra entries went away, and it just showed the SegmentReader and
DirectoryReader that were doubled up. That's what had me thinking the parser
was involved. I'm not sure I understand why that was happening now, though.

Anyway, all of the issues appear to be because the test code was written
expecting all of the readers to be top level.

The tests, in certain cases, use a reader to grab from a FieldCache - that
reader has to be the right subreader and not the top-level reader.

The tests were just using getSequentialSubReaders - they need to use
gatherSubReaders instead, because you introduced the multi-level reader stuff.
They should have used gatherSubReaders from the start, but because it wasn't
needed at the time (for the tests to pass), it didn't even occur to me.
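
In code, the change amounts to something like this (sketch using the 2.9-era
ReaderUtil; "topLevelReader" is a placeholder):

{code}
// getSequentialSubReaders() only returns the immediate children;
// ReaderUtil.gatherSubReaders walks the reader tree recursively.
List subReaders = new ArrayList();
ReaderUtil.gatherSubReaders(subReaders, topLevelReader);
{code}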

> Enhance QueryUtils and CheckHIts to wrap everything they check in 
> MultiReader/MultiSearcher
> ---
>
> Key: LUCENE-1791
> URL: https://issues.apache.org/jira/browse/LUCENE-1791
> Project: Lucene - Java
>  Issue Type: Test
>Reporter: Hoss Man
> Fix For: 2.9
>
> Attachments: LUCENE-1791.patch, LUCENE-1791.patch
>
>
> Methods in CheckHits & QueryUtils are in a good position to take any Searcher
> they are given and not only test it, but also test MultiReader &
> MultiSearcher constructs built around them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1791) Enhance QueryUtils and CheckHIts to wrap everything they check in MultiReader/MultiSearcher

2009-08-12 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742606#action_12742606
 ] 

Hoss Man commented on LUCENE-1791:
--

Midair collision (x2) ... I think I see what you mean in your revised patch ...
the tests don't need to be changed; it's just the test utility methods that
were trying to recurse into the readers that weren't doing the entire job.

> Enhance QueryUtils and CheckHIts to wrap everything they check in 
> MultiReader/MultiSearcher
> ---
>
> Key: LUCENE-1791
> URL: https://issues.apache.org/jira/browse/LUCENE-1791
> Project: Lucene - Java
>  Issue Type: Test
>Reporter: Hoss Man
> Fix For: 2.9
>
> Attachments: LUCENE-1791.patch, LUCENE-1791.patch
>
>
> Methods in CheckHits & QueryUtils are in a good position to take any Searcher
> they are given and not only test it, but also test MultiReader &
> MultiSearcher constructs built around them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1791) Enhance QueryUtils and CheckHIts to wrap everything they check in MultiReader/MultiSearcher

2009-08-12 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742603#action_12742603
 ] 

Hoss Man commented on LUCENE-1791:
--

{quote}
Well that explains half the output anyway - even if that's fixed there is still
a failure. It's because the tests don't expand fully into the subreaders - they
just needed the top level before - with this test, we need to recursively grab
them.
{quote}
You lost me there... are you saying the _tests_ need to be changed? ... Why?

For this patch to trigger an error in an existing test, that test must either
be using CheckHits or QueryUtils to execute a query against a searcher and
validate that the results are ok ... why would the test be responsible for any
subreader expansion in this case?

> Enhance QueryUtils and CheckHIts to wrap everything they check in 
> MultiReader/MultiSearcher
> ---
>
> Key: LUCENE-1791
> URL: https://issues.apache.org/jira/browse/LUCENE-1791
> Project: Lucene - Java
>  Issue Type: Test
>Reporter: Hoss Man
> Fix For: 2.9
>
> Attachments: LUCENE-1791.patch, LUCENE-1791.patch
>
>
> Methods in CheckHits & QueryUtils are in a good position to take any Searcher
> they are given and not only test it, but also test MultiReader &
> MultiSearcher constructs built around them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1791) Enhance QueryUtils and CheckHIts to wrap everything they check in MultiReader/MultiSearcher

2009-08-12 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1791:


Attachment: LUCENE-1791.patch

Just fully recursing into all of the subreaders makes the test pass, I believe.

> Enhance QueryUtils and CheckHIts to wrap everything they check in 
> MultiReader/MultiSearcher
> ---
>
> Key: LUCENE-1791
> URL: https://issues.apache.org/jira/browse/LUCENE-1791
> Project: Lucene - Java
>  Issue Type: Test
>Reporter: Hoss Man
> Fix For: 2.9
>
> Attachments: LUCENE-1791.patch, LUCENE-1791.patch
>
>
> Methods in CheckHits & QueryUtils are in a good position to take any Searcher
> they are given and not only test it, but also test MultiReader &
> MultiSearcher constructs built around them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1791) Enhance QueryUtils and CheckHIts to wrap everything they check in MultiReader/MultiSearcher

2009-08-12 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742600#action_12742600
 ] 

Hoss Man commented on LUCENE-1791:
--

bq. I'm guessing the NaN failures are not a problem - looks like they fail
because NaN != NaN?

Right -- but why would the scores be NaN when wrapped in a MultiReader? When
it's *not* wrapped in a MultiReader the test passes, so the scores must not be
NaN in that case.

bq. I don't think the FieldCache insanity is multi-reader related [...] same
stuff now, doubled entry.

The sanity checker ignores cases where two CacheEntries differ only by parser
(precisely because of the null/default parser issue) and the resulting value
object is the same, but it does include all related CacheEntry objects in an
Insanity object so that you have them all for debugging.

Looking at TestCustomScoreQuery.testCustomScoreByte (for example)...

{code}
*** BEGIN 
org.apache.lucene.search.function.TestCustomScoreQuery.testCustomScoreByte: 
Insane FieldCache usage(s) ***
SUBREADER: Found caches for decendents of 
org.apache.lucene.index.directoryrea...@88d2ae+iii

'org.apache.lucene.index.directoryrea...@88d2ae'=>'iii',byte,null=>[B#841343 
(size =~ 33 bytes)

'org.apache.lucene.index.directoryrea...@88d2ae'=>'iii',byte,org.apache.lucene.search.FieldCache.DEFAULT_BYTE_PARSER=>[B#841343
 (size =~ 33 bytes)

'org.apache.lucene.index.compoundfilereader$csindexin...@77daaa'=>'iii',byte,org.apache.lucene.search.FieldCache.DEFAULT_BYTE_PARSER=>[B#981898
 (size =~ 33 bytes)

'org.apache.lucene.index.compoundfilereader$csindexin...@77daaa'=>'iii',byte,null=>[B#981898
 (size =~ 33 bytes)

*** END 
org.apache.lucene.search.function.TestCustomScoreQuery.testCustomScoreByte: 
Insane FieldCache usage(s) ***

{code}

The insanity type is "SUBREADER", so it has specifically identified a problem
with that type of relationship. There are 4 CacheEntries listed in the error,
all for the same field, but from two different readers. If you note the value
identity hashcodes (just before the size estimate), each reader has only one
value cached for that field (with different parsers), which is why there isn't
a separate error about the multiple values.
As the first line of the Insanity.toString() states: what it found is that
directoryrea...@88d2ae and at least one of its descendants both have cached
entries for the same field.
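
A sketch of how such SUBREADER entries arise (hypothetical; "directoryReader"
and "segmentReader" are placeholders for a reader and one of its descendants):

{code}
// Caching the same field through both the top-level reader and a
// subreader produces two independent FieldCache entries.
byte[] viaTop = FieldCache.DEFAULT.getBytes(directoryReader, "iii");
byte[] viaSub = FieldCache.DEFAULT.getBytes(segmentReader, "iii");
{code}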



> Enhance QueryUtils and CheckHIts to wrap everything they check in 
> MultiReader/MultiSearcher
> ---
>
> Key: LUCENE-1791
> URL: https://issues.apache.org/jira/browse/LUCENE-1791
> Project: Lucene - Java
>  Issue Type: Test
>Reporter: Hoss Man
> Fix For: 2.9
>
> Attachments: LUCENE-1791.patch
>
>
> Methods in CheckHits & QueryUtils are in a good position to take any Searcher
> they are given and not only test it, but also test MultiReader &
> MultiSearcher constructs built around them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1791) Enhance QueryUtils and CheckHIts to wrap everything they check in MultiReader/MultiSearcher

2009-08-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742598#action_12742598
 ] 

Mark Miller commented on LUCENE-1791:
-

{quote}I don't think the FieldCache insanity is multi-reader related - it looks
to me like some entries have a parser and some have null for the parser, even
though the default parser is being used in both cases. The FieldSource types
grab a FieldCache and may pass null as the parser, which ends up putting null
in the cache entry - but if you specifically ask for the default parser, that
puts the default parser in the FieldCache entry - same stuff now, doubled
entry.{quote}

Well that explains half the output anyway - even if that's fixed there is still 
a fail. It's because the test doesn't expand fully into the subreaders - it just 
needed the top level before - with this test, we need to recursively grab them.
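
Something like the following recursive gather is what the test needs (a sketch; 
the helper name is made up):

{code}
// collect all leaf readers under r, however deeply nested
static void gatherSubReaders(List leaves, IndexReader r) {
  IndexReader[] subs = r.getSequentialSubReaders();
  if (subs == null) {
    leaves.add(r);                       // r is itself a leaf
  } else {
    for (int i = 0; i < subs.length; i++) {
      gatherSubReaders(leaves, subs[i]); // recurse, don't stop at one level
    }
  }
}
{code}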


> Enhance QueryUtils and CheckHIts to wrap everything they check in 
> MultiReader/MultiSearcher
> ---
>
> Key: LUCENE-1791
> URL: https://issues.apache.org/jira/browse/LUCENE-1791
> Project: Lucene - Java
>  Issue Type: Test
>Reporter: Hoss Man
> Fix For: 2.9
>
> Attachments: LUCENE-1791.patch
>
>
> methods in CheckHits & QueryUtils are in a good position to take any Searcher 
> they are given and not only test it, but also test MultiReader & 
> MultiSearcher constructs built around them

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1791) Enhance QueryUtils and CheckHIts to wrap everything they check in MultiReader/MultiSearcher

2009-08-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742586#action_12742586
 ] 

Mark Miller edited comment on LUCENE-1791 at 8/12/09 2:35 PM:
--

I'm guessing the NaN failures are not a problem - looks like they fail because 
NaN != NaN? Haven't looked closer.

I don't think the fieldcache insanity is multi-reader related - it looks to me 
like some entries have a parser, and some null for the parser, even though the 
default parser is being used in both cases. The FieldSource types grab a 
FieldCache and may pass null as the parser, which ends up putting null in the 
cache entry - but if you specifically ask for the default parser, that puts the 
default parser in the fieldcache entry - same stuff now, doubled entry.

As for the out of bounds - haven't looked at that one yet - odd one ...

... interesting - it alternates between null pointer and out of bounds 
exceptions ...

  was (Author: markrmil...@gmail.com):
I'm guessing the NaN failures are not a problem - looks like they fail because 
NaN != NaN? Haven't looked closer.

I don't think the fieldcache insanity is multi-reader related - it looks to me 
like some entries have a parser, and some null for the parser, even though the 
default parser is being used in both cases. The FieldSource types grab a 
FieldCache and may pass null as the parser, which ends up putting null in the 
cache entry - but if you specifically ask for the default parser, that puts the 
default parser in the fieldcache entry - same stuff now, doubled entry.

As for the out of bounds - haven't looked at that one yet - odd one ...
  
> Enhance QueryUtils and CheckHIts to wrap everything they check in 
> MultiReader/MultiSearcher
> ---
>
> Key: LUCENE-1791
> URL: https://issues.apache.org/jira/browse/LUCENE-1791
> Project: Lucene - Java
>  Issue Type: Test
>Reporter: Hoss Man
> Fix For: 2.9
>
> Attachments: LUCENE-1791.patch
>
>
> methods in CheckHits & QueryUtils are in a good position to take any Searcher 
> they are given and not only test it, but also test MultiReader & 
> MultiSearcher constructs built around them

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1791) Enhance QueryUtils and CheckHIts to wrap everything they check in MultiReader/MultiSearcher

2009-08-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742586#action_12742586
 ] 

Mark Miller commented on LUCENE-1791:
-

I'm guessing the NaN failures are not a problem - looks like they fail because 
NaN != NaN? Haven't looked closer.

I don't think the fieldcache insanity is multi-reader related - it looks to me 
like some entries have a parser, and some null for the parser, even though the 
default parser is being used in both cases. The FieldSource types grab a 
FieldCache and may pass null as the parser, which ends up putting null in the 
cache entry - but if you specifically ask for the default parser, that puts the 
default parser in the fieldcache entry - same stuff now, doubled entry.

As for the out of bounds - haven't looked at that one yet - odd one ...

> Enhance QueryUtils and CheckHIts to wrap everything they check in 
> MultiReader/MultiSearcher
> ---
>
> Key: LUCENE-1791
> URL: https://issues.apache.org/jira/browse/LUCENE-1791
> Project: Lucene - Java
>  Issue Type: Test
>Reporter: Hoss Man
> Fix For: 2.9
>
> Attachments: LUCENE-1791.patch
>
>
> methods in CheckHits & QueryUtils are in a good position to take any Searcher 
> they are given and not only test it, but also test MultiReader & 
> MultiSearcher constructs built around them

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

2009-08-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742574#action_12742574
 ] 

Uwe Schindler commented on LUCENE-1801:
---

Any comments here? I will be unavailable until the weekend, so please make some 
suggestions, especially about the clear() problem. The first part with 
clearAttributes is easy, it's just adding some code; the second one is 
refactoring.
Michael B., what do you think?
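
For reference, the pattern under discussion looks roughly like this inside a 
Tokenizer (a sketch; advanceToNextToken() and the local fields are 
placeholders):

{code}
public final boolean incrementToken() throws IOException {
  clearAttributes();                 // reset *all* attributes to defaults
  if (!advanceToNextToken()) {       // tokenizer-specific scanning
    return false;
  }
  termAtt.setTermBuffer(buffer, start, length);
  offsetAtt.setOffset(start, start + length);
  return true;
}
{code}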

> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> ---
>
> Key: LUCENE-1801
> URL: https://issues.apache.org/jira/browse/LUCENE-1801
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
>
> This is a followup for LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched 
> to the producer here: LUCENE-1101 
> I don't know if all of the Tokenizers in lucene were ever changed, but in any 
> case it looks like at least some of these bugs were introduced with the 
> switch to the attribute API - for example StandardTokenizer did clear its 
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call 
> clearAttributes first, we could do this in the indexer, which would be an 
> overhead for old token streams that themselves clear their reusable token. 
> This issue should also update the Javadocs, to clearly state inside 
> Tokenizer.java that the source TokenStream (normally the Tokenizer) should 
> clear *all* Attributes. If it does not do so and e.g. the positionIncrement 
> is changed to 0 by any TokenFilter, but the filter does not change it back to 
> 1, the TokenStream would stay at 0. If the TokenFilter were to call 
> PositionIncrementAttribute.clear() (because it is responsible), it could also 
> break the TokenStream, because clear() is a general method for the whole 
> attribute instance. If e.g. Token is used as AttributeImpl, a call to clear() 
> would also clear offsets and termLength, which is not wanted. So the source 
> of the tokenization should reset the attributes to default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes should run 
> fast, but it is an additional cost during tokenization. As it was not done 
> consistently before, a small speed degradation is caused by this, but it has 
> nothing to do with the new TokenStream API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-08-12 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller resolved LUCENE-1748.
-

Resolution: Fixed

thanks for taking a look Mike!

> getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
> --
>
> Key: LUCENE-1748
> URL: https://issues.apache.org/jira/browse/LUCENE-1748
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Query/Scoring
>Affects Versions: 2.4, 2.4.1
> Environment: all
>Reporter: Hugh Cayless
>Assignee: Mark Miller
> Fix For: 2.9
>
> Attachments: LUCENE-1748.patch, LUCENE-1748.patch, LUCENE-1748.patch
>
>
> I just spent a long time tracking down a bug resulting from upgrading to 
> Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was 
> written against 2.3.  Since the project's SpanQuerys didn't implement 
> getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans 
> which returned null and caused a NullPointerException in the Lucene code, far 
> away from the actual source of the problem.  
> It would be much better for this kind of thing to show up at compile time, I 
> think.
> Thanks!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1791) Enhance QueryUtils and CheckHIts to wrap everything they check in MultiReader/MultiSearcher

2009-08-12 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742552#action_12742552
 ] 

Hoss Man commented on LUCENE-1791:
--

bq. (grr it uses it's own random so no seed was logged)

Correction: it does log the seed, I was just looking at stderr when I should 
have been looking at stdout...

{code}
failed query: +field:w2 field:w3 field:xx field:w4 field:w2
NOTE: random seed of testcase 'testRandomQueries' was: 5695251427490718890
{code}

> Enhance QueryUtils and CheckHIts to wrap everything they check in 
> MultiReader/MultiSearcher
> ---
>
> Key: LUCENE-1791
> URL: https://issues.apache.org/jira/browse/LUCENE-1791
> Project: Lucene - Java
>  Issue Type: Test
>Reporter: Hoss Man
> Fix For: 2.9
>
> Attachments: LUCENE-1791.patch
>
>
> methods in CheckHits & QueryUtils are in a good position to take any Searcher 
> they are given and not only test it, but also test MultiReader & 
> MultiSearcher constructs built around them

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1791) Enhance QueryUtils and CheckHIts to wrap everything they check in MultiReader/MultiSearcher

2009-08-12 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-1791:
-

Fix Version/s: 2.9

I just retried this patch against the trunk now that the 
FieldCacheSanityChecker and some other patches have been committed.  In 
addition to the possibly false negatives from TestComplexExplanation (NaN 
score), this is now surfacing FieldCache sanity failures from 
TestCustomScoreQuery, TestFieldScoreQuery, and TestOrdValues (suggesting that 
there are code paths where those query types don't correctly use the subreaders 
to get the FieldCache), as well as checkFirstSkipTo() failures for 
TestSpansAdvanced2 and an ArrayIndexOutOfBoundsException from 
TestBoolean2.testRandomQueries (grr, it uses its own random so no seed was 
logged).

I don't pretend this patch is perfect, but I can't imagine these are all 
false negatives.

We should get to the bottom of this before 2.9.  I'll start trying to figure it 
out on the train tonight.

> Enhance QueryUtils and CheckHIts to wrap everything they check in 
> MultiReader/MultiSearcher
> ---
>
> Key: LUCENE-1791
> URL: https://issues.apache.org/jira/browse/LUCENE-1791
> Project: Lucene - Java
>  Issue Type: Test
>Reporter: Hoss Man
> Fix For: 2.9
>
> Attachments: LUCENE-1791.patch
>
>
> methods in CheckHits & QueryUtils are in a good position to take any Searcher 
> they are given and not only test it, but also test MultiReader & 
> MultiSearcher constructs built around them

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1749) FieldCache introspection API

2009-08-12 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man resolved LUCENE-1749.
--

Resolution: Fixed
  Assignee: Hoss Man

Committed revision 803676.


> FieldCache introspection API
> 
>
> Key: LUCENE-1749
> URL: https://issues.apache.org/jira/browse/LUCENE-1749
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Minor
> Fix For: 2.9
>
> Attachments: fieldcache-introspection.patch, 
> LUCENE-1749-hossfork.patch, LUCENE-1749.patch, LUCENE-1749.patch, 
> LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, 
> LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, 
> LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, 
> LUCENE-1749.patch, LUCENE-1749.patch
>
>
> FieldCache should expose an Expert level API for runtime introspection of the 
> FieldCache to provide info about what is in the FieldCache at any given 
> moment.  We should also provide utility methods for sanity checking that the 
> FieldCache doesn't contain anything "odd"...
>* entries for the same reader/field with different types/parsers
>* entries for the same field/type/parser in a reader and its subreader(s)
>* etc...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1749) FieldCache introspection API

2009-08-12 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-1749:
-

Attachment: LUCENE-1749.patch

One last update: the Locale.US asserts in TestRemoteSort had the same problem 
as TestSort; they were supposed to be moved, but instead they were just copied 
(not sure how I missed that before).

> FieldCache introspection API
> 
>
> Key: LUCENE-1749
> URL: https://issues.apache.org/jira/browse/LUCENE-1749
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Hoss Man
>Priority: Minor
> Fix For: 2.9
>
> Attachments: fieldcache-introspection.patch, 
> LUCENE-1749-hossfork.patch, LUCENE-1749.patch, LUCENE-1749.patch, 
> LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, 
> LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, 
> LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, 
> LUCENE-1749.patch, LUCENE-1749.patch
>
>
> FieldCache should expose an Expert level API for runtime introspection of the 
> FieldCache to provide info about what is in the FieldCache at any given 
> moment.  We should also provide utility methods for sanity checking that the 
> FieldCache doesn't contain anything "odd"...
>* entries for the same reader/field with different types/parsers
>* entries for the same field/type/parser in a reader and its subreader(s)
>* etc...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1802) Un-deprecate QueryParser and remove documentation that says it will be replaced in 3.0

2009-08-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1802:
---

Attachment: LUCENE-1802.patch

Attached patch.  I think it's ready to commit... I'll wait a day or so.

> Un-deprecate QueryParser and remove documentation that says it will be 
> replaced in 3.0
> --
>
> Key: LUCENE-1802
> URL: https://issues.apache.org/jira/browse/LUCENE-1802
> Project: Lucene - Java
>  Issue Type: Task
>  Components: QueryParser
>Reporter: Mark Miller
>Assignee: Michael McCandless
>Priority: Blocker
> Fix For: 2.9
>
> Attachments: LUCENE-1802.patch
>
>
> This looks like the consensus move at first blush. We can (of course) 
> re-evaluate if things change.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1804) Can't specify AttributeSource for Tokenizer

2009-08-12 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved LUCENE-1804.
--

   Resolution: Fixed
Fix Version/s: 2.9

Committed.

I'm not sure it's worth adding constructors for all combinations of 
parameters, especially when the trend is toward reuse and specifying the reader 
separately - but I think that can be a different issue (whether to remove some 
of the existing constructors or not).

> Can't specify AttributeSource for Tokenizer
> ---
>
> Key: LUCENE-1804
> URL: https://issues.apache.org/jira/browse/LUCENE-1804
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Yonik Seeley
> Fix For: 2.9
>
> Attachments: LUCENE-1804.patch
>
>
> One can't currently specify the attribute source for a Tokenizer like one can 
> with any other TokenStream.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1800) QueryParser should use reusable token streams

2009-08-12 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved LUCENE-1800.
--

Resolution: Fixed

Committed.
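
For reference, the change essentially swaps tokenStream() for 
reusableTokenStream() when analyzing query text (a sketch; the actual patch is 
authoritative):

{code}
// instead of: TokenStream source = analyzer.tokenStream(field,
//                 new StringReader(queryText));
TokenStream source = analyzer.reusableTokenStream(field,
    new StringReader(queryText));  // reuses a per-thread stream when possible
{code}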

> QueryParser should use reusable token streams
> -
>
> Key: LUCENE-1800
> URL: https://issues.apache.org/jira/browse/LUCENE-1800
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
> Fix For: 2.9
>
> Attachments: LUCENE-1800.patch
>
>
> Just like indexing, the query parser should use reusable token streams

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-1802) Un-deprecate QueryParser and remove documentation that says it will be replaced in 3.0

2009-08-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1802:
--

Assignee: Michael McCandless

> Un-deprecate QueryParser and remove documentation that says it will be 
> replaced in 3.0
> --
>
> Key: LUCENE-1802
> URL: https://issues.apache.org/jira/browse/LUCENE-1802
> Project: Lucene - Java
>  Issue Type: Task
>  Components: QueryParser
>Reporter: Mark Miller
>Assignee: Michael McCandless
>Priority: Blocker
> Fix For: 2.9
>
>
> This looks like the consensus move at first blush. We can (of course) 
> re-evaluate if things change.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-08-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742514#action_12742514
 ] 

Michael McCandless commented on LUCENE-1748:


Patch looks good... just need to fix back-compat tests.

> getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
> --
>
> Key: LUCENE-1748
> URL: https://issues.apache.org/jira/browse/LUCENE-1748
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Query/Scoring
>Affects Versions: 2.4, 2.4.1
> Environment: all
>Reporter: Hugh Cayless
>Assignee: Mark Miller
> Fix For: 2.9
>
> Attachments: LUCENE-1748.patch, LUCENE-1748.patch, LUCENE-1748.patch
>
>
> I just spent a long time tracking down a bug resulting from upgrading to 
> Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was 
> written against 2.3.  Since the project's SpanQuerys didn't implement 
> getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans 
> which returned null and caused a NullPointerException in the Lucene code, far 
> away from the actual source of the problem.  
> It would be much better for this kind of thing to show up at compile time, I 
> think.
> Thanks!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1800) QueryParser should use reusable token streams

2009-08-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742511#action_12742511
 ] 

Michael McCandless commented on LUCENE-1800:


Patch looks good!

> QueryParser should use reusable token streams
> -
>
> Key: LUCENE-1800
> URL: https://issues.apache.org/jira/browse/LUCENE-1800
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
> Fix For: 2.9
>
> Attachments: LUCENE-1800.patch
>
>
> Just like indexing, the query parser should use reusable token streams

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1789) getDocValues should provide a MultiReader DocValues abstraction

2009-08-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1789:
---

Attachment: LUCENE-1789.patch

Attached patch.
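
The core idea is just to dispatch by doc id; conceptually something like this 
(a sketch only - the attached patch is authoritative and may differ):

{code}
// sketch: composite DocValues over the sub-readers' DocValues
class MultiDocValues extends DocValues {
  private final DocValues[] sub;   // one per sub-reader, in order
  private final int[] starts;      // doc id base of each sub-reader

  MultiDocValues(DocValues[] sub, int[] starts) {
    this.sub = sub;
    this.starts = starts;
  }

  private int idx(int doc) {       // linear scan; binary search in practice
    int i = 0;
    while (i + 1 < starts.length && doc >= starts[i + 1]) i++;
    return i;
  }

  public float floatVal(int doc) {
    int i = idx(doc);
    return sub[i].floatVal(doc - starts[i]);
  }

  public String toString(int doc) {
    int i = idx(doc);
    return sub[i].toString(doc - starts[i]);
  }
}
{code}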

> getDocValues should provide a MultiReader DocValues abstraction
> ---
>
> Key: LUCENE-1789
> URL: https://issues.apache.org/jira/browse/LUCENE-1789
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Hoss Man
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1789.patch
>
>
> When scoring a ValueSourceQuery, the scoring code calls 
> ValueSource.getValues(reader) on *each* leaf-level subreader -- so DocValues 
> instances are backed by the individual FieldCache entries of the subreaders 
> -- but if client code were to inadvertently call getValues() on a 
> MultiReader (or DirectoryReader) it would wind up using the "outer" 
> FieldCache.
> Since getValues(IndexReader) returns DocValues, we have an advantage here 
> that we don't have with the FieldCache API (which is required to provide 
> direct array access). getValues(IndexReader) could be implemented so that 
> *IF* a caller inadvertently passes in a reader with non-null subReaders, 
> getValues could generate a DocValues instance for each of the subReaders, 
> and then wrap them in a composite "MultiDocValues".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1749) FieldCache introspection API

2009-08-12 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-1749:
-

Attachment: LUCENE-1749.patch

Updated patch to trunk (QueryWeight->Weight) and tweaked some FieldCacheImpl 
methods to use the non-deprecated Entry constructors (forgot that part before).

I'll commit as soon as my test run is finished.

> FieldCache introspection API
> 
>
> Key: LUCENE-1749
> URL: https://issues.apache.org/jira/browse/LUCENE-1749
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Hoss Man
>Priority: Minor
> Fix For: 2.9
>
> Attachments: fieldcache-introspection.patch, 
> LUCENE-1749-hossfork.patch, LUCENE-1749.patch, LUCENE-1749.patch, 
> LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, 
> LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, 
> LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch, 
> LUCENE-1749.patch
>
>
> FieldCache should expose an Expert level API for runtime introspection of the 
> FieldCache to provide info about what is in the FieldCache at any given 
> moment.  We should also provide utility methods for sanity checking that the 
> FieldCache doesn't contain anything "odd"...
>* entries for the same reader/field with different types/parsers
>* entries for the same field/type/parser in a reader and its subreader(s)
>* etc...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-1789) getDocValues should provide a MultiReader DocValues abstraction

2009-08-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1789:
--

Assignee: Michael McCandless

> getDocValues should provide a MultiReader DocValues abstraction
> ---
>
> Key: LUCENE-1789
> URL: https://issues.apache.org/jira/browse/LUCENE-1789
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Hoss Man
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
>
> When scoring a ValueSourceQuery, the scoring code calls 
> ValueSource.getValues(reader) on *each* leaf-level subreader -- so DocValues 
> instances are backed by the individual FieldCache entries of the subreaders 
> -- but if client code were to inadvertently call getValues() on a 
> MultiReader (or DirectoryReader) it would wind up using the "outer" 
> FieldCache.
> Since getValues(IndexReader) returns DocValues, we have an advantage here 
> that we don't have with the FieldCache API (which is required to provide 
> direct array access). getValues(IndexReader) could be implemented so that 
> *IF* a caller inadvertently passes in a reader with non-null subReaders, 
> getValues could generate a DocValues instance for each of the subReaders, 
> and then wrap them in a composite "MultiDocValues".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1789) getDocValues should provide a MultiReader DocValues abstraction

2009-08-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742467#action_12742467
 ] 

Michael McCandless commented on LUCENE-1789:


OK, I'll take a crack at this!

> getDocValues should provide a MultiReader DocValues abstraction
> ---
>
> Key: LUCENE-1789
> URL: https://issues.apache.org/jira/browse/LUCENE-1789
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Hoss Man
>Priority: Minor
> Fix For: 2.9
>
>
> When scoring a ValueSourceQuery, the scoring code calls 
> ValueSource.getValues(reader) on *each* leaf-level subreader -- so DocValues 
> instances are backed by the individual FieldCache entries of the subreaders 
> -- but if client code were to inadvertently call getValues() on a 
> MultiReader (or DirectoryReader) it would wind up using the "outer" 
> FieldCache.
> Since getValues(IndexReader) returns DocValues, we have an advantage here 
> that we don't have with the FieldCache API (which is required to provide 
> direct array access). getValues(IndexReader) could be implemented so that 
> *IF* a caller inadvertently passes in a reader with non-null subReaders, 
> getValues could generate a DocValues instance for each of the subReaders, 
> and then wrap them in a composite "MultiDocValues".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1768) NumericRange support for new query parser

2009-08-12 Thread Adriano Crestani (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742458#action_12742458
 ] 

Adriano Crestani commented on LUCENE-1768:
--

{quote}
I would propose to absorb the RangeTools/Utils and DateTools/Utils (what is the 
correct name???) in one configuration class 
{quote}

+1 this way is easier for the user to config 

{quote}
I was thinking about that, too. But here the API clearly defines that 
getRangeQuery() returns a Query object without further specification. So the 
change was correct from the API/BW side. The change that another object is 
returned is documented in CHANGES.txt (as far as I know). We have here the same 
problem: you change the inner class implementations, but the abstract 
QueryParser's API is stable. The general contract when doing such things is 
that you use instanceof checks before you try to cast some abstract return 
type to something specific that is not documented.
{quote}

Agreed, I also think it's fine as long as it's documented
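
For context, the kind of query the parser ultimately has to build for input 
like "(1.787..19.5]" is (a sketch; the field name is made up):

{code}
// exclusive lower bound, inclusive upper bound, double-typed field
Query q = NumericRangeQuery.newDoubleRange("price",
    Double.valueOf("1.787"), Double.valueOf("19.5"),
    false /* minInclusive */, true /* maxInclusive */);
{code}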

> NumericRange support for new query parser
> -
>
> Key: LUCENE-1768
> URL: https://issues.apache.org/jira/browse/LUCENE-1768
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: QueryParser
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
>
> It would be good to specify some type of "schema" for the query parser in 
> the future, to automatically create NumericRangeQuery for different numeric 
> types. It would then be possible to index a numeric value 
> (double, float, long, int) using NumericField; the query parser would then 
> know which type of field this is and so correctly create a NumericRangeQuery 
> for strings like "[1.567..*]" or "(1.787..19.5]".
> There is currently no way to extract from the index whether a field is 
> numeric, so the user will have to configure the FieldConfig objects in the 
> ConfigHandler. But if this is done, it will not be that difficult to 
> implement the rest.
> The only difference from the current handling of RangeQuery is then the 
> instantiation of the correct Query type and the conversion of the entered 
> numeric values (a simple Number.valueOf(...) conversion of the user-entered 
> numbers). Everything else is identical; NumericRangeQuery also supports the 
> MTQ rewrite modes (as it is a MTQ).
> Another thing is a change in Date semantics. There are some strange flags in 
> the current parser that tell it how to handle dates.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1789) getDocValues should provide a MultiReader DocValues abstraction

2009-08-12 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742460#action_12742460
 ] 

Hoss Man commented on LUCENE-1789:
--

Cool... I don't suppose you have time to work on a patch? 

(what's the emoticon for fingers crossed?)

> getDocValues should provide a MultiReader DocValues abstraction
> ---
>
> Key: LUCENE-1789
> URL: https://issues.apache.org/jira/browse/LUCENE-1789
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Hoss Man
>Priority: Minor
> Fix For: 2.9
>
>
> When scoring a ValueSourceQuery, the scoring code calls 
> ValueSource.getValues(reader) on *each* leaf-level subreader -- so DocValues 
> instances are backed by the individual FieldCache entries of the subreaders 
> -- but if client code were to inadvertently call getValues() on a 
> MultiReader (or DirectoryReader) it would wind up using the "outer" 
> FieldCache.
> Since getValues(IndexReader) returns DocValues, we have an advantage here 
> that we don't have with the FieldCache API (which is required to provide 
> direct array access). getValues(IndexReader) could be implemented so that 
> *IF* a caller inadvertently passes in a reader with non-null subReaders, 
> getValues could generate a DocValues instance for each of the subReaders, 
> and then wrap them in a composite "MultiDocValues".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-08-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742444#action_12742444
 ] 

Michael McCandless commented on LUCENE-1458:


Thanks for modernizing the patch Michael!  I'll get back to this one soon... 
I'd really love to get PForDelta working as a codec.  It's a great test case 
since it's block-based, ie, very different from the other codecs.

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1804) Can't specify AttributeSource for Tokenizer

2009-08-12 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742441#action_12742441
 ] 

Yonik Seeley commented on LUCENE-1804:
--

OO design principle of not removing functionality - Tokenizer's superclass can 
specify its AttributeSource... why can't Tokenizer?  We shouldn't disallow it 
just because we can't immediately think of a use case.

bq. I am still not sure, why a simple TokenFilter does not serve the same 
pupose you would like to have with Tokenizer here.

Simplest case: a Tokenizer that delegates to an existing Tokenizer or 
TokenStream?
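
E.g. something like this (a sketch, assuming the Tokenizer(AttributeSource) 
ctor this patch adds):

{code}
// a Tokenizer that wraps another stream; passing the delegate as the
// AttributeSource means both sides share the same attribute instances
class DelegatingTokenizer extends Tokenizer {
  private final TokenStream delegate;

  DelegatingTokenizer(TokenStream delegate) {
    super(delegate);                    // share the delegate's attributes
    this.delegate = delegate;
  }

  public boolean incrementToken() throws IOException {
    return delegate.incrementToken();   // attributes are already shared
  }
}
{code}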

> Can't specify AttributeSource for Tokenizer
> ---
>
> Key: LUCENE-1804
> URL: https://issues.apache.org/jira/browse/LUCENE-1804
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Yonik Seeley
> Attachments: LUCENE-1804.patch
>
>
> One can't currently specify the attribute source for a Tokenizer like one can 
> with any other TokenStream.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1804) Can't specify AttributeSource for Tokenizer

2009-08-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742439#action_12742439
 ] 

Uwe Schindler commented on LUCENE-1804:
---

Normally it would be ok. E.g. for the reuse of TokenStreams, the simplest 
approach would be to create the tokenizer with a null Reader first and only 
reset(Reader) it before first use. I think this has historical reasons, and to 
keep consistent we should add the ctors. Or deprecate all Reader ctors and 
state that you should create a reusable Tokenizer and call reset(Reader).

I am still not sure why a simple TokenFilter does not serve the same purpose 
you would like to have with Tokenizer here. Why not simply wrap the Tokenizer 
with a TokenFilter that already has the ability to delegate? If it is 
because you miss the reset(Reader) call, we could think about adding this to 
TokenFilter, passing it through to the delegated Tokenizer (using instanceof 
checks).
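
The reuse pattern mentioned above looks like this (a sketch; texts is a 
placeholder for the inputs):

{code}
// construct once (dummy Reader), then re-point at each new input
Tokenizer tok = new WhitespaceTokenizer(new StringReader(""));
for (int i = 0; i < texts.length; i++) {
  tok.reset(new StringReader(texts[i]));
  while (tok.incrementToken()) {
    // ... consume attributes ...
  }
  tok.end();
}
{code}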

> Can't specify AttributeSource for Tokenizer
> ---
>
> Key: LUCENE-1804
> URL: https://issues.apache.org/jira/browse/LUCENE-1804
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Yonik Seeley
> Attachments: LUCENE-1804.patch
>
>
> One can't currently specify the attribute source for a Tokenizer like one can 
> with any other TokenStream.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1804) Can't specify AttributeSource for Tokenizer

2009-08-12 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742388#action_12742388
 ] 

Yonik Seeley commented on LUCENE-1804:
--

bq. But for completeness, this ctor should also get the Reader/CharStream (as 
all other ctors have the Reader param).

Wouldn't tokenizer.reset(reader) serve the same purpose?  I'm not sure why all 
those different constructors are there.

> Can't specify AttributeSource for Tokenizer
> ---
>
> Key: LUCENE-1804
> URL: https://issues.apache.org/jira/browse/LUCENE-1804
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Yonik Seeley
> Attachments: LUCENE-1804.patch
>
>
> One can't currently specify the attribute source for a Tokenizer like one can 
> with any other TokenStream.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1804) Can't specify AttributeSource for Tokenizer

2009-08-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742385#action_12742385
 ] 

Uwe Schindler commented on LUCENE-1804:
---

OK, I was wondering, because TokenFilter is there for this purpose and 
TokenStream only provides the AttributeSource ctor because the TokenFilter 
subclass needs it. So one could also simply create a TokenFilter and put it 
on top of the Tokenizer to wrap? new TokenFilter(new WrappedTokenizer()) - why 
have a Tokenizer for that when TokenFilter is made for it?

But for completeness, this ctor should also get the Reader/CharStream (as all 
other ctors have the Reader param).

> Can't specify AttributeSource for Tokenizer
> ---
>
> Key: LUCENE-1804
> URL: https://issues.apache.org/jira/browse/LUCENE-1804
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Yonik Seeley
> Attachments: LUCENE-1804.patch
>
>
> One can't currently specify the attribute source for a Tokenizer like one can 
> with any other TokenStream.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: SpanQuery and Spans optimizations

2009-08-12 Thread Grant Ingersoll


On Aug 12, 2009, at 5:58 AM, Michael McCandless wrote:


I think being able to ask the Scorer for matching spans for the
current doc makes tons of sense.

I think eventually span queries should be absorbed into the normal
lucene queries.  EG, if TermQuery creates a scorer that's able to
optionally enumerate matching spans, such that there's no performance
loss if you don't actually request the spans, then we don't need
SpanTermQuery.


I recall Michael B. having done some work that would help along these 
lines, but by definition the SpanQueries visit the Spans since that is 
how they access the position information.


One thing that is needed is some concrete evidence comparing the  
performance.  I think people have the idea that they are slower, but  
it isn't clear whether this is true or not, and if it is true, it  
isn't clear how much slower they are.


-Grant 


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1804) Can't specify AttributeSource for Tokenizer

2009-08-12 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742354#action_12742354
 ] 

Yonik Seeley commented on LUCENE-1804:
--

It makes delegation possible.  Say one wanted to create a new Tokenizer by 
wrapping an existing Tokenizer or TokenStream.

> Can't specify AttributeSource for Tokenizer
> ---
>
> Key: LUCENE-1804
> URL: https://issues.apache.org/jira/browse/LUCENE-1804
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Yonik Seeley
> Attachments: LUCENE-1804.patch
>
>
> One can't currently specify the attribute source for a Tokenizer like one can 
> with any other TokenStream.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1804) Can't specify AttributeSource for Tokenizer

2009-08-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742352#action_12742352
 ] 

Uwe Schindler commented on LUCENE-1804:
---

Why do you need this?

> Can't specify AttributeSource for Tokenizer
> ---
>
> Key: LUCENE-1804
> URL: https://issues.apache.org/jira/browse/LUCENE-1804
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Yonik Seeley
> Attachments: LUCENE-1804.patch
>
>
> One can't currently specify the attribute source for a Tokenizer like one can 
> with any other TokenStream.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-08-12 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1748:


Attachment: LUCENE-1748.patch

makes Spans abstract

> getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
> --
>
> Key: LUCENE-1748
> URL: https://issues.apache.org/jira/browse/LUCENE-1748
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Query/Scoring
>Affects Versions: 2.4, 2.4.1
> Environment: all
>Reporter: Hugh Cayless
>Assignee: Mark Miller
> Fix For: 2.9
>
> Attachments: LUCENE-1748.patch, LUCENE-1748.patch, LUCENE-1748.patch
>
>
> I just spent a long time tracking down a bug resulting from upgrading to 
> Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was 
> written against 2.3.  Since the project's SpanQuerys didn't implement 
> getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans 
> which returned null and caused a NullPointerException in the Lucene code, far 
> away from the actual source of the problem.  
> It would be much better for this kind of thing to show up at compile time, I 
> think.
> Thanks!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1804) Can't specify AttributeSource for Tokenizer

2009-08-12 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated LUCENE-1804:
-

Attachment: LUCENE-1804.patch

> Can't specify AttributeSource for Tokenizer
> ---
>
> Key: LUCENE-1804
> URL: https://issues.apache.org/jira/browse/LUCENE-1804
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Yonik Seeley
> Attachments: LUCENE-1804.patch
>
>
> One can't currently specify the attribute source for a Tokenizer like one can 
> with any other TokenStream.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1804) Can't specify AttributeSource for Tokenizer

2009-08-12 Thread Yonik Seeley (JIRA)
Can't specify AttributeSource for Tokenizer
---

 Key: LUCENE-1804
 URL: https://issues.apache.org/jira/browse/LUCENE-1804
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Yonik Seeley


One can't currently specify the attribute source for a Tokenizer like one can 
with any other TokenStream.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1800) QueryParser should use reusable token streams

2009-08-12 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated LUCENE-1800:
-

Attachment: LUCENE-1800.patch

> QueryParser should use reusable token streams
> -
>
> Key: LUCENE-1800
> URL: https://issues.apache.org/jira/browse/LUCENE-1800
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
> Fix For: 2.9
>
> Attachments: LUCENE-1800.patch
>
>
> Just like indexing, the query parser should use reusable token streams

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1768) NumericRange support for new query parser

2009-08-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742308#action_12742308
 ] 

Michael McCandless commented on LUCENE-1768:


bq. I would propose to absorb the RangeTools/Utils and DateTools/Utils (what is 
the correct name???) in one configuration class 

+1

bq. However, there is a solution for this kind of back-compat problem (which I 
don't think it is).

Actually, on reading your explanation I agree it's not really a back compat 
break, since the user's custom builder for RangeQueryNode would still be 
invoked, and the core's builder for NumericRangeQuery would handle the newly 
added numeric range support.  I think this is reasonable.

> NumericRange support for new query parser
> -
>
> Key: LUCENE-1768
> URL: https://issues.apache.org/jira/browse/LUCENE-1768
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: QueryParser
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
>
> It would be good to specify some type of "schema" for the query parser in 
> the future, to automatically create NumericRangeQuery for different numeric 
> types. It would then be possible to index a numeric value 
> (double, float, long, int) using NumericField; the query parser would then 
> know which type of field this is and so correctly create a NumericRangeQuery 
> for strings like "[1.567..*]" or "(1.787..19.5]".
> There is currently no way to extract from the index whether a field is 
> numeric, so the user will have to configure the FieldConfig objects in the 
> ConfigHandler. But if this is done, it will not be that difficult to 
> implement the rest.
> The only difference from the current handling of RangeQuery is then the 
> instantiation of the correct Query type and the conversion of the entered 
> numeric values (a simple Number.valueOf(...) conversion of the user-entered 
> numbers). Everything else is identical; NumericRangeQuery also supports the 
> MTQ rewrite modes (as it is a MTQ).
> Another thing is a change in Date semantics. There are some strange flags in 
> the current parser that tell it how to handle dates.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Reopened: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-08-12 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller reopened LUCENE-1748:
-


> getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
> --
>
> Key: LUCENE-1748
> URL: https://issues.apache.org/jira/browse/LUCENE-1748
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Query/Scoring
>Affects Versions: 2.4, 2.4.1
> Environment: all
>Reporter: Hugh Cayless
>Assignee: Mark Miller
> Fix For: 2.9
>
> Attachments: LUCENE-1748.patch, LUCENE-1748.patch
>
>
> I just spent a long time tracking down a bug resulting from upgrading to 
> Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was 
> written against 2.3.  Since the project's SpanQuerys didn't implement 
> getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans 
> which returned null and caused a NullPointerException in the Lucene code, far 
> away from the actual source of the problem.  
> It would be much better for this kind of thing to show up at compile time, I 
> think.
> Thanks!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1796) Speed up repeated TokenStream init

2009-08-12 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742293#action_12742293
 ] 

Yonik Seeley commented on LUCENE-1796:
--

bq. But in principle we could also change the indexer to call clear before each 
incrementToken(), removing the need to do it in every Tokenizer.

Doron brought up a good reason for not doing that in LUCENE-1101.
A tokenizer (or other token producer) could produce multiple tokens before one 
made it to the ultimate consumer (because of stop filters, etc).  So it looks 
like producers should do the clear.
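
For illustration, a minimal 2.9-style producer following that rule (the
single-token "tokenization" is a stand-in, not a real tokenizer; the
attribute plumbing is the real API):

{code}
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class SingleTokenTokenizer extends Tokenizer {
  private final TermAttribute termAtt =
      (TermAttribute) addAttribute(TermAttribute.class);
  private boolean done = false;

  public SingleTokenTokenizer(Reader input) {
    super(input);
  }

  public boolean incrementToken() throws IOException {
    if (done) return false;
    // the producer clears *all* attributes first, so consumers never
    // see stale state left over from the previous token
    clearAttributes();
    termAtt.setTermBuffer("token"); // stand-in for real tokenization
    done = true;
    return true;
  }
}
{code}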

> Speed up repeated TokenStream init
> --
>
> Key: LUCENE-1796
> URL: https://issues.apache.org/jira/browse/LUCENE-1796
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Mark Miller
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: after.png, afterAndLucene1796.png, before.png, 
> LUCENE-1796.patch, LUCENE-1796.patch, LUCENE-1796.patch, LUCENE-1796.patch, 
> LUCENE-1796.patch
>
>
>  by caching isMethodOverridden results
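
The names below are assumptions, not the committed patch; this only
sketches the caching idea: one reflection check per class instead of one
per TokenStream instance.

{code}
import java.lang.reflect.Method;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import org.apache.lucene.analysis.TokenStream;

public final class OverrideCheck {
  // WeakHashMap so cached Class keys don't pin classloaders
  private static final Map CACHE =
      Collections.synchronizedMap(new WeakHashMap());

  /** True if clazz overrides TokenStream.incrementToken(). */
  public static boolean overridesIncrementToken(Class clazz) {
    Boolean cached = (Boolean) CACHE.get(clazz);
    if (cached != null) return cached.booleanValue();
    boolean overridden;
    try {
      Method m = clazz.getMethod("incrementToken", new Class[0]);
      overridden = m.getDeclaringClass() != TokenStream.class;
    } catch (NoSuchMethodException e) {
      overridden = false;
    }
    CACHE.put(clazz, Boolean.valueOf(overridden));
    return overridden;
  }
}
{code}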

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: The new Contrib QueryParser should not be slated to replace the old one yet

2009-08-12 Thread Mark Miller
Hey Shai - I'm not saying if new syntaxes come let's use them. Sorry if it 
came off that way - I'm basically saying - let's see it get used - let's 
see if the things that it offers are taken advantage of. A new syntax is 
not a plus to me necessarily (though it is nice) - personally, I just 
want a solid core syntax for Lucene, and I think the rest is gravy. But 
because the new QP is billed as easy to develop new syntaxes for (in the 
package.html I think), I'm just saying, let's see how the thing turns 
out. I don't mean to really pinpoint any one thing in that regard - we 
should just let it breathe, and then take stock again.


Basically I just think we should give it a little time. I only wrote out 
so much, and tried to come up with points like that, because an early 
short comment went ignored ;)


- Mark

Shai Erera wrote:

Mark,

I support not deprecating the current QP.

But I just wanted to comment on "let's wait 'till people add more 
syntaxes". I don't think that that's the issue here. The new QP is 
indeed useful for plugging in different search syntaxes, but I 
personally don't believe that in an application more than one search 
syntax is used. If there are such, then I'd think their number is very 
small. And, I agree w/ you - two different syntaxes are not that 
likely to be able to reuse the same Query tree etc.


However, the new QP, AFAIU, allows one to extend the Lucene syntax 
more easily. And if some extension to Lucene's syntax is useful, why 
contribute it as a contrib module and not augment the default QP?


So just contributing a new query syntax as a contrib module doesn't 
mean the new QP should be used. In fact, I wrote a QP for a different 
syntax than Lucene's, and I didn't use the new QP as base and it works 
just great. In fact, my QP is quite simple, and does not involve 
building a query tree, using builders etc.


And in general I think, writing your own QP for your own query syntax 
is a super advanced thing, which only few do. So this QP will benefit 
the minority of Lucene users / developers, IMO.


So I'm not sure that waiting for users to contribute more syntaxes is 
what we need in order to decide whether this QP should replace the old 
one. We're more likely to see users experiencing problems w/ it (just 
because it's new and hasn't been used in the field much yet) in the 
near future.


This QP currently looks like an OOD exercise. If there will be more 
syntaxes contributed, then it wins. Otherwise, it's just a rewrite of 
the old QP, and we need to be sure that the rewrite is worth it.


Shai

On Wed, Aug 12, 2009 at 1:03 AM, Michael McCandless 
<luc...@mikemccandless.com> wrote:


+1

Mike

> On Tue, Aug 11, 2009 at 5:43 PM, Michael Busch <busch...@gmail.com> wrote:
> I agree we should not remove the old one in 3.0. That's way too
early.
> If we change the bw-policy we can replace it maybe in 3.1.
>
> On 8/11/09 11:40 AM, Uwe Schindler wrote:
>>
>> Yes, we should not deprecate the old one!
>>
>> -
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de 
>>
>>
>>>
>>> -Original Message-
>>> From: Grant Ingersoll [mailto:gsing...@apache.org]
>>> Sent: Tuesday, August 11, 2009 8:32 PM
>>> To: java-dev@lucene.apache.org 
>>> Subject: Re: The new Contrib QueryParser should not be slated
to replace
>>> the old one yet
>>>
>>> +1, old QP should not be deprecated.  Since the new one is in
contrib,
>>> it should just be stated that it doesn't necessarily have the same
>>> back compat. issues as core, either that or it is marked as
>>> experimental.
>>>
>>> -Grant
>>>
>>> On Aug 11, 2009, at 1:54 PM, Mark Miller wrote:
>>>
>>>

 I don't think we should stick with the current path of
replacing the
 current QueryParser with the new contrib QueryParser in
Lucene 3.0.

 The new QueryParser has not been used much at all yet. Its
 interfaces (which will need to abide by back compat in core) have
 not been vetted enough.

 The new parser appears to add complication to some of things that
 were very simple with the old parser.

 The main benefits of the new parser are claimed to be the
ability to
 plug and play many syntaxes and QueryBuilders. This is not an end
 user benefit though and I'm not even sure how much of a
benefit it
 is to us. There is currently only one impl. It seems to me,
once you
 start another impl, its a long shot that the exact same query
tree
 representation is going to work with a completely different
syntax.
 Sure, if you are just doing postfix rather than prefix, it
w

Re: The new Contrib QueryParser should not be slated to replace the old one yet

2009-08-12 Thread Mark Miller

Michael Busch wrote:


We should also realize that - thanks to Luis and Adriano - we now have 
actual code that can be the basis of discussions and that we can take 
and improve. No matter if this new QP is going to replace the old one 
or not, I'm very thankful that the two went through the effort of 
creating it. This framework has been very successful internally and we 
wanted to share something good with the Lucene community.


 Michael

Agreed! I'd also like to extend my thanks to Luis and Adriano! And to 
IBM for donating the code! I am certainly not looking a gift horse in 
the mouth.


And I think it's still very likely this parser will replace the old. 
Despite my rant to not deprecate the current QP yet, I do think it's a 
nice design, and I do think it has a lot of value going forward. I just 
think it's a big enough deal that we should let it sit for a release in 
contrib while everyone has a chance to take stock of it. If you look at 
the Qsol parser I used to play around with, it also has an intermediate 
abstract query tree (it's just way uglier and less pluggable - don't go 
look ;) ) - I think that makes a lot of sense, and I do know that it 
will bring benefits to many developers in the future.


I just think it makes sense to wait a bit and see how things shake out. 
I think that will give us more freedom in terms of addressing the 
shortcomings we/users may find.


Giving users such a large framework to extend is a back compat 
nightmare! Let's just see how things go for a bit before we start really 
locking in back compat on the thing (which we should presumably do if it 
were to become the new QP).


--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: SpanQuery and BoostingTermQuery oddities

2009-08-12 Thread Michael McCandless
All Span*Query seem to rely on the SpanQuery.createWeight (which
returns SpanWeight/SpanScorer) to make their weight/scorer.
SpanScorer in turn simply enumerates all spans summing up their
"sloppy freq" and always scoring with that, regardless of the sub
queries.

So SpanNearQuery (or any composite span query, e.g. even SpanFirstQuery
I think will do this) disregards the scores of its child query/ies.

I agree it's odd... it seems like composite span queries ought to take
their child query scoring into account.  This would be a benefit of
merging into the normal Query*, since these composite queries already
factor in scoring from their sub queries.
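
A paraphrased sketch of the summing described above (not the literal
Lucene source): only span lengths feed the frequency, so nothing from the
sub queries' weights or boosts can enter the score.

{code}
import java.io.IOException;
import org.apache.lucene.search.Similarity;
import org.apache.lucene.search.spans.Spans;

public class SpanFreq {
  /**
   * Sums sloppyFreq over all spans on the document the enumerator is
   * currently positioned on. Nothing from the sub queries' boosts
   * enters the sum.
   */
  public static float freqForCurrentDoc(Spans spans, Similarity sim)
      throws IOException {
    final int doc = spans.doc();
    float freq = 0.0f;
    boolean more = true;
    while (more && spans.doc() == doc) {
      freq += sim.sloppyFreq(spans.end() - spans.start());
      more = spans.next();
    }
    return freq;
  }
}
{code}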

Mike

On Wed, Aug 5, 2009 at 11:01 AM, Mark Miller wrote:
> Grant Ingersoll wrote:
>>
>> On Aug 5, 2009, at 10:07 AM, Mark Miller wrote:
>>>
>>> Yeah - SpanQuery's don't use the boosts from subspans - it just uses the
>>> idf for the query terms and the span length I believe - and the boost for
>>> the top level Query.
>>>
>>> Is that the right way to go? I guess Doug seemed to think so? I don't
>>> know. It is sort of a bug that lower boosts would be ignored right? There is
>>> an issue for it somewhere.
>>>
>>> It gets complicated quick to change it - all of a sudden you need
>>> something like BooleanQuery ...
>>>
>>
>> Not sure it needs BooleanQuery, but it does seem like it should take into
>> account the scores of the subclauses (regardless of BoostingTermQuery).
>>  There is a spot in creating the SpanScorer where it gets the value from the
>> QueryWeight, but this QueryWeight does not account for the subclauses
>> QueryWeights.
>>
>
> It doesn't need BooleanQuery - it needs BooleanQuery type logic - which is
> fairly complicated. At least to do it right I think. I don't have a clear
> memory of it, but I started to try and address this once and ...
> well I didn't continue.
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

2009-08-12 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-1458:
--

Attachment: LUCENE-1458.patch

I took Mike's latest patch and updated it to current trunk.
It applies cleanly and compiles fine.

Some test cases fail. The problem is in SegmentReader in termsIndexIsLoaded() 
and loadTermsIndex(). I'll take a look tomorrow, I need to understand the 
latest changes we made in the different IndexReaders better (and now it's 
getting quite late here...)

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (e.g. call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions (a hedged usage sketch follows
> after this list).  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.
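
Since these enums are still in flux, the following sketch mocks the whole
chain from the names above; every type and signature is an assumption for
illustration, not settled API.

{code}
import java.io.IOException;

// Every type below is a hypothetical mock of the in-progress flex API,
// reconstructed only from the chain named in the description above.
interface FieldProducer { TermsEnum terms() throws IOException; }
interface TermsEnum {
  String next() throws IOException;              // null when exhausted
  DocsEnum docs() throws IOException;
}
interface DocsEnum {
  int NO_MORE_DOCS = Integer.MAX_VALUE;
  int next() throws IOException;                 // next docID
  int freq();                                    // positions in current doc
  PostingsEnum positions() throws IOException;
}
interface PostingsEnum { int nextPosition() throws IOException; }

class FlexWalk {
  static void walk(FieldProducer fields) throws IOException {
    TermsEnum terms = fields.terms();
    while (terms.next() != null) {               // each term in the field
      DocsEnum docs = terms.docs();
      while (docs.next() != DocsEnum.NO_MORE_DOCS) {
        PostingsEnum positions = docs.positions();
        for (int i = 0; i < docs.freq(); i++) {
          int position = positions.nextPosition();
        }
      }
    }
  }
}
{code}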

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: SpanQuery and Spans optimizations

2009-08-12 Thread Michael McCandless
I think being able to ask the Scorer for matching spans for the
current doc makes tons of sense.

I think eventually span queries should be absorbed into the normal
lucene queries.  EG, if TermQuery creates a scorer that's able to
optionally enumerate matching spans, such that there's no performance
loss if you don't actually request the spans, then we don't need
SpanTermQuery.

And once all Lucene queries can optionally provide their spans, then
highlighter becomes simpler since it can just ask the query's scorer
for the matching spans.
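
A purely hypothetical sketch of that idea; no such interface exists in
Lucene today:

{code}
import java.io.IOException;
import org.apache.lucene.search.spans.Spans;

// Hypothetical: a scorer advertising this could lazily expose the spans
// for its current document, costing nothing when never called.
public interface SpansProvidingScorer {
  /** Matching spans for the document the scorer is positioned on. */
  Spans spans() throws IOException;
}
{code}

A highlighter could then check any scorer with instanceof and ask for the
spans, with no per-query special casing.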

Mike

On Sat, Aug 8, 2009 at 4:10 AM, Shai Erera wrote:
> That would work. Though your custom TopSpansCollector should be able to
> handle other Scorers as well. And you can store the payloads in yet another
> custom ScoreDoc - is that what you had in mind?
>
> Shai
>
> On Sat, Aug 8, 2009 at 3:06 AM, Grant Ingersoll  wrote:
>>
>> On Aug 6, 2009, at 5:09 PM, Grant Ingersoll wrote:
>>
>>>
>>> On Aug 6, 2009, at 5:06 PM, Shai Erera wrote:
>>>
 Only w/ ScoreDocs we reuse the same instance. So I guess we'd like to do
 the same here.

 Seems like providing a TopSpansCollector is what you want, only unlike
 TopFieldCollector which populates the fields post search, you'd like to do
 it during search.
>>>
>>> Bingo, but I think the collection functionality needs to be on Collector,
>>> as I'd hate to have to lose out on functionality that the other impls have
>>> to offer, or have to recreate them.
>>>
>>
>> Hmm, maybe I can get at this info from the setScorer capabilities.  Then I
>> would just need a place to hang the data...  Maybe would just take having
>> the SpanScorer implementation provide just a wee bit more access to
>> structures...
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-08-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742270#action_12742270
 ] 

Michael McCandless commented on LUCENE-1748:


bq. I'm tempted to make Spans abstract. 

+1

> getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
> --
>
> Key: LUCENE-1748
> URL: https://issues.apache.org/jira/browse/LUCENE-1748
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Query/Scoring
>Affects Versions: 2.4, 2.4.1
> Environment: all
>Reporter: Hugh Cayless
>Assignee: Mark Miller
> Fix For: 2.9
>
> Attachments: LUCENE-1748.patch, LUCENE-1748.patch
>
>
> I just spent a long time tracking down a bug resulting from upgrading to 
> Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was 
> written against 2.3.  Since the project's SpanQuerys didn't implement 
> getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans 
> which returned null and caused a NullPointerException in the Lucene code, far 
> away from the actual source of the problem.  
> It would be much better for this kind of thing to show up at compile time, I 
> think.
> Thanks!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: The new Contrib QueryParser should not be slated to replace the old one yet

2009-08-12 Thread Michael Busch
I think opaque terms are a good and useful feature; we have discussed 
them several times and experimentally implemented them in the past.


However, I think that should be a separate discussion/feature request. It 
solves a different problem.


 Michael

On 8/12/09 1:51 AM, Shai Erera wrote:


Is there any example when you cannot use the processing phase for
that?


I actually meant that w/ the old QP I can also do it, by extending 
QueryParser and overriding "newWildcardQuery(Term)". I'm sure this can 
be done w/ the new QP as well. I just gave an example of something the 
new QP does not allow me to do more easily.


About the opaque clauses and '@' - usually I'd think it's not the user 
who writes such queries, but the application developer. Therefore the 
'@' does not really matter.


Without opaque clauses, if I want to add some ability, like Spatial 
search together w/ the other Lucene syntax, I will have a problem. I 
will need to copy the SyntaxParser and add Spatial syntax to it. And 
with that I lose whatever improvements that will be done on the 
default SyntaxParser.


We could do without the '@' and use field::'some query', i.e., a double colon 
(::) and the query string surrounded w/ '. Maybe that will look more 
native to the user. We can perhaps have one colon (:) and ' to 
surround the query and change the field handling to recognize this is 
an opaque field (because of the '), but I don't know if this breaks 
the current syntax/parser.


Shai

On Wed, Aug 12, 2009 at 11:08 AM, Adriano Crestani 
<adrianocrest...@gmail.com> wrote:


If I want to control how Wildcard clauses are handled, I can do it
w/ today's QP as well, just extend it and override the appropriate
getter method.

The SyntaxParser can produce WildcardQueryNode object which can
further be processed on the processing phase. Is there any example
when you cannot use the processing phase for that?


In conclusion, I think that if we want to have a truly extensible
QP, we should start w/ the query syntax first, and my proposal are
those opaque terms.

Agree, I also think we need to improve a lot the syntax parsing
phase. It's really simple and not extensible yet. Opaque terms are
interesting, I just don't think users will like to type '@' before
the field names, actually the user has no idea why he's typing
that @, so there is no need for that. I think we could do a
mapping from field name to parser directly. Anyway, this approach
would only work for field:term syntaxes, any other different
syntax, like xml syntax, will need a different approach. I cannot
think about a generic API yet for this approach, any suggestions?


On Wed, Aug 12, 2009 at 12:54 AM, Shai Erera <ser...@gmail.com> wrote:

If I want to control how Wildcard clauses are handled, I can
do it w/ today's QP as well, just extend it and override the
appropriate getter method.







Re: The new Contrib QueryParser should not be slated to replace the old one yet

2009-08-12 Thread Adriano Crestani
We can perhaps have one colon (:) and ' to surround the query and change the
field handling to recognize this is an opaque field (because of the '), but
I don't know if this breaks the current syntax/parser.

I think this way is cleaner :)

On Wed, Aug 12, 2009 at 1:51 AM, Shai Erera  wrote:

> We can perhaps have one colon (:) and ' to surround the query and change
> the field handling to recognize this is an opaque field (because of the '),
> but I don't know if this breaks the current syntax/parser.
>


Re: The new Contrib QueryParser should not be slated to replace the old one yet

2009-08-12 Thread Shai Erera
>
> Is there any example when you cannot use the processing phase for that?
>

I actually meant that w/ the old QP I can also do it, by extending
QueryParser and overriding "newWildcardQuery(Term)". I'm sure this can be
done w/ the new QP as well. I just gave an example of something the new QP
does not allow me to do more easily.
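
For concreteness, a minimal sketch of that old-QP extension point
(newWildcardQuery(Term) is the real 2.9 factory method; the boost tweak is
an arbitrary stand-in customization):

{code}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class MyQueryParser extends QueryParser {
  public MyQueryParser(String field, Analyzer analyzer) {
    super(field, analyzer);
  }

  // one overridden factory method changes how every wildcard clause
  // is turned into a Query
  protected Query newWildcardQuery(Term t) {
    Query q = new WildcardQuery(t);
    q.setBoost(0.5f); // stand-in: dampen wildcard matches
    return q;
  }
}
{code}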

About the opaque clauses and '@' - usually I'd think it's not the user who
writes such queries, but the application developer. Therefore the '@' does
not really matter.

Without opaque clauses, if I want to add some ability, like Spatial search
together w/ the other Lucene syntax, I will have a problem. I will need to
copy the SyntaxParser and add Spatial syntax to it. And with that I lose
whatever improvements that will be done on the default SyntaxParser.

We could do without the '@' and use field::'some query', i.e., a double colon (::)
and the query string surrounded w/ '. Maybe that will look more native to the
user. We can perhaps have one colon (:) and ' to surround the query and
change the field handling to recognize this is an opaque field (because of
the '), but I don't know if this breaks the current syntax/parser.

Shai

On Wed, Aug 12, 2009 at 11:08 AM, Adriano Crestani <
adrianocrest...@gmail.com> wrote:

> If I want to control how Wildcard clauses are handled, I can do it w/
> today's QP as well, just extend it and override the appropriate getter
> method.
>
> The SyntaxParser can produce WildcardQueryNode object which can further be
> processed on the processing phase. Is there any example when you cannot use
> the processing phase for that?
>
> In conclusion, I think that if we want to have a truly extensible QP, we
> should start w/ the query syntax first, and my proposal are those opaque
> terms.
>
> Agree, I also think we need to improve a lot the syntax parsing phase. It's
> really simple and not extensible yet. Opaque terms are interesting, I just
> don't think users will like to type '@' before the field names, actually the
> user has no idea why he's typing that @, so there is no need for that. I
> think we could do a mapping from field name to parser directly. Anyway, this
> approach would only work for field:term syntaxes, any other different
> syntax, like xml syntax, will need a different approach. I cannot think
> about a generic API yet for this approach, any suggestions?
>
>
> On Wed, Aug 12, 2009 at 12:54 AM, Shai Erera  wrote:
>
>> If I want to control how Wildcard clauses are handled, I can do it w/
>> today's QP as well, just extend it and override the appropriate getter
>> method.
>>
>
>


Re: The new Contrib QueryParser should not be slated to replace the old one yet

2009-08-12 Thread Adriano Crestani
If I want to control how Wildcard clauses are handled, I can do it w/
today's QP as well, just extend it and override the appropriate getter
method.

The SyntaxParser can produce WildcardQueryNode object which can further be
processed on the processing phase. Is there any example when you cannot use
the processing phase for that?

In conclusion, I think that if we want to have a truly extensible QP, we
should start w/ the query syntax first, and my proposal are those opaque
terms.

Agree, I also think we need to improve a lot the syntax parsing phase. It's
really simple and not extensible yet. Opaque terms are interesting; I just
don't think users will like to type '@' before the field names. Actually, the
user has no idea why he's typing that @, so there is no need for it. I
think we could do a mapping from field name to parser directly. Anyway, this
approach would only work for field:term syntaxes, any other different
syntax, like xml syntax, will need a different approach. I cannot think
about a generic API yet for this approach, any suggestions?

On Wed, Aug 12, 2009 at 12:54 AM, Shai Erera  wrote:

> If I want to control how Wildcard clauses are handled, I can do it w/
> today's QP as well, just extend it and override the appropriate getter
> method.
>


[jira] Resolved: (LUCENE-1803) Wrong javadoc on LowerCaseTokenizer.normalize

2009-08-12 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-1803.
---

   Resolution: Fixed
Fix Version/s: 2.9

I just committed this. Thanks!
(revision: 803404)

> Wrong javadoc on LowerCaseTokenizer.normalize
> -
>
> Key: LUCENE-1803
> URL: https://issues.apache.org/jira/browse/LUCENE-1803
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Javadocs
>Reporter: Bernd Fondermann
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LowerCaseTokenizer_javadoc.patch
>
>
> The javadoc on LowerCaseTokenizer.normalize seems to be copy/paste from 
> LetterTokenizer.isTokenChar.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-1803) Wrong javadoc on LowerCaseTokenizer.normalize

2009-08-12 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reassigned LUCENE-1803:
-

Assignee: Uwe Schindler

> Wrong javadoc on LowerCaseTokenizer.normalize
> -
>
> Key: LUCENE-1803
> URL: https://issues.apache.org/jira/browse/LUCENE-1803
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Javadocs
>Reporter: Bernd Fondermann
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LowerCaseTokenizer_javadoc.patch
>
>
> The javadoc on LowerCaseTokenizer.normalize seems to be copy/paste from 
> LetterTokenizer.isTokenChar.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: The new Contrib QueryParser should not be slated to replace the old one yet

2009-08-12 Thread Shai Erera
Michael, I wrote the above reply before I noticed you already replied.
Thanks for the explanation.

I guess that the way I see it, being able to extend a SyntaxParser is more
important than building my final Query object. If I want to enhance the
query syntax by replacing [] {} w/ <= and >=, how do I do that? I should
still copy the entire SyntaxParser logic, and modify these two places,
right?

By copying, I also copy existing bugs. If say someone fixes a bug which is
not related at all to the change above, how do I merge it? By default I
won't inherit it, and I'll need to manually apply the patch to my now
private version of SyntaxParser, even though 90% of it is still the original
Lucene parser.

And if someone augments the current syntax or parser w/ better parsing of
boolean queries. How do I take advantage of that?

W/ the opaque terms above, I should not have any problem w/ that. If I want
to have a Spatial syntax, I write a Spatial parser. If I want to have a
different range queries syntax, I write my own Range parser. If someone
fixes a bug in core parser, or improves how other sections of the syntax are
handled, I get those for free, because I never touch the parser's logic.

The new QP, AFAIU, does not help me in this case. All it gives me are some
helper classes (Builders, Processors maybe). But my core problem is the
syntax. If I want to control how Wildcard clauses are handled, I can do it
w/ today's QP as well, just extend it and override the appropriate getter
method.

In conclusion, I think that if we want to have a truly extensible QP, we
should start w/ the query syntax first, and my proposal are those opaque
terms. Then, we can have a discussion about whether the new QP allows us to
support it more easily or not. If say we have an interface QueryParser w/ a
single parse(String) method which returns a Query object. Do we really care
how this QP was written? Whether it uses the new QP framework or something
else?

Shai

On Wed, Aug 12, 2009 at 10:43 AM, Shai Erera  wrote:

> With the new QP we can build out a syntax that's compatible with
>> GData and be able to embed location/spatial queries directly
>> into the query string. (i.e. @+40.75-074.00 + 5mi)
>>
>
> What do you mean "with the new QP"? What prevents you from doing that w/o
> the new QP, as in writing your own QP? What are the benefits the new QP has
> when you come to deal w/ such terms? Unless you're talking about extending
> the Lucene syntax w/ spatial clauses. Just for my education, how do you
> extend the new QP w/ this information? Can you extend the Tokenizer, or do
> you need to write a new one?
>
> I'm trying to separate between the query syntax and a QP. The new QP is
> more of a framework for how to parse queries. It's well architected and
> designed. It allows to build different QPs for different syntaxes easily.
>
> As for the query syntax, what if we had augmented Lucene query syntax w/
> opaque clauses support. Something like @qpname::'query string'. Then, we can
> add to a QP a QP mapping from qpname to QP instance. That would allow anyone
> to use Lucene's QP and write new QPs (however they want) to match different
> opaque clauses.
>
> For the example above, I could write this query: "restaurants 
> @spatial::'@+40.75-074.00
> + 5mi' " (quotes are not part of the query string) and instantiate the QP as
> follows:
> QueryParser qp = new QueryParser();
> qp.addQueryParser("spatial", new SpatialQueryParser());
> qp.parse(queryString);
>
> Upon parsing, the default QP would hit the opaque clause and defer parsing
> of the text in between ' to SpatialQueryParser. We'd need to come up w/ a
> simple QP interface, with a parse() method or something that it can call.
> Nothing too fancy.
>
> SpatialQueryParser could be implemented however we choose. Not necessarily
> using the new QP framework.
>
> Maybe we should add this to Lucene anyway, and the new QP would just make
> the implementations easier.
>
> BTW, in case I managed to make a wrong impression - I'm not against the new
> QP :).
>
> Shai
>
>
> On Wed, Aug 12, 2009 at 8:53 AM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> With the new QP we can build out a syntax that's compatible with
>> GData and be able to embed location/spatial queries directly
>> into the query string. (i.e. @+40.75-074.00 + 5mi)
>>
>> SQL like range queries (i.e. [megapixel >= 3.0])
>>
>> On Tue, Aug 11, 2009 at 10:44 PM, Jason
>> Rutherglen wrote:
>> > I'm starting to use the new parser to emulate Google's queries
>> > (i.e. a phrase query with a single term means no-stemming,
>> > something the current QP doesn't allow because it converts the
>> > quoted query into a term query inside the JavaCC portion). It's
>> > been very straightforward and logical to use (so far).
>> >
>> > Thanks to the contrib query parser team!
>> >
>> > On Tue, Aug 11, 2009 at 10:54 AM, Mark Miller
>> wrote:
>> >> I don't think we should stick with the current path of replacing the
>> current
>> 

Re: The new Contrib QueryParser should not be slated to replace the old one yet

2009-08-12 Thread Adriano Crestani
Some comments in line:

The new QueryParser has not been used much at all yet. Its interfaces (which
will need to abide by back compat in core) have not been vetted enough.

Agreed. Some points about the API must still be discussed, and to start
that discussion the contributors must have a deeper understanding of the
main points of the new QP; then we can discuss what must be adjusted. So, I
think it should stay longer in contrib as experimental, and as people start
(and they already have started) using it we will get feedback from them.

The new parser appears to add complication to some of things that were very
simple with the old parser.

Some things could be easily done with the old QP, because it was designed
specifically for that. For example, you can easily change how RangeQuery
objects are created, because the old QP provides a method for that.
Otherwise, it becomes very difficult to maintain, add extra processing, plug
and unplug functionality, or separate syntax from semantics. At the company I
work for, we used to use the old one, just extending it. There was always a
new requirement and the code had to be changed; there were IF statements and
control variables all over the place. I wonder if other companies are
running into the same problems using the old QP. My conclusion here is:
things that are simple are usually not very powerful, flexible, or
maintainable.

This new QP that was contributed to Lucene can be split into two parts: the
core and the Lucene QP implementation.

The core contains the QP framework classes; the framework tries to define
the best way to implement a QP, so it's easily maintainable and
extensible/flexible if you follow the rules. Of course the user can just
ignore what the framework suggests, like doing processing at building time, or
parsing at processing time, etc.; it does not prohibit that. Maybe the
framework classes are not finished yet, and they might require more work to get
them into better shape; at the same time I think they're ready for a release,
since you can already write a complete QP with them. I would like to suggest
that the core classes be included in the 3.0 Lucene core.

The Lucene QP implementation is the old QP implementation rebuilt on the
new framework rules. The syntax and semantics were separated, so from now
on any new functionality or change can be easily performed. OK, maybe this new
implementation should stay in contrib until users decide whether it's better
than the old one.

On Tue, Aug 11, 2009 at 10:53 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> With the new QP we can build out a syntax that's compatible with
> GData and be able to embed location/spatial queries directly
> into the query string. (i.e. @+40.75-074.00 + 5mi)
>
> SQL like range queries (i.e. [megapixel >= 3.0])
>
> On Tue, Aug 11, 2009 at 10:44 PM, Jason
> Rutherglen wrote:
> > I'm starting to use the new parser to emulate Google's queries
> > (i.e. a phrase query with a single term means no-stemming,
> > something the current QP doesn't allow because it converts the
> > quoted query into a term query inside the JavaCC portion). It's
> > been very straightforward and logical to use (so far).
> >
> > Thanks to the contrib query parser team!
> >
> > On Tue, Aug 11, 2009 at 10:54 AM, Mark Miller
> wrote:
> >> I don't think we should stick with the current path of replacing the
> current
> >> QueryParser with the new contrib QueryParser in Lucene 3.0.
> >>
> >> The new QueryParser has not been used much at all yet. Its interfaces
> (which
> >> will need to abide by back compat in core) have not been vetted enough.
> >>
> >> The new parser appears to add complication to some of things that were
> very
> >> simple with the old parser.
> >>
> >> The main benefits of the new parser are claimed to be the ability to
> plug
> >> and play many syntaxes and QueryBuilders. This is not an end user
> benefit
> >> though and I'm not even sure how much of a benefit it is to us. There is
> >> currently only one impl. It seems to me, once you start another impl,
> its a
> >> long shot that the exact same query tree representation is going to work
> >> with a completely different syntax. Sure, if you are just doing postfix
> >> rather than prefix, it will be fine – but the stuff that would likely be
> >> done – actual new syntaxes – are not likely to be very pluggable. If a
> >> syntax can map to the same query tree, I think we would likely stick to
> a
> >> single syntax – else suffer the confusion and maintenance headaches for
> >> syntactic sugar. More than a well factored QueryParser that can more
> easily
> >> allow different syntaxes to map to the same query tree representation, I
> >> think we just want a single solid syntax for core Lucene that supports
> Spans
> >> to some degree. We basically have that now, sans the spans support.
> Other,
> >> more exotic QueryParsers should live in contrib, as they do now.
> >>
> >> Which isn't to say this QueryParser should not one 

Re: The new Contrib QueryParser should not be slated to replace the old one yet

2009-08-12 Thread Shai Erera
>
> With the new QP we can build out a syntax that's compatible with
> GData and be able to embed location/spatial queries directly
> into the query string. (i.e. @+40.75-074.00 + 5mi)
>

What do you mean "with the new QP"? What prevents you from doing that w/o
the new QP, as in writing your own QP? What are the benefits the new QP has
when you come to deal w/ such terms? Unless you're talking about extending
the Lucene syntax w/ spatial clauses. Just for my education, how do you
extend the new QP w/ this information? Can you extend the Tokenizer, or do
you need to write a new one?

I'm trying to separate between the query syntax and a QP. The new QP is more
of a framework for how to parse queries. It's well architected and designed.
It allows to build different QPs for different syntaxes easily.

As for the query syntax, what if we had augmented Lucene query syntax w/
opaque clauses support. Something like @qpname::'query string'. Then, we can
add to a QP a QP mapping from qpname to QP instance. That would allow anyone
to use Lucene's QP and write new QPs (however they want) to match different
opaque clauses.

For the example above, I could write this query: "restaurants
@spatial::'@+40.75-074.00
+ 5mi' " (quotes are not part of the query string) and instantiate the QP as
follows:
QueryParser qp = new QueryParser();
qp.addQueryParser("spatial", new SpatialQueryParser());
qp.parse(queryString);

Upon parsing, the default QP would hit the opaque clause and defer parsing
of the text in between ' to SpatialQueryParser. We'd need to come up w/ a
simple QP interface, with a parse() method or something that it can call.
Nothing too fancy.
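
A minimal sketch of that contract, using the names from the example above;
nothing here is existing Lucene API:

{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;

// Hypothetical: the contract each registered sub-parser implements.
interface OpaqueClauseParser {
  Query parse(String clauseText) throws ParseException;
}

// Hypothetical host-side registry: when the main parser hits
// @name::'...', it hands the quoted text to the mapped parser.
class OpaqueClauseRegistry {
  private final Map parsers = new HashMap();

  void addQueryParser(String name, OpaqueClauseParser parser) {
    parsers.put(name, parser);
  }

  Query parseOpaque(String name, String clauseText) throws ParseException {
    OpaqueClauseParser p = (OpaqueClauseParser) parsers.get(name);
    if (p == null) {
      throw new ParseException("no parser registered for: " + name);
    }
    return p.parse(clauseText);
  }
}
{code}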

SpatialQueryParser could be implemented however we choose. Not necessarily
using the new QP framework.

Maybe we should add this to Lucene anyway, and the new QP would just make
the implementations easier.

BTW, in case I managed to make a wrong impression - I'm not against the new
QP :).

Shai

On Wed, Aug 12, 2009 at 8:53 AM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> With the new QP we can build out a syntax that's compatible with
> GData and be able to embed location/spatial queries directly
> into the query string. (i.e. @+40.75-074.00 + 5mi)
>
> SQL like range queries (i.e. [megapixel >= 3.0])
>
> On Tue, Aug 11, 2009 at 10:44 PM, Jason
> Rutherglen wrote:
> > I'm starting to use the new parser to emulate Google's queries
> > (i.e. a phrase query with a single term means no-stemming,
> > something the current QP doesn't allow because it converts the
> > quoted query into a term query inside the JavaCC portion). It's
> > been very straightforward and logical to use (so far).
> >
> > Thanks to the contrib query parser team!
> >
> > On Tue, Aug 11, 2009 at 10:54 AM, Mark Miller
> wrote:
> >> I don't think we should stick with the current path of replacing the
> current
> >> QueryParser with the new contrib QueryParser in Lucene 3.0.
> >>
> >> The new QueryParser has not been used much at all yet. Its interfaces
> (which
> >> will need to abide by back compat in core) have not been vetted enough.
> >>
> >> The new parser appears to add complication to some of things that were
> very
> >> simple with the old parser.
> >>
> >> The main benefits of the new parser are claimed to be the ability to
> plug
> >> and play many syntaxes and QueryBuilders. This is not an end user
> benefit
> >> though and I'm not even sure how much of a benefit it is to us. There is
> >> currently only one impl. It seems to me, once you start another impl,
> its a
> >> long shot that the exact same query tree representation is going to work
> >> with a completely different syntax. Sure, if you are just doing postfix
> >> rather than prefix, it will be fine – but the stuff that would likely be
> >> done – actual new syntaxes – are not likely to be very pluggable. If a
> >> syntax can map to the same query tree, I think we would likely stick to
> a
> >> single syntax – else suffer the confusion and maintenance headaches for
> >> syntactic sugar. More than a well factored QueryParser that can more
> easily
> >> allow different syntaxes to map to the same query tree representation, I
> >> think we just want a single solid syntax for core Lucene that supports
> Spans
> >> to some degree. We basically have that now, sans the spans support.
> Other,
> >> more exotic QueryParsers should live in contrib, as they do now.
> >>
> >> Which isn't to say this QueryParser should not one day rule the roost –
> but
> >> I don't think its earned the right yet. And I don't think there is a
> hurry
> >> to toss the old parser.
> >>
> >> Personally, I think that the old parser should not be deprecated. Lets
> let
> >> the new parser breath in contrib for a bit. Lets see if anyone actually
> adds
> >> any other syntaxes. Lets see if the pluggability results in any
> >> improvements. Lets see if some of the harder things to do (overriding
> query
> >> build methods?) become easier or keep people from using the n

[jira] Commented: (LUCENE-533) SpanQuery scoring: SpanWeight lacks a recursive traversal of the query tree

2009-08-12 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742249#action_12742249
 ] 

Paul Elschot commented on LUCENE-533:
-

I see I missed the introduction of payloads into Spans. As back compat is 
broken anyway, one might as well get rid of the Spans interface completely and 
make Spans an abstract class.
Since it is only the interface that is in the way of changes, any way to get 
rid of Spans as an interface is ok with me.

Payloads can be yet another way to introduce a (term/spans) weight, so one 
might subclass these from WeightedSpans:
Spans -> WeightedSpans -> PayloadSpans.
That would also allow using WeightedSpans inside an object hierarchy for 
scoring nested span queries, and using PayloadSpans as the leafs.

Scoring nested span queries is not trivial, and allowing a weight on each spans 
does not make it simpler, but at least it would allow span queries to behave 
more like boolean queries.

> SpanQuery scoring: SpanWeight lacks a recursive traversal of the query tree
> ---
>
> Key: LUCENE-533
> URL: https://issues.apache.org/jira/browse/LUCENE-533
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 1.9
>Reporter: Vincent Le Maout
>Priority: Minor
>
> I found the computing of weights to be somewhat different according to the 
> query type (BooleanQuery versus SpanQuery) :
> org.apache.lucene.search.BooleanQuery.BooleanWeight :
> public BooleanWeight(Searcher searcher)
>     throws IOException {
>   this.similarity = getSimilarity(searcher);
>   for (int i = 0 ; i < clauses.size(); i++) {
>     BooleanClause c = (BooleanClause)clauses.elementAt(i);
>     weights.add(c.getQuery().createWeight(searcher));
>   }
> }
> which looks like a recursive descent through the tree, taking into account 
> the weights of all the nodes, whereas :
> org.apache.lucene.search.spans.SpanWeight :
> public SpanWeight(SpanQuery query, Searcher searcher)
>     throws IOException {
>   this.similarity = query.getSimilarity(searcher);
>   this.query = query;
>   this.terms = query.getTerms();
>   idf = this.query.getSimilarity(searcher).idf(terms, searcher);
> }
> lacks any traversal and according to what I have understood so far from the 
> rest
> of the code, only takes into account the boost of the tree root in 
> sumOfSquaredWeights(),
> which is consistent with the resulting scores not considering the boost of 
> the tree
> leaves.
> vintz
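
A hedged sketch of the missing traversal, mirroring the BooleanWeight loop
quoted above. It is illustrative only, not a drop-in SpanWeight; it assumes
SpanQuery.createWeight(Searcher) is callable as on current trunk, and
SpanNearQuery.getClauses() is real API.

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;

public class RecursiveSpanWeights {
  /** Collects one child weight per clause, as BooleanWeight does. */
  public static List childWeights(SpanNearQuery query, Searcher searcher)
      throws IOException {
    List weights = new ArrayList();
    SpanQuery[] clauses = query.getClauses();
    for (int i = 0; i < clauses.length; i++) {
      // each clause contributes its own weight, so nested boosts
      // could finally enter the score
      weights.add(clauses[i].createWeight(searcher));
    }
    return weights;
  }
}
{code}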

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: who clears attributes?

2009-08-12 Thread Michael Busch



+1. We don't use Solr, but have quite a bunch of medium and
short-sized documents. Plus heaps of metadata fields.

I'm yet to read Uwe's example, but I feel I'm a bit misunderstood by


Did you read it yet? What do you think about it?


some of you. My gripe with new API is not that it brings us troubles
(which are solved one way or another), it is that the switch and
associated migration costs bring zero benefits in immediate and remote
future.
The only person that tried to disprove this claim is Uwe. Others
either say "the problems are solved, so it's okay to move to the new
API", or "this will be usable when flexindexing arrives". Sorry, the
last phrase doesn't hold its place, this API is orthogonal to
flexindexing, or at least nobody has shown the opposite.


Whether the API is orthogonal to flexible indexing depends on how you 
define "flexible indexing". I admit the term is vague and probably 
nowhere clearly defined.


I agree that if flexible indexing means to only change the encoding, 
i.e. *how* data is stored, e.g. PFOR vs. the current posting format, 
then yes, we don't need the new TokenStream API for it.


But the goals we have with flexible indexing are more than that. We want 
to allow customizing *what* data is stored in the inverted index. The 
very first discussion about flexible indexing, from several years ago, 
can be found in the wiki: 
http://wiki.apache.org/lucene-java/FlexibleIndexing.


Already in this very early proposal it was suggested to have the 
following posting formats as a start:

a. <doc>+
b. <doc, freq>+
c. <doc, freq, <position>+>+
d. <doc, freq, <position, boost>+>+

For d. you need to change the TokenStream API. How else can we get the 
boost from the source to the indexer? Of course you can always serialize 
the additional data into the payload byte array, but if filters want to 
do something with it performance suffers. The new API solves this 
problem very nicely. When we open the posting format like this people 
will want to store different custom things in there. The new TokenStream 
API is prepared for that - the old one isn't.
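
As a sketch of what format d. enables, a custom attribute could carry the
per-position boost through the chain. BoostAttribute itself is
hypothetical; only the Attribute/AttributeImpl plumbing is real 2.9 API.

{code}
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;

// Hypothetical attribute: filters and the indexing consumer can read
// and write the boost directly, no payload serialization involved.
interface BoostAttribute extends Attribute {
  void setBoost(float boost);
  float getBoost();
}

// Found by AttributeSource's "<interface>Impl" naming convention.
class BoostAttributeImpl extends AttributeImpl implements BoostAttribute {
  private float boost = 1.0f;

  public void setBoost(float boost) { this.boost = boost; }
  public float getBoost() { return boost; }

  public void clear() { boost = 1.0f; }

  public void copyTo(AttributeImpl target) {
    ((BoostAttribute) target).setBoost(boost);
  }

  public boolean equals(Object other) {
    return other instanceof BoostAttributeImpl
        && ((BoostAttributeImpl) other).boost == boost;
  }

  public int hashCode() {
    return Float.floatToIntBits(boost);
  }
}
{code}

Any filter or the indexing consumer then just calls
addAttribute(BoostAttribute.class) and reads a float directly instead of
decoding a payload.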


 Michael


So, what I'm arguing against is adding some code (and forcing users to
migrate) just because we can, with no other reasons.




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org