[jira] Updated: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1692:
--------------------------------
    Attachment: LUCENE-1692.txt

Adds tests for ThaiAnalyzer token offsets and types, both of which have bugs! Tests for the correct behavior are included but commented out.

> Contrib analyzers need tests
> ----------------------------
>
>                 Key: LUCENE-1692
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1692
>             Project: Lucene - Java
>          Issue Type: Test
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Michael McCandless
>             Fix For: 2.9
>
>         Attachments: LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt
>
>
> The analyzers in contrib need tests, preferably ones that test the behavior
> of all the Token 'attributes' involved (offsets, type, etc.) and not just
> what they do with token text. This way, they can be converted to the new
> API without breakage.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1693:
----------------------------------
    Attachment: LUCENE-1693.patch

Sorry, the last patch was invalid (it did not compile); I forgot to revert some changes before posting. The attached patch still has problems in TeeTokenStream, SinkTokenizer and CachingTokenFilter (see before), but fixes:
- double cloning of payloads
- the first of your tests now works correctly, even if I remove next() from StopFilter and/or LowercaseFilter

> AttributeSource/TokenStream API improvements
> --------------------------------------------
>
>                 Key: LUCENE-1693
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1693
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch,
> LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java,
> TestCompatibility.java
>
>
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
>
> - removes the set/getUseNewAPI() methods (including the standard ones).
>   Instead, by default incrementToken() throws a subclass of
>   UnsupportedOperationException. The indexer initially tries to call
>   incrementToken() once to see if the exception is thrown; if so, it
>   falls back to the old API.
>
> - introduces interfaces for all Attributes. The corresponding
>   implementations have the postfix 'Impl', e.g. TermAttribute and
>   TermAttributeImpl. AttributeSource now has a factory for creating the
>   Attribute instances; the default implementation looks for implementing
>   classes with the postfix 'Impl'. Token now implements all 6
>   TokenAttribute interfaces.
>
> - adds a new method to AttributeSource: addAttributeImpl(AttributeImpl).
>   Using reflection, it walks up the class hierarchy of the passed-in
>   object and finds all interfaces that the class or its superclasses
>   implement and that extend the Attribute interface. It then adds the
>   interface->instance mappings to the attribute map for each of the
>   found interfaces.
>
> - AttributeImpl now has a default implementation of toString() that uses
>   reflection to print out the values of the attributes in a default
>   formatting. This makes it a bit easier to implement AttributeImpl,
>   because toString() was declared abstract before.
>
> - Cloning is now done much more efficiently in captureState. The method
>   figures out which unique AttributeImpl instances are contained as
>   values in the attributes map, because those are the ones that need to
>   be cloned. It creates a singly linked list that supports deep cloning
>   (in the inner class AttributeSource.State). AttributeSource keeps
>   track of when this state changes, i.e. whenever new attributes are
>   added to the AttributeSource. Only in that case will captureState
>   recompute the state; otherwise it will simply clone the precomputed
>   state and return the clone. restoreState(AttributeSource.State) walks
>   the linked list and uses the copyTo() method of AttributeImpl to copy
>   all values over into the attribute that the source stream
>   (e.g. SinkTokenizer) uses.
>
> Cloning performance can be greatly improved if multiple AttributeImpl
> instances are not used in one TokenStream. A user can e.g. simply add a
> Token instance to the stream instead of the individual attributes, or
> implement a subclass of AttributeImpl that implements exactly the
> Attribute interfaces needed. I think addAttributeImpl should be
> considered an expert API, as this manual optimization is only needed if
> cloning performance is crucial. I ran some quick performance tests using
> Tee/Sink tokenizers (which do cloning) and the performance was roughly
> 20% faster with the new API. I'll run some more performance tests and
> post more numbers then.
>
> Note also that when we add serialization to the Attributes, e.g. for
> supporting storing serialized TokenStreams in the index, serialization
> should benefit even more significantly from the new API than cloning
> does.
>
> Also, the TokenStream API does not change, except for the removal of the
> set/getUseNewAPI methods, so the patches in LUCENE-1460 should still
> work.
>
> All core tests pass; however, I need to update all the documentation and
> also add some unit tests for the new AttributeSource functionality. So
> this patch is not ready to commit yet, but I wanted to post it already
> for some feedback.
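The reflection walk that addAttributeImpl performs can be sketched in plain Java, independent of Lucene. Everything below (the Attribute marker interface, TokenImpl, the method name) is an illustrative stand-in, not Lucene's actual types:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class AttributeReflectionDemo {
    // Marker interface, standing in for Lucene's Attribute interface
    interface Attribute {}

    interface TermAttribute extends Attribute { String term(); }
    interface OffsetAttribute extends Attribute { int startOffset(); }

    // One impl class may implement several Attribute interfaces at once,
    // the way Token implements all six token attributes
    static class TokenImpl implements TermAttribute, OffsetAttribute {
        public String term() { return "foo"; }
        public int startOffset() { return 0; }
    }

    /** Walk up the class hierarchy of the passed-in object and collect
     *  every interface that extends the Attribute marker interface. */
    static Set<Class<?>> findAttributeInterfaces(Object impl) {
        Set<Class<?>> found = new LinkedHashSet<>();
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
                    found.add(iface);   // one interface->instance mapping per hit
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // A single impl instance yields two interface->instance mappings
        System.out.println(findAttributeInterfaces(new TokenImpl()));
    }
}
```

A real implementation would also recurse into superinterfaces; this sketch only inspects each class's directly declared interfaces, which is enough to show the interface->instance mapping idea.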
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1693:
----------------------------------
    Attachment: (was: LUCENE-1693.patch)
Synchronizing Lucene indexes across 2 application servers
I have a web application that uses Lucene for its search functionality. Lucene search requests are served by web services sitting on 2 application servers (IIS 7). The 2 application servers are load balanced using NetScaler. Both servers run a nightly batch job that updates the search indexes on the respective server.

I need to synchronize the search indexes on these 2 servers so that at any point in time both servers have up-to-date indexes. I was wondering what the best architecture/design strategy would be to do so, given that either of the 2 application servers could be serving a search request depending on its availability. Any inputs please? Thanks for reading!

--
View this message in context: http://www.nabble.com/Synchronizing-Lucene-indexes-across-2-application-servers-tp24086961p24086961.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
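One common pattern (a sketch, not a recommendation specific to this setup) is to build the index on one node and replicate the segment files to the other, exploiting the fact that Lucene index files are write-once: a file with the same name and size can usually be assumed identical. A minimal incremental copy might look like the following; the directory layout and the size-based staleness check are simplifying assumptions:

```java
import java.io.IOException;
import java.nio.file.*;

public class IndexSyncDemo {
    /** Copy files from src that are missing or differ in size in dst.
     *  Deletion of files that no longer exist in src is omitted here. */
    static int syncDirectory(Path src, Path dst) throws IOException {
        Files.createDirectories(dst);
        int copied = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(src)) {
            for (Path f : files) {
                if (!Files.isRegularFile(f)) continue;
                Path target = dst.resolve(f.getFileName().toString());
                boolean stale = !Files.exists(target)
                        || Files.size(target) != Files.size(f);
                if (stale) {
                    Files.copy(f, target, StandardCopyOption.REPLACE_EXISTING);
                    copied++;
                }
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("idx-src");
        Path dst = Files.createTempDirectory("idx-dst");
        Files.write(src.resolve("_0.cfs"), new byte[] {1, 2, 3});
        Files.write(src.resolve("segments_2"), new byte[] {9});
        System.out.println(syncDirectory(src, dst));   // first pass copies both files
        System.out.println(syncDirectory(src, dst));   // second pass copies nothing
    }
}
```

In practice you would copy the segments file last, make sure no commit is in progress during the copy (e.g. by snapshotting the index), and reopen the searcher on the target server after the sync completes.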
[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721078#action_12721078 ]

Shai Erera commented on LUCENE-1595:
------------------------------------

bq. I still want to run some tests with the wikipedia stuff

I added readContentSource.alg just for that purpose and ran it over the Wikipedia dump. All documents were read successfully.

bq. Removed modification to core Document class

Nice! I don't know how I missed that getFields().clear() option.

> Split DocMaker into ContentSource and DocMaker
> ----------------------------------------------
>
>                 Key: LUCENE-1595
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1595
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>            Assignee: Mark Miller
>             Fix For: 2.9
>
>         Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch,
> LUCENE-1595.patch, LUCENE-1595.patch
>
>
> This issue proposes some refactoring of the benchmark package. Today,
> DocMaker has two roles: collecting documents from a collection and
> preparing a Document object. These two should actually be split up into
> ContentSource and DocMaker, where DocMaker will use a ContentSource
> instance.
>
> ContentSource will implement all the methods of DocMaker, like
> getNextDocData, raw size-in-bytes tracking etc. This can actually fit well
> with LUCENE-1591, by having a basic ContentSource that offers input stream
> services and wraps a file (for example) with bzip or gzip streams etc.
> DocMaker will implement the makeDocument methods, reusing DocState etc.
>
> The idea is that collecting the Enwiki documents, for example, should be
> the same whether I create documents using DocState, add payloads or index
> additional metadata. The same goes for the Trec and Reuters collections,
> as well as LineDocMaker.
>
> In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are
> 99% the same and 1% different. Most of their differences lie in the way
> they read the data, while most of the similarity lies in the way they
> create documents (using DocState). That led to a somewhat bizarre
> extension of LineDocMaker by EnwikiDocMaker (just for the reuse of
> DocState). Also, other DocMakers do not use that DocState today,
> something they could have gotten for free with the refactoring proposed
> here.
>
> So by having an EnwikiContentSource, ReutersContentSource and others
> (TREC, Line, Simple), I can write several DocMakers, such as
> DocStateMaker, ConfigurableDocMaker (one which accepts all kinds of
> config options) and custom DocMakers (payload, facets, sorting), passing
> them a ContentSource instance, and reuse the same DocMaking algorithm
> with many content sources, as well as the same ContentSource algorithm
> with many DocMaker implementations.
>
> This will also give us the opportunity to perf-test content sources alone
> (i.e., compare bzip, gzip and regular input streams), without the
> overhead of creating a Document object.
>
> I've already done so in my code environment (I extend the benchmark
> package for my application's purposes) and I like the flexibility I have.
> I think this can be a nice contribution to the benchmark package, which
> can result in some code cleanup as well.
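The proposed split can be sketched as two cooperating types. The names follow the issue's proposal, but the method signatures and the string "document" are illustrative guesses, not the final benchmark API:

```java
public class ContentSourceDemo {
    /** Raw content record, independent of how documents are built from it. */
    static class DocData {
        final String name, body;
        DocData(String name, String body) { this.name = name; this.body = body; }
    }

    /** Knows only how to fetch raw content (Enwiki, TREC, Reuters, a line
     *  file...). Each collection gets its own implementation. */
    interface ContentSource {
        DocData getNextDocData();
    }

    /** Knows only how to turn DocData into an indexable document; the same
     *  DocMaker can be reused with any ContentSource implementation. */
    static class DocMaker {
        private final ContentSource source;
        DocMaker(ContentSource source) { this.source = source; }
        String makeDocument() {
            DocData d = source.getNextDocData();
            return d.name + ": " + d.body;   // stands in for building a Lucene Document
        }
    }

    public static void main(String[] args) {
        // A trivial "line file" source; swapping in an Enwiki-style source
        // would not change the DocMaker at all.
        ContentSource line = () -> new DocData("doc1", "hello");
        System.out.println(new DocMaker(line).makeDocument()); // doc1: hello
    }
}
```

The point of the split is visible even at this scale: the document-building code never mentions where the content came from.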
[jira] Commented: (LUCENE-1628) Persian Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721061#action_12721061 ]

Mark Miller commented on LUCENE-1628:
-------------------------------------

Okay, fair enough. I figured you'd know better than me; I just wanted to check. Certainly if we have other code that way, there's no reason to change it here. And of course it makes sense that you would still run into issues with the comments - garbage at best. I only ever use apply to/from clipboard, so I have luckily never seen that issue :)

We should be good to put this in then - I'll wait till we get squared away with the new token API patch, then commit.

> Persian Analyzer
> ----------------
>
>                 Key: LUCENE-1628
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1628
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1628.patch, LUCENE-1628.patch
>
>
> A simple persian analyzer.
> I measured TREC scores with the benchmark package below against
> http://ece.ut.ac.ir/DBRG/Hamshahri/ :
>
> SimpleAnalyzer:
> SUMMARY
>   Search Seconds:    0.012
>   DocName Seconds:   0.020
>   Num Points:        981.015
>   Num Good Points:   33.738
>   Max Good Points:   36.185
>   Average Precision: 0.374
>   MRR:               0.667
>   Recall:            0.905
>   Precision At 1:    0.585
>   Precision At 2:    0.531
>   Precision At 3:    0.513
>   Precision At 4:    0.496
>   Precision At 5:    0.486
>   Precision At 6:    0.487
>   Precision At 7:    0.479
>   Precision At 8:    0.465
>   Precision At 9:    0.458
>   Precision At 10:   0.460
>   Precision At 11:   0.453
>   Precision At 12:   0.453
>   Precision At 13:   0.445
>   Precision At 14:   0.438
>   Precision At 15:   0.438
>   Precision At 16:   0.438
>   Precision At 17:   0.429
>   Precision At 18:   0.429
>   Precision At 19:   0.419
>   Precision At 20:   0.415
>
> PersianAnalyzer:
> SUMMARY
>   Search Seconds:    0.004
>   DocName Seconds:   0.011
>   Num Points:        987.692
>   Num Good Points:   36.123
>   Max Good Points:   36.185
>   Average Precision: 0.481
>   MRR:               0.833
>   Recall:            0.998
>   Precision At 1:    0.754
>   Precision At 2:    0.715
>   Precision At 3:    0.646
>   Precision At 4:    0.646
>   Precision At 5:    0.631
>   Precision At 6:    0.621
>   Precision At 7:    0.593
>   Precision At 8:    0.577
>   Precision At 9:    0.573
>   Precision At 10:   0.566
>   Precision At 11:   0.572
>   Precision At 12:   0.562
>   Precision At 13:   0.554
>   Precision At 14:   0.549
>   Precision At 15:   0.542
>   Precision At 16:   0.538
>   Precision At 17:   0.533
>   Precision At 18:   0.527
>   Precision At 19:   0.525
>   Precision At 20:   0.518
[jira] Commented: (LUCENE-1628) Persian Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721057#action_12721057 ]

Robert Muir commented on LUCENE-1628:
-------------------------------------

Mark: thanks for the followup on the licenses!

Wrt the non-English text, I will say that if you set the encoding to UTF-8 (such as in Eclipse under Project > Properties > Text Encoding) then things are fine. The ant build also does the right thing, and there are definitely other analyzers that behave like this too, and will break if things aren't set right.

Also, if you do not set the encoding to UTF-8, most editors (such as Eclipse) will not be able to save the file, and will error out with encoding issues... even if the text is inside a comment! Not really (ok, a little) trying to talk you out of this, but I'm just not sure it would really help anything...

That being said... (my) Eclipse still jacks up if you Team > Apply Patch from a file. If you open the patch in Notepad, Ctrl-A, Ctrl-C, and then Team > Apply Patch from the clipboard, it works fine... very annoying!
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721054#action_12721054 ]

Mark Miller commented on LUCENE-1696:
-------------------------------------

Patch looks good! I'll just hold off till the token API improvement patch is finished, just in case we need to make an adjustment here.

> Added New Token API impl for ASCIIFoldingFilter
> -----------------------------------------------
>
>                 Key: LUCENE-1696
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1696
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Mark Miller
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter._newTokenAPI.patch,
> TestGermanCollation.java
>
>
> I added an implementation of incrementToken to ASCIIFoldingFilter.java and
> extended the existing test case for it. I will attach the patch shortly.
>
> Besides this improvement, I would like to start a small discussion about
> this filter. ASCIIFoldingFilter is meant to be a replacement for
> ISOLatin1AccentFilter, which is quite nice as it covers a superset of the
> latter. I have used this filter quite often, but never on an as-is basis.
> In most cases this filter does the correct thing (replaces a special char
> with its ASCII equivalent), but in some cases, like the German umlauts, it
> does not return the expected result. A German umlaut like 'ä' should not
> translate to 'a' but rather to 'ae'. I would like to change this, but I'm
> not 100% sure that is expected by all users of this filter. Another way of
> doing it would be to make it configurable with a flag. This would not
> affect performance, as we only check whether such an umlaut char is found.
>
> Further, it would be really helpful if this filter could "inject" the
> original/unmodified token with the same position increment into the token
> stream on demand. I think it's a valid use-case to index both the modified
> and unmodified token. For instance, the German word "süd" would be folded
> to "sud". In a query q:(süd) the filter would also fold to sud and
> therefore find sud, which has a totally different meaning. Folding works
> quite well, but for these special cases we could add options to make
> users' lives easier. The latter could be done in a subclass, while the
> umlaut problem should be fixed in the base class.
>
> simon
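The umlaut behavior Simon describes can be sketched outside of Lucene. This is a toy fold function with the proposed configurable flag, plus the "inject the original token" idea; in a real TokenFilter the injected original would carry positionIncrement 0, and the class name and mappings below are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class UmlautFoldingDemo {
    // German-style expansions; plain accent-stripping maps these to one letter
    static final Map<Character, String> GERMAN = Map.of(
            'ä', "ae", 'ö', "oe", 'ü', "ue", 'ß', "ss");

    static String fold(String term, boolean germanExpansion) {
        StringBuilder sb = new StringBuilder(term.length());
        for (char c : term.toCharArray()) {
            String exp = germanExpansion ? GERMAN.get(c) : null;
            if (exp != null) sb.append(exp);        // ä -> ae, ü -> ue, ...
            else if (c == 'ä') sb.append('a');      // plain folding fallback
            else if (c == 'ö') sb.append('o');
            else if (c == 'ü') sb.append('u');
            else sb.append(c);
        }
        return sb.toString();
    }

    /** Emit the folded term and, when it differs, the original term too. */
    static List<String> foldAndInject(String term, boolean germanExpansion) {
        List<String> out = new ArrayList<>();
        String folded = fold(term, germanExpansion);
        out.add(folded);
        if (!folded.equals(term)) out.add(term);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fold("süd", false));          // sud
        System.out.println(fold("süd", true));           // sued
        System.out.println(foldAndInject("süd", true));  // [sued, süd]
    }
}
```

The flag makes the behavior change opt-in, which matches the concern in the issue: existing users of the plain 'ä' -> 'a' folding keep their current results.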
[jira] Commented: (LUCENE-1599) SpanRegexQuery and SpanNearQuery is not working with MultiSearcher
[ https://issues.apache.org/jira/browse/LUCENE-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721052#action_12721052 ]

Mark Miller commented on LUCENE-1599:
-------------------------------------

Well, yuck. SpanNearQuery does a clone() call in its rewrite method, but there is no clone() impl, so it looks like it returns a SpanNearQuery with the same clauses instance. This then appears to get tangled up with the real query, and the real query gets modified into the rewritten form during the rewrite on searchable2. I think, anyway.

I wanted to test a fix to see if that was right, but SpanNearQuery can contain any span queries, so I guess all of them might need clone() impls and we may have to clone the whole chain? A little tired to think about it at the moment ;) It looks like the issue is with the cloning in SpanNearQuery, though.

> SpanRegexQuery and SpanNearQuery is not working with MultiSearcher
> ------------------------------------------------------------------
>
>                 Key: LUCENE-1599
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1599
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>    Affects Versions: 2.4.1
>         Environment: lucene-core 2.4.1, lucene-regex 2.4.1
>            Reporter: Billow Gao
>            Assignee: Mark Miller
>             Fix For: 2.9
>
>         Attachments: TestSpanRegexBug.java
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> MultiSearcher uses:
>   queries[i] = searchables[i].rewrite(original);
> to rewrite the query and then uses combine to combine them.
> But SpanRegexQuery's rewrite is different from the others: after you call
> it on the same query, it always returns the same rewritten queries. As a
> result, only the search on the first IndexSearcher works; all the others
> use the first IndexSearcher's rewritten queries, so many terms are missing
> and unexpected results are returned.
>
> Billow
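The suspected bug, a rewrite that shares its clauses with the original query, is easy to reproduce in miniature. NearQuery below is a stand-in for SpanNearQuery, not Lucene's class:

```java
import java.util.ArrayList;
import java.util.List;

public class CloneBugDemo {
    static class NearQuery implements Cloneable {
        List<String> clauses = new ArrayList<>();

        /** Broken: Object.clone() copies the field reference, so the
         *  "clone" and the original share one clauses list. */
        NearQuery shallowClone() {
            try { return (NearQuery) super.clone(); }
            catch (CloneNotSupportedException e) { throw new AssertionError(e); }
        }

        /** Fixed: clone the container too (and, in the real case,
         *  each contained span query as well). */
        NearQuery deepClone() {
            NearQuery copy = shallowClone();
            copy.clauses = new ArrayList<>(clauses);
            return copy;
        }
    }

    public static void main(String[] args) {
        NearQuery original = new NearQuery();
        original.clauses.add("term:a");

        NearQuery rewrittenShallow = original.shallowClone();
        rewrittenShallow.clauses.set(0, "rewritten:a");   // mutates the original too!
        System.out.println(original.clauses);             // [rewritten:a]

        original.clauses.set(0, "term:a");                // reset
        NearQuery rewrittenDeep = original.deepClone();
        rewrittenDeep.clauses.set(0, "rewritten:a");
        System.out.println(original.clauses);             // [term:a]
    }
}
```

This is why the second searcher in a MultiSearcher can end up seeing the first searcher's rewritten form: the "copy" it rewrites is not actually independent.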
[jira] Commented: (LUCENE-1628) Persian Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721048#action_12721048 ]

Mark Miller commented on LUCENE-1628:
-------------------------------------

Looks pretty good. I'm not sure if we should update to the new token API here, or just commit and hit it with the other issue. I guess we might as well get it here first.

Is it better to put the raw text in there like that (in the tests), or do you think it would be better to use escaped UTF-8 codes, with maybe the raw text in a comment? I'm just remembering running into issues with such things in a past life as I moved source code around.
[jira] Commented: (LUCENE-1628) Persian Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721044#action_12721044 ]

Mark Miller commented on LUCENE-1628:
-------------------------------------

bq. mark, on the same topic: if possible, at some time it would be great to know which licenses are OK, and which ones are not.

Found it. No problem:
* Apache License 2.0
* ASL 1.1
* BSD
* MIT/X11
* NCSA
* W3C Software License
* X.Net
* zlib/libpng

With some hassle:
* CDDL 1.0
* CPL 1.0
* EPL 1.0
* IPL 1.0
* MPL 1.0 and MPL 1.1
* SPL 1.0

http://www.apache.org/legal/3party.html
[jira] Updated: (LUCENE-1695) Update the Highlighter to use the new TokenStream API
[ https://issues.apache.org/jira/browse/LUCENE-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1695:
--------------------------------
    Attachment: LUCENE-1695.patch

Pretty much done; all tests pass. It breaks back compat, but frankly, straddling doesn't seem worth the effort here, or even very possible. You can't really give new methods to use in place of the deprecated ones, and deprecating by class would be a real nuisance, as we would lose class names I'd rather keep. We have no back compat policy, and I think it's worth just pushing this to the new API. I was also thinking about breaking back compat by changing the Highlighter to use the SpanScorer, so doing it all in one shot would be nice.

The overall migration should be fairly simple once you understand the new TokenFilter API. I'll handle it for Solr. This still needs either its own changes file to explain it, or it could go in the contrib common changes file. There is a change to MemoryIndex to get around issues with the new/old API and CachingTokenFilters. I'll have to see how the new TokenFilter API improvements issue works out before doing a final patch for this.

> Update the Highlighter to use the new TokenStream API
> -----------------------------------------------------
>
>                 Key: LUCENE-1695
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1695
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>             Fix For: 2.9
>
>         Attachments: LUCENE-1695.patch, LUCENE-1695.patch
[jira] Updated: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1595: Attachment: LUCENE-1595.patch
- Added to changes a bit
- Removed modification to core Document class
- Updated deletepercent.alg to new alg changes
- Fixed a couple comment typos
- Set to use content.source.forever rather than doc.maker.forever in ExtractWikipedia#main(String[] args)
The sort algs don't work :( unrelated to this patch and related to our deprecation of the auto sort field - Ryan just hit that over in solr-land too. I still want to run some tests with the wikipedia stuff, but still waiting for that mondo file to download :) Looks pretty nice overall, thanks Shai! > Split DocMaker into ContentSource and DocMaker > -- > > Key: LUCENE-1595 > URL: https://issues.apache.org/jira/browse/LUCENE-1595 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Reporter: Shai Erera >Assignee: Mark Miller > Fix For: 2.9 > > Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch, > LUCENE-1595.patch, LUCENE-1595.patch > > > This issue proposes some refactoring to the benchmark package. Today, > DocMaker has two roles: collecting documents from a collection and preparing > a Document object. These two should actually be split up into ContentSource and > DocMaker, which will use a ContentSource instance. > ContentSource will implement all the methods of DocMaker, like > getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ > 1591, by having a basic ContentSource that offers input stream services, and > wraps a file (for example) with bzip or gzip streams etc. > DocMaker will implement the makeDocument methods, reusing DocState etc. > The idea is that collecting the Enwiki documents, for example, should be the > same whether I create documents using DocState, add payloads or index > additional metadata. 
Same goes for Trec and Reuters collections, as well as > LineDocMaker. > In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are > 99% the same and 99% different. Most of their differences lie in the way they > read the data, while most of the similarity lies in the way they create > documents (using DocState). > That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker > (just the reuse of DocState). Also, other DocMakers do not use that DocState > today, something they could have gotten for free with this proposed > refactoring. > So by having an EnwikiContentSource, ReutersContentSource and others (TREC, > Line, Simple), I can write several DocMakers, such as DocStateMaker, > ConfigurableDocMaker (one which accepts all kinds of config options) and > custom DocMakers (payload, facets, sorting), passing to them a ContentSource > instance and reuse the same DocMaking algorithm with many content sources, as > well as the same ContentSource algorithm with many DocMaker implementations. > This will also give us the opportunity to perf test content sources alone > (i.e., compare bzip, gzip and regular input streams), w/o the overhead of > creating a Document object. > I've already done so in my code environment (I extend the benchmark package > for my application's purposes) and I like the flexibility I have. I think > this can be a nice contribution to the benchmark package, which can result in > some code cleanup as well.
[jira] Commented: (LUCENE-1700) LogMergePolicy.findMergesToExpungeDeletes need to get deletes from the SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721024#action_12721024 ] Jason Rutherglen commented on LUCENE-1700: -- Taking a step back, maybe we can solve the package protected SegmentInfo issue here by creating a new class with the necessary attributes? Here's what LUCENE-1313 does: {code} SegmentReader sr = writer.readerPool.getIfExists(info); if (info.hasDeletions() || (sr != null && sr.hasDeletions())) { {code} Because SegmentInfo is package protected it seems ok to access a package protected method (or in this case variable) in IndexWriter. > LogMergePolicy.findMergesToExpungeDeletes need to get deletes from the > SegmentReader > > > Key: LUCENE-1700 > URL: https://issues.apache.org/jira/browse/LUCENE-1700 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.4.1 >Reporter: Jason Rutherglen >Priority: Trivial > Fix For: 2.9 > > Original Estimate: 48h > Remaining Estimate: 48h > > With LUCENE-1516, deletes are carried over in the SegmentReaders > which means implementations of > MergePolicy.findMergesToExpungeDeletes (such as LogMergePolicy) > need to obtain deletion info from the SR (instead of from the > SegmentInfo which won't have the information).
[jira] Created: (LUCENE-1700) LogMergePolicy.findMergesToExpungeDeletes need to get deletes from the SegmentReader
LogMergePolicy.findMergesToExpungeDeletes need to get deletes from the SegmentReader Key: LUCENE-1700 URL: https://issues.apache.org/jira/browse/LUCENE-1700 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Trivial Fix For: 2.9 With LUCENE-1516, deletes are carried over in the SegmentReaders which means implementations of MergePolicy.findMergesToExpungeDeletes (such as LogMergePolicy) need to obtain deletion info from the SR (instead of from the SegmentInfo which won't have the information).
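The check described above can be sketched in plain Java. This is only an illustration of the logic from the LUCENE-1313 snippet, not Lucene's real classes: `SegmentInfo`, `SegmentReader`, and the field/method names here are minimal hypothetical stand-ins, showing why a merge policy must also consult the pooled reader once deletes live there rather than in the SegmentInfo.

```java
// Hypothetical stand-ins for Lucene's package-private classes; the real
// SegmentInfo/SegmentReader APIs differ. This only models the decision logic.
class SegmentInfo {
    boolean hasDeletions; // deletes already flushed to disk
}

class SegmentReader {
    boolean hasDeletions; // deletes buffered in the pooled reader only
}

public class ExpungeCheck {
    // Mirrors the quoted LUCENE-1313 condition: a segment is a candidate for
    // expunging if EITHER the on-disk info OR the pooled reader has deletes.
    static boolean needsExpunge(SegmentInfo info, SegmentReader pooledReader) {
        return info.hasDeletions
            || (pooledReader != null && pooledReader.hasDeletions);
    }

    public static void main(String[] args) {
        SegmentInfo info = new SegmentInfo();   // nothing recorded on disk
        SegmentReader sr = new SegmentReader();
        sr.hasDeletions = true;                 // deletes only in the reader
        // Checking SegmentInfo alone would miss this segment:
        System.out.println(needsExpunge(info, sr));   // true
        System.out.println(needsExpunge(info, null)); // false
    }
}
```

The point of the `sr != null` guard is that a reader may not be pooled for every segment; when it is absent, only the on-disk information is available.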
[jira] Updated: (LUCENE-1313) Near Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1313: - Attachment: LUCENE-1313.patch The patch is cleaned up. A static variable IndexWriter.GLOBALNRT is added, which allows all the tests to be run with flushToRAM=true. I reran the tests which hopefully still work as intended. Tests that looked for specific file names were changed to work with NRT. Some of the tests are skipped entirely and need to be written specifically for flushToRAM. * TestIndexWriterMergePolicy,TestBackwardsCompatibility failures are expected * TestIndexWriterRAMDir.testFSDirectory fails (will be fixed) * TestThreadedOptimize ensureContiguousMerge fails. This one is a bit mysterious, perhaps the correct assertion will show where it's going wrong. I need to go through and mark the tests that can be converted to be NRT specific. > Near Realtime Search > > > Key: LUCENE-1313 > URL: https://issues.apache.org/jira/browse/LUCENE-1313 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Affects Versions: 2.4.1 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, > LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, > LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, > LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, > lucene-1313.patch, lucene-1313.patch, lucene-1313.patch > > > Enable near realtime search in Lucene without external > dependencies. When RAM NRT is enabled, the implementation adds a > RAMDirectory to IndexWriter. Flushes go to the ramdir unless > there is no available space. Merges are completed in the ram > dir until there is no more available ram. > IW.optimize and IW.commit flush the ramdir to the primary > directory, all other operations try to keep segments in ram > until there is no more space. 
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720984#action_12720984 ] Michael Busch commented on LUCENE-1693: --- Go to bed, I'll review later... in meetings now... > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, > LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, > TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. 
This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. 
> All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback.
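The captureState/restoreState scheme described above can be sketched in a few dozen lines. This is a hedged, self-contained model, not Lucene's actual implementation: the class names echo the description (AttributeImpl, AttributeSource.State), but the bodies are simplified; in particular, this sketch matches attributes back by position, where the real code is more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the cloning scheme described above: each unique
// attribute instance is snapshotted into a singly linked State list, and
// restoreState copies the captured values back via copyTo().
abstract class AttributeImpl implements Cloneable {
    public abstract void copyTo(AttributeImpl target);
    @Override public AttributeImpl clone() {
        try { return (AttributeImpl) super.clone(); }
        catch (CloneNotSupportedException e) { throw new AssertionError(e); }
    }
}

class TermAttributeImpl extends AttributeImpl {
    String term;
    @Override public void copyTo(AttributeImpl target) {
        ((TermAttributeImpl) target).term = term;
    }
}

class AttributeSource {
    // One linked-list node per unique AttributeImpl instance.
    static final class State {
        AttributeImpl attribute;
        State next;
    }

    final List<AttributeImpl> attributes = new ArrayList<>();

    // Snapshot: clone every attribute into a fresh linked list.
    State captureState() {
        State head = null, tail = null;
        for (AttributeImpl a : attributes) {
            State node = new State();
            node.attribute = a.clone();
            if (head == null) head = node; else tail.next = node;
            tail = node;
        }
        return head;
    }

    // Restore: walk the list and copy values back into the live attributes
    // (matched by position here for simplicity).
    void restoreState(State state) {
        int i = 0;
        for (State s = state; s != null; s = s.next) {
            s.attribute.copyTo(attributes.get(i++));
        }
    }
}

public class CaptureStateDemo {
    public static void main(String[] args) {
        AttributeSource src = new AttributeSource();
        TermAttributeImpl term = new TermAttributeImpl();
        src.attributes.add(term);

        term.term = "lucene";
        AttributeSource.State snapshot = src.captureState();

        term.term = "changed";      // mutate the live attribute
        src.restoreState(snapshot); // roll back from the snapshot
        System.out.println(term.term); // prints "lucene"
    }
}
```

This also makes the performance remark above concrete: if several attribute interfaces are backed by one AttributeImpl instance (e.g. a single Token), the capture loop clones one object instead of six.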
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: -- Attachment: LUCENE-1693.patch Small updates, before I go to sleep. This patch removes the incrementToken API from the three caching classes. It also fixes the double cloning of the payload in next() when the token is cloned directly. There is still one small problem: your test -- I hate it... :-( -- fails again if I remove next(Token) from StopFilter or LowerCaseFilter. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, > LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, > TestCompatibility.java > >
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720952#action_12720952 ] Uwe Schindler commented on LUCENE-1693: --- The second test does not work, because it always uses incrementToken by default. By the way, the API docs and behaviour changed with these three classes, TeeTokenFilter, SinkTokenizer and CachingTokenFilter: e.g. getTokens() does not return what is documented. For backwards compatibility we should deprecate the current versions of these classes [and only let them implement next(Token)]. They can then be used even together with the new API, but they always work on Token instances. When I remove incrementToken from them, your test passes completely. For the new API there should be new classes that use AttributeSource and restoreState to cache and so on. But for current backwards compatibility (you mentioned somebody has written a similar thing): if the user's class only uses next(Token), it will work as before. The problem is mixed implementations of the old/new API and different cache contents. This is not a problem of my proposal! Again: we should remove the double implementations everywhere. In these special cases with caches, where the cache should contain a specific class (Tokens or AttributeSource.State), two classes are needed, one deprecated. But: what do you think about my latest patch in general? 
> AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, > lucene-1693.patch, TestCompatibility.java, TestCompatibility.java > >
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720926#action_12720926 ] Uwe Schindler commented on LUCENE-1693: --- Exactly: the problem is in SinkTokenizer. When calling next(Token), the result is cast to Token, which does not work (the iterator contains either Tokens or States, depending on what was added). As SinkTokenizer and TeeTokenFilter may use different APIs, it crashes. The problem with the test is that, depending on chaining with old/new APIs, the iterator may contain the wrong type. This can be fixed by removing next(Token) (preferred) or incrementToken(). The problem is that, depending on chaining, it is not clear which method is called, and the new/old API should not share the same state information. Because the problem is related to the new/old API, we should simply remove the old API from both filters, so they share the same instances in all cases! Then we do not need the UOE. I will look into it and check why the Token in the second test is not preserved. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, > lucene-1693.patch, TestCompatibility.java, TestCompatibility.java > >
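The cast failure described in the comment above can be shown with a toy example. This is a hypothetical illustration, not SinkTokenizer's real code: the `Token` and `State` classes here are bare stand-ins, modeling one cache filled through both the old and the new API while a consumer assumes every entry is a Token.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-ins for the two cache entry types described above:
// old-API callers cache Tokens, new-API callers cache captured States.
class Token { final String text; Token(String t) { text = t; } }
class State { final Object snapshot; State(Object s) { snapshot = s; } }

public class MixedCacheDemo {
    public static void main(String[] args) {
        List<Object> cache = new ArrayList<>();
        cache.add(new Token("filled via next(Token)"));    // old API producer
        cache.add(new State("filled via incrementToken")); // new API producer

        // An old-API consumer blindly doing (Token) entry would throw a
        // ClassCastException on the second element:
        for (Object entry : cache) {
            if (!(entry instanceof Token)) {
                System.out.println("would crash on: "
                    + entry.getClass().getSimpleName());
            }
        }
    }
}
```

Keeping the filters on a single API, as proposed above, makes the cache homogeneous and removes the failure mode entirely.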
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720919#action_12720919 ] Michael Busch commented on LUCENE-1693: --- Btw: SinkTokenizer in my patch has a small bug too. I need to throw a UOE in incrementToken() if it was filled using the old API. It should probably also throw a UOE when someone tries to fill it with both old and new API streams. And the javadocs must make clear that this is not allowed. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, > lucene-1693.patch, TestCompatibility.java, TestCompatibility.java > >
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720913#action_12720913 ] Michael Busch commented on LUCENE-1693: --- You can probably fix CachingTokenFilter and tee/sink to behave correctly. But please remember that a user might have their own implementations of something like a CachingTokenFilter or tee/sink, which must keep working.
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-1693: -- Attachment: TestCompatibility.java Slightly changed tool; it yields on 2.4, and identically on trunk + my patch:
{noformat}
new tokenstream --> proper noun api
new tokenstream --> proper noun api
new tokenstream api
{noformat}
On trunk + your latest patch:
{noformat}
new tokenstream --> proper noun api
new tokenstream api
Exception in thread "main" java.lang.ClassCastException: org.apache.lucene.util.AttributeSource$State
	at org.apache.lucene.analysis.SinkTokenizer.next(SinkTokenizer.java:97)
	at org.apache.lucene.analysis.TestCompatibility.consumeStream(TestCompatibility.java:97)
	at org.apache.lucene.analysis.TestCompatibility.main(TestCompatibility.java:90)
{noformat}
It runs three tests. The first is good with your patch; the second doesn't seem to preserve the right Token subclass; the third throws a ClassCastException. I haven't debugged why...
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: -- Attachment: LUCENE-1693.patch Here is my solution: the three default methods are now optimized to take the shortest path to the iteration method implemented by the subclass. The implemented iteration methods are determined by reflection in initialize(). Cloning is now only done if next() is called directly by a consumer; in all other cases the reusableToken is used for passing the attributes around. The new TokenStream also checks in initialize() that one of the "abstract" methods is overridden. Because of this, TestIndexWriter and the inverter singleton state were updated to at least have an empty incrementToken(). Because of this check, nobody can create a TokenStream that loops indefinitely after calling next() because no pseudo-abstract method was overridden. As incrementToken() will be abstract in the future, it must always be implemented, and this is what I have done.
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720887#action_12720887 ] Michael Busch commented on LUCENE-1693: --- I'm not convinced yet that we will be able to remove the implementations of next() and next(Token). Mark, I'm not familiar with the changes you need to make to the highlighter, but you should not yet rely on next() and next(Token) no longer having to be implemented.
Re: Lucene 2.9 Again
Michael Busch wrote: Everyone who is unhappy with the release TODO's, go back in your mail archive to the 2.2 release and check how many tedious little changes we made to improve the release quality. And besides the maven stuff, there is not really more to do compared to pre-2.2, it's just documented in a more verbose (=RM-friendly) way. I didn't mean to imply anything untoward :) I'm grateful for the work you guys have put into making it all more friendly. I know I have seen many of Mike M's wiki updates on this page too, and I've always been sure it's for the better. Even still, when I look at the process, I remember why I clung to Windows for so long :) Now I'm happily on Ubuntu and can still usually avoid such "fun" work :) I'll happily soldier on though. I just wish it was all in Java :) -- - Mark http://www.lucidimagination.com
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720854#action_12720854 ] Uwe Schindler commented on LUCENE-1693: --- bq. Should I wait to put in the Highlighter update till you guys are done here? You can start with the highlighter. If this patch goes through, we can remove the next() methods from all tokenizers. For consumers like the highlighter, there will no longer be any need to switch between the old and new API. Just use the new API; it will also work with old tokenizers.
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720849#action_12720849 ] Mark Miller commented on LUCENE-1693: - Should I wait to put in the Highlighter update till you guys are done here?
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720846#action_12720846 ] Uwe Schindler commented on LUCENE-1693: --- I have a solution to build in some shortcuts: in initialize() I use reflection (see the earlier patch) to find out which of the three methods is implemented (check whether this.getClass().getMethod(name, params).getDeclaringClass() == TokenStream.class; when this is true, the method was *not* overridden). incrementToken() then checks whether next(Token) or next() is implemented and calls it directly. The same happens in the other classes. Ideally, next() should then never be called. I will post a patch later.
Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. 
I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
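Uwe's reflection trick for detecting whether a subclass actually overrides one of the three methods can be sketched as follows. The `Base` class and method names here are simplified stand-ins, not Lucene's actual `TokenStream` code:

```java
import java.lang.reflect.Method;

// Stand-in for TokenStream: provides a default incrementToken().
abstract class Base {
    public boolean incrementToken() { return false; }
}

class Overriding extends Base {
    @Override public boolean incrementToken() { return true; }
}

class Inheriting extends Base { }

public class OverrideCheck {
    /** True if clazz (or a class between it and Base) overrides incrementToken(). */
    static boolean overridesIncrementToken(Class<? extends Base> clazz) {
        try {
            Method m = clazz.getMethod("incrementToken");
            // If the declaring class is still Base, the method was NOT overridden.
            return m.getDeclaringClass() != Base.class;
        } catch (NoSuchMethodException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(overridesIncrementToken(Overriding.class));  // true
        System.out.println(overridesIncrementToken(Inheriting.class));  // false
    }
}
```

Doing this check once in an initializer is cheap; the per-token calls then dispatch directly without repeating the reflection.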
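The addAttributeImpl() reflection walk described in the issue summary might look roughly like this. The interface and class names are simplified stand-ins for the real Lucene types:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Simplified stand-ins for the org.apache.lucene.util types.
interface Attribute { }
interface TermAttribute extends Attribute { }
interface OffsetAttribute extends Attribute { }

// An impl contributing several Attribute interfaces at once, like Token does.
class MultiAttributeImpl implements TermAttribute, OffsetAttribute { }

public class AttributeWalk {
    /** Walks up the class hierarchy, collecting all implemented Attribute sub-interfaces. */
    static Set<Class<?>> attributeInterfaces(Object impl) {
        Set<Class<?>> found = new LinkedHashSet<>();
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
                    found.add(iface); // real code would map iface -> impl in the attribute map
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<Class<?>> ifaces = attributeInterfaces(new MultiAttributeImpl());
        System.out.println(ifaces.contains(TermAttribute.class));   // true
        System.out.println(ifaces.contains(OffsetAttribute.class)); // true
    }
}
```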
[jira] Updated: (LUCENE-1625) openReaderPassed not populated in CheckIndex.Status.SegmentInfoStatus
[ https://issues.apache.org/jira/browse/LUCENE-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-1625: -- Attachment: CheckIndex.patch Attached patch for exposing all collected stats (created with svn diff > CheckIndex.patch; please correct me if this is not the right way, this is my first patch). This patch breaks out the testing of field norms, terms, stored fields, and term vectors into their own methods. It also creates a status object for each of these tests to make the results transparent. This status object exposes: * stats previously only available from the infoStream * the exception thrown if the test fails (null if the test was successful) Each SegmentInfoStatus will have these status objects attached. NOTE: With this patch, if one of the above tests fails, CheckIndex will attempt to keep testing (to find all failures); any failure will still result in the overall segment being rejected. > openReaderPassed not populated in CheckIndex.Status.SegmentInfoStatus > - > > Key: LUCENE-1625 > URL: https://issues.apache.org/jira/browse/LUCENE-1625 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.4 >Reporter: Tim Smith > Attachments: CheckIndex.patch > > > When using CheckIndex programmatically, the openReaderPassed flag on the > SegmentInfoStatus is never populated (so it always comes back false). > Looking at the code, it's clear that openReaderPassed is defined but never > used. > Furthermore, it appears that not all information that is propagated to the > "InfoStream" is available via SegmentInfoStatus. > All of the following information should be gatherable from public > properties on the SegmentInfoStatus: > test: open reader.OK > test: fields, norms...OK [2 fields] > test: terms, freq, prox...OK [101 terms; 133 terms/docs pairs; 133 tokens] > test: stored fields...OK [100 total field count; avg 1 fields per doc] > test: term vectorsOK [0 total vector count; avg 0 term/freq vector > fields 
per doc] -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
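The pattern Tim describes (one status object per sub-test, recording both stats and any failure instead of aborting the whole check) can be sketched like this; the class and field names are illustrative, not the actual patch:

```java
// Illustrative per-test status holder: stats plus the failure, if any.
class SubTestStatus {
    long itemCount;      // e.g. terms or fields seen, previously only in infoStream
    Throwable error;     // null means the sub-test passed

    boolean passed() { return error == null; }
}

public class SegmentCheckSketch {
    // Each sub-test catches its own failure so the remaining tests still run.
    static SubTestStatus testTerms() {
        SubTestStatus status = new SubTestStatus();
        try {
            status.itemCount = 101;  // stand-in for real term enumeration work
        } catch (Throwable t) {
            status.error = t;        // record, don't rethrow: keep checking the segment
        }
        return status;
    }

    public static void main(String[] args) {
        SubTestStatus terms = testTerms();
        // The segment is still rejected overall if any sub-test recorded a failure.
        System.out.println(terms.passed() + " [" + terms.itemCount + " terms]");
    }
}
```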
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720822#action_12720822 ] Uwe Schindler commented on LUCENE-1693: --- I could change the calling chain: incrementToken() calls next(), which calls next(Token). Would this be better? next(Token) would by default set the delegate to the reusable token. Hmm, thinking about it: where is the degradation then? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720820#action_12720820 ] Michael Busch commented on LUCENE-1693: --- {quote} Ah I understand the problem: As I told, if a consumer (like a filter() calls next(Token) on the underlying filter), which does not implement this or implements the new API, he will get a performance decrease because of cloning. I think, we should simply test this with the benchmarker. Mixing old and new API is always a performance decrease. {quote} Yes, that's what I mean. But I think this will be almost the most common use case: I would think most users have chains that mix core streams/filters with custom filters. Also, I assume most users who need high performance switched from next() to next(Token) by now. These users will see a performance degradation, which I predict will be similar to or worse than going back to using next(), unless they implement the new API in their filters right away. So those users will see a performance hit if they just do a drop-in replacement of the lucene jar.
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720811#action_12720811 ] Uwe Schindler commented on LUCENE-1693: --- Ah, I understand the problem: as I said, if a consumer (like a filter) calls next(Token) on the underlying stream, and that stream does not implement it or implements only the new API, it will get a performance decrease because of cloning. I think we should simply test this with the benchmarker. Mixing the old and new API is always a performance decrease. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720809#action_12720809 ] Uwe Schindler commented on LUCENE-1693: --- The code is almost identical to before; the old code also copied the token to make it a full private copy. There are three modes of operation:
- if incrementToken() is implemented, the docinverter will use it (the code always calls incrementToken(), so no indirection)
- if next(Token) is implemented, the docinverter will call incrementToken(), which is forwarded to next(Token), which is cheap
- if only next() is implemented, the docinverter will call incrementToken(), which forwards to next(Token), and this forwards to next(). But this is identical to before, only one indirection more: the old code saw useNewAPI == false and called next(Token), which forwarded to next().
So for indexing using the normal indexing components (docinverter), the code never clones more than with your code. There is one other case: if you have an old consumer calling next(Token) while the tokenizer only implements incrementToken(), then you will get a performance degradation. But this is not the indexing case; it is e.g. reusing the tokenizer in a very old QueryParser. I did not find a good way to delegate this special case directly to incrementToken(). The problem is also that incrementToken() uses the internal buffer and not the supplied buffer.
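The delegation Uwe outlines can be sketched as a chain of default implementations, where only the lowest overridden method does real work. All names here are simplified stand-ins for the actual TokenStream classes:

```java
// Simplified stand-in for the Token class.
class Token {
    String term;
    Token() { }
    Token(String term) { this.term = term; }
}

// Stand-in base class wiring the old and new APIs together.
abstract class LegacyBridgeStream {
    // New API: by default forwards down to the old reusable-token API.
    public boolean incrementToken() {
        return next(new Token()) != null;
    }
    // Old reusable-token API: by default forwards to the oldest API.
    public Token next(Token reusableToken) {
        return next();
    }
    // Oldest API: by default signals end-of-stream.
    public Token next() {
        return null;
    }
}

// A stream that only overrides the oldest next(); newer entry points still work.
class OldStyleStream extends LegacyBridgeStream {
    private int emitted = 0;
    @Override public Token next() {
        return emitted++ < 2 ? new Token("t" + emitted) : null;
    }
}

public class ChainDemo {
    public static void main(String[] args) {
        OldStyleStream s = new OldStyleStream();
        System.out.println(s.incrementToken()); // true  (forwarded down to next())
        System.out.println(s.incrementToken()); // true
        System.out.println(s.incrementToken()); // false (end of stream)
    }
}
```

With the reflection shortcut from the earlier comment, the base class could skip the intermediate hops and call the one overridden method directly.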
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720794#action_12720794 ] Michael Busch commented on LUCENE-1693: --- I'm looking at TokenStream.next(): {code:java} public Token next(final Token reusableToken) throws IOException { // We don't actually use reusableToken, but still add this assert assert reusableToken != null; checkTokenWrapper(); return next(); } /** Returns the next token in the stream, or null at EOS. * @deprecated The returned Token is a "full private copy" (not * re-used across calls to next()) but will be slower * than calling {@link #next(Token)} instead. */ public Token next() throws IOException { checkTokenWrapper(); if (incrementToken()) { final Token token = (Token) tokenWrapper.delegate.clone(); Payload p = token.getPayload(); if (p != null) { token.setPayload((Payload) p.clone()); } return token; } return null; } {code} This seems like a big performance hit for users of the old API, no? Now every single Token will be cloned, even if they implement next(Token), as soon as they have one filter in the chain that doesn't implement the new API yet.
Re: Lucene 2.9 Again
+1 Michael On 6/17/09 10:32 AM, Mark Miller wrote: Michael Busch wrote: We should just not put more items in the 2.9 list anymore (except bug fixes of course) and then fix the 30 issues and don't rush them too much. If it takes until end of July I think that's acceptable. A good quality of the release should be highest priority in my opinion. Michael I agree. Our approach so far has not been to rush the issues that are outstanding, but to pressure a move to 3.1 if you don't think you can finish it reasonably soon. I'd expect the committers to stick with their normal standards for committing code, and I plan to as well. On the other hand, it's also probably not a great idea for a bunch of huge changes to hit trunk right before release with no time to go through dev use. So I still think that, unless it's an important issue for 2.9 specifically, if you can't finish it by fairly early July-ish, you should push to 3.1. - Mark - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
Michael Busch wrote: We should just not put more items in the 2.9 list anymore (except bug fixes of course) and then fix the 30 issues and don't rush them too much. If it takes until end of July I think that's acceptable. A good quality of the release should be highest priority in my opinion. Michael I agree. Our approach so far has not been to rush the issues that are outstanding, but to pressure a move to 3.1 if you don't think you can finish it reasonably soon. I'd expect the committers to stick with their normal standards for committing code, and I plan to as well. On the other hand, it's also probably not a great idea for a bunch of huge changes to hit trunk right before release with no time to go through dev use. So I still think that, unless it's an important issue for 2.9 specifically, if you can't finish it by fairly early July-ish, you should push to 3.1. - Mark - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
That means the release frequency should not exceed the new-committer frequency. :) On 6/17/09 10:09 AM, Mark Miller wrote: Michael Busch wrote: One?!? I did 2.2, 2.3, 2.3.1, 2.3.2! What can you do ... there was no new guy to relieve you :) - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
We should just not put more items in the 2.9 list anymore (except bug fixes of course) and then fix the 30 issues and don't rush them too much. If it takes until end of July I think that's acceptable. A good quality of the release should be highest priority in my opinion. Michael On 6/17/09 10:09 AM, Mark Miller wrote: Michael Busch wrote: wanted to get 2.9 out really really soon. really, really is probably not totally accurate. I just know how things can get drawn out. Even still, we have 30-some issues to resolve. If we don't make a drive though, when will 2.9 come out? Next fall at the earliest? Later? So much goodness to give to the users out there already. And Java 1.5 waiting for us. And removing all of these deprecations. We don't have to release tomorrow, but let's get this out there! - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
Michael Busch wrote: wanted to get 2.9 out really really soon. really, really is probably not totally accurate. I just know how things can get drawn out. Even still, we have 30-some issues to resolve. If we don't make a drive though, when will 2.9 come out? Next fall at the earliest? Later? So much goodness to give to the users out there already. And Java 1.5 waiting for us. And removing all of these deprecations. We don't have to release tomorrow, but let's get this out there! -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
Michael Busch wrote: One?!? I did 2.2, 2.3, 2.3.1, 2.3.2! What can you do ... there was no new guy to relieve you :) -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
On 6/17/09 6:23 AM, Mark Miller wrote: I have a special gift in not being clear. I was just saying "be prepared, your turn is coming ;) " But I haven't done a release myself - we don't release that often despite discussion that we should release more often every year or so. I did notice though, that Mike did the release right after joining, and Michael did a release right after joining, and so ... looks like I am next in line followed by you. One?!? I did 2.2, 2.3, 2.3.1, 2.3.2! Everyone who is unhappy with the release TODOs, go back in your mail archive to the 2.2 release and check how many tedious little changes we made to improve the release quality. And besides the maven stuff, there is not really more to do compared to pre-2.2, it's just documented in a more verbose (=RM-friendly) way. The maven stuff is also pretty simple... just for signing the artifacts I hacked a tool, because that gets tedious otherwise. When we're at that point I can try to dig it up... I think Mike has such a tool too. Michael - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
I'm happy to hear that :) I suggested 2-3 weeks to prevent the schedule from being even tighter, as it sounded like you guys wanted to get 2.9 out really really soon. I'm really busy the rest of June and will have much more time for Lucene in July. So if we could wait until the end of July before we do the code freeze, and get 2.9 out early August, that'd mean much less sleep deprivation for me! And the likelihood that I'll get all my stuff in would be much higher... Michael On 6/17/09 5:43 AM, Michael McCandless wrote: On Tue, Jun 16, 2009 at 6:06 PM, Michael Busch wrote: How soon is soon? Code freeze in 2-3 weeks or so maybe? Then 7-10 days testing, so 2.9 should be out mid July? Sounds reasonable? This schedule might be tight for me... I'm "on vacation" for the week starting Jun 29. Hopefully I can get most of my issues done before then, but that's a week and a half left at this point :) Mike
[jira] Commented: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720696#action_12720696 ] Robert Muir commented on LUCENE-1692: - Michael, ok. I know additional tests here (against the old api) might be more code to convert, but I think it will actually make the process easier, whenever that is or whatever is involved. I have some time this evening to try to improve the coverage here (against the old api). > Contrib analyzers need tests > > > Key: LUCENE-1692 > URL: https://issues.apache.org/jira/browse/LUCENE-1692 > Project: Lucene - Java > Issue Type: Test > Components: contrib/analyzers >Reporter: Robert Muir >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1692.txt, LUCENE-1692.txt > > > The analyzers in contrib need tests, preferably ones that test the behavior > of all the Token 'attributes' involved (offsets, type, etc) and not just what > they do with token text. > This way, they can be converted to the new api without breakage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720692#action_12720692 ] Shai Erera commented on LUCENE-1693: You can run tokenize.alg which invokes the ReadTokenTask, which iterates on a TokenStream. You'll probably need to modify the .alg file to create a different analyzer/token stream each time, and I think this can be done by the "rounds" syntax in benchmark. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. 
> - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. 
> Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback.
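The reflection walk that addAttributeImpl is described as doing above can be sketched in plain Java without any Lucene dependency. The Attribute marker interface and the TokenLike class below are simplified, hypothetical stand-ins for the real API, and unlike the real implementation this sketch does not also recurse into super-interfaces of the interfaces it finds.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AttributeWalkDemo {
    // Simplified stand-ins for Lucene's Attribute marker interface and
    // two of the token attribute interfaces (hypothetical, for illustration).
    interface Attribute {}
    interface TermAttribute extends Attribute {}
    interface OffsetAttribute extends Attribute {}

    // A Token-like class implementing several attribute interfaces at once.
    static class TokenLike implements TermAttribute, OffsetAttribute {}

    // Walk up the class hierarchy of the passed-in object and map every
    // implemented interface extending Attribute to the single instance,
    // mirroring the interface->instance mapping described in the issue.
    static Map<Class<?>, Object> mapInterfaces(Object impl) {
        Map<Class<?>, Object> attributes = new LinkedHashMap<>();
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
                    attributes.put(iface, impl);
                }
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        Map<Class<?>, Object> m = mapInterfaces(new TokenLike());
        System.out.println(m.size());
        // One shared instance now backs both attribute interfaces:
        System.out.println(m.get(TermAttribute.class) == m.get(OffsetAttribute.class));
    }
}
```

A consumer asking for either interface then gets the same shared instance, which is why registering a single Token-like implementation keeps captureState down to one clone instead of one per attribute.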
Re: Lucene 2.9 Again
Let's not forget Nutch... Also, for that matter, Mahout uses Lucene's Analysis and Core (in fact, I just committed MAHOUT-126, which allows one to create Vectors from a Lucene index!), though only as consumers; I doubt there is a need for Mahout committers to change Lucene. On Jun 17, 2009, at 10:04 AM, Michael McCandless wrote: I agree. I'm picturing some hopefully-not-that-distant future when we have a queries "module" and analysis "module" that live quite separately from Lucene & Solr's "core", and committers from both Solr and Lucene would work on it. Mike On Wed, Jun 17, 2009 at 9:01 AM, Grant Ingersoll wrote: On Jun 17, 2009, at 4:42 AM, Michael McCandless wrote: I would love to see function queries consolidated between Solr and Lucene! I think it's a prime example of duplicated and then diverged sources between Lucene and Solr... The primary reason it's diverged is that it gets a lot of attention in Solr and near zero in Lucene. You rarely see someone on java-user ask about function queries. In Solr, it's a regular solution to many problems. So, just like the analysis problem, it strikes me as one of those areas where, if it is going to be done and maintained, Solr committers need write access. -Grant -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: Lucene 2.9 Again
On Jun 17, 2009, at 10:11 AM, Yonik Seeley wrote: On Wed, Jun 17, 2009 at 8:57 AM, Grant Ingersoll wrote: On Jun 16, 2009, at 7:16 PM, Yonik Seeley wrote: There are parts that aren't strictly part of the release process IMO - things like maven seem optional. -1. Maven support is not optional. I can't always follow Lucene closely, but I'm pretty sure it never became mandatory in Solr, and it's never been a part of any kind of ASF release requirements. It's nice if the release manager feels like doing it... but it also seems like it can be done after the fact (for maven or other release mechanisms) by those who care more about those. It's pretty much the only way I consume Lucene and Solr anymore... So, yeah, I'll make sure it happens. In Solr and Lucene, generating the artifacts is automatic anyway. The only manual part is copying them up to the server. I think people can handle doing an scp.
[jira] Updated: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1673: -- Attachment: LUCENE-1673.patch Here is some intermediate update... > Move TrieRange to core > -- > > Key: LUCENE-1673 > URL: https://issues.apache.org/jira/browse/LUCENE-1673 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 2.9 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 2.9 > > Attachments: LUCENE-1673.patch, LUCENE-1673.patch, LUCENE-1673.patch, > LUCENE-1673.patch > > > TrieRange was iterated many times and seems stable now (LUCENE-1470, > LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to > its default FieldTypes (SOLR-940) and if possible I want to move it to core > before release of 2.9. > Before this can be done, there are some things to think about: > # There are now classes called LongTrieRangeQuery, IntTrieRangeQuery; how > should they be called in core? I would suggest leaving it as it is. On the > other hand, if this remains our only numeric query implementation, we could > call it LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below, here > are problems). Same for the TokenStreams and Filters. > # Maybe the pairs of classes for indexing and searching should be moved into > one class: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The > problem here: ctors must be able to pass int, long, double, float as range > parameters. For the end user, mixing these 4 types in one class is hard to > handle. If somebody forgets to add an L to a long, it suddenly instantiates an > int version of range query, hitting no results and so on. Same with other > types. Maybe accept java.lang.Number as parameter (because nullable for > half-open bounds) and one enum for the type. > # TrieUtils move into o.a.l.util? or document or? 
> # Move TokenStreams into o.a.l.analysis, ShiftAttribute into > o.a.l.analysis.tokenattributes? Somewhere else? > # If we rename the classes, should Solr stay with Trie (because there are > different impls)? > # Maybe add a subclass of AbstractField, that automatically creates these > TokenStreams and omits norms/tf per default for easier addition to Document > instances?
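The pitfall Uwe describes in point 2 above, where a forgotten L suffix silently selects the int variant, is ordinary Java overload resolution. The overloaded newRange factory below is a hypothetical stand-in (not a real Lucene method) used only to show the mechanism.

```java
public class OverloadPitfallDemo {
    // Hypothetical overloaded factory standing in for a combined numeric
    // range query class with constructors for each primitive type.
    static String newRange(int lower, int upper)   { return "int range";  }
    static String newRange(long lower, long upper) { return "long range"; }

    public static void main(String[] args) {
        // The caller intends a long range but forgets the L suffix,
        // so overload resolution silently picks the int variant:
        System.out.println(newRange(1000000, 2000000));   // prints "int range"
        // With the suffix, the intended overload is chosen:
        System.out.println(newRange(1000000L, 2000000L)); // prints "long range"
    }
}
```

Because an int-typed query would then be run against terms indexed with the long trie encoding, the mismatch hits no results without any compile-time warning, which is why the issue suggests accepting java.lang.Number plus an explicit type enum instead.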
Re: Lucene 2.9 Again
On Wed, Jun 17, 2009 at 8:57 AM, Grant Ingersoll wrote: > On Jun 16, 2009, at 7:16 PM, Yonik Seeley wrote: >> There are parts that aren't strictly part of the release process IMO - >> things like maven seem optional. > > -1. Maven support is not optional. I can't always follow Lucene closely, but I'm pretty sure it never became mandatory in Solr, and it's never been a part of any kind of ASF release requirements. It's nice if the release manager feels like doing it... but it also seems like it can be done after the fact (for maven or other release mechanisms) by those who care more about those. -Yonik http://www.lucidimagination.com
Re: Lucene 2.9 Again
I agree. I'm picturing some hopefully-not-that-distant future when we have a queries "module" and analysis "module" that live quite separately from Lucene & Solr's "core", and committers from both Solr and Lucene would work on it. Mike On Wed, Jun 17, 2009 at 9:01 AM, Grant Ingersoll wrote: > > On Jun 17, 2009, at 4:42 AM, Michael McCandless wrote: > >> I would love to see function queries consolidated between Solr and >> Lucene! I think it's a prime example of duplicated and then diverged >> sources between Lucene and Solr... > > The primary reason it's diverged is it gets a lot of attention on Solr and > near zero in Lucene. You rarely see someone on java-user ask about function > queries. In Solr, it's a regular solution to many problems. So, just like > the analysis problem, it strikes me as one of those areas that if it is > going to be done, and maintained, then Solr committers need write access. > > -Grant
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720676#action_12720676 ] Uwe Schindler commented on LUCENE-1693: --- Hi Michael, I did not do any performance tests until now; I think you have the better knowledge about measuring tokenization performance. Important would be to compare the performance of: - Old API with useNewAPI=true - Old API with useNewAPI=false - My impl with defaults (onlyUseNewAPI=false) - My impl with onlyUseNewAPI=true For all tests, you should only use conformant streams (e.g. from core). A good additional test would be to create a chain that has completely implemented incrementToken() and one only supplying next() for some chain entries. Is this hard to do?
[jira] Updated: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1630: --- Attachment: LUCENE-1630.patch * Collector's acceptsDocsOutOfOrder is abstract - this was a really good change, since I completely forgot to override it in all home-brewed Collectors to return true where applicable. I was also surprised to see that <5 collectors actually should return false (most of them in tests). * I added QueryWeight variants to Searchable and implemented them in RemoteSearchable. * Mike - I'm afraid I did some more code cleanup (not much though) - that was before I saw your last comment, sorry. * Handled the rest of the latest comments. All tests pass. > Mating Collector and Scorer on doc Id orderness > --- > > Key: LUCENE-1630 > URL: https://issues.apache.org/jira/browse/LUCENE-1630 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch, > LUCENE-1630.patch > > > This is a spin off of LUCENE-1593. This issue proposes to expose appropriate > API on Scorer and Collector such that one can create an optimized Collector > based on a given Scorer's doc-id orderness and vice versa. Copied from > LUCENE-1593, here is the list of changes: > # Deprecate Weight and create QueryWeight (abstract class) with a new > scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) > method. QueryWeight implements Weight, while score(reader) calls > score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) > is defined abstract. > #* Also add QueryWeightWrapper to wrap a given Weight implementation. This > one will also be deprecated, as well as package-private. > #* Add to Query variants of createWeight and weight which return QueryWeight. 
> For now, I prefer to add a default impl which wraps the Weight variant > instead of overriding in all Query extensions, and in 3.0 when we remove the > Weight variants - override in all extending classes. > # Add to Scorer isOutOfOrder with a default to false, and override in BS to > true. > # Modify BooleanWeight to extend QueryWeight and implement the new scorer > method to return BS2 or BS based on the number of required scorers and > setAllowOutOfOrder. > # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns > true/false. > #* Use it in IndexSearcher.search methods, that accept a Collector, in order > to create the appropriate Scorer, using the new QueryWeight. > #* Provide a static create method to TFC and TSDC which accept this as an > argument and creates the proper instance. > #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order > Scorer and check on the resulting Scorer isOutOfOrder(), so that we can > create the optimized Collector instance. > # Modify IndexSearcher to use all of the above logic. > The only class I'm worried about, and would like to verify with you, is > Searchable. If we want to deprecate all the search methods on IndexSearcher, > Searcher and Searchable which accept Weight and add new ones which accept > QueryWeight, we must do the following: > * Deprecate Searchable in favor of Searcher. > * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) > break back-compat and add them as abstract (like we've done with the new > Collector method) or (2) add them with a default impl to call the Weight > versions, documenting these will become abstract in 3.0. > * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend > Searcher. 
That's the part I'm a little bit worried about - Searchable > implements java.rmi.Remote, which means there could be an implementation out > there which implements Searchable and extends something different than > UnicastRemoteObject, like Activatable. I think there is a very small chance this > has actually happened, but would like to confirm with you guys first. > * Add a deprecated, package-private, SearchableWrapper which extends Searcher > and delegates all calls to the Searchable member. > * Deprecate all uses of Searchable and add Searcher instead, defaulting the > old ones to use SearchableWrapper. > * Make all the necessary changes to IndexSearcher, MultiSearcher etc. > regarding overriding these new methods. > One other optimization that was discussed in LUCENE-1593 is to expose a > topScorer() API (on Weight) which returns a Scorer whose score(Collector) > will be called, and additionally add a start() method to DISI. That will > allow Scorers to initialize either on start() or score(Collector). This was > proposed mainly because of BS and BS2, which check if they are initialized
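The matching step the issue describes (ask for an out-of-order scorer, then pick the collector variant based on what the scorer reports) can be sketched with simplified stand-in types. These interfaces and classes are illustrations of the proposal, not the actual Lucene classes.

```java
public class OrdernessDemo {
    // Simplified stand-ins for the proposed Scorer/Collector orderness hooks.
    interface Scorer { boolean isOutOfOrder(); }
    interface Collector { boolean acceptsDocsOutOfOrder(); }

    // Two collector variants, as with the in-order and out-of-order
    // flavors of TFC/TSDC that the issue proposes static create methods for.
    static class InOrderCollector implements Collector {
        public boolean acceptsDocsOutOfOrder() { return false; }
    }
    static class OutOfOrderCollector implements Collector {
        public boolean acceptsDocsOutOfOrder() { return true; }
    }

    // IndexSearcher-side logic: having asked for an out-of-order scorer,
    // create the optimized collector the resulting scorer can actually feed.
    static Collector createCollector(Scorer scorer) {
        return scorer.isOutOfOrder() ? new OutOfOrderCollector() : new InOrderCollector();
    }

    public static void main(String[] args) {
        // e.g. BooleanScorer delivers docs out of order, BooleanScorer2 in order:
        System.out.println(createCollector(() -> true).getClass().getSimpleName());
        System.out.println(createCollector(() -> false).getClass().getSimpleName());
    }
}
```

The point of the pairing is that an out-of-order-tolerant collector can skip the docID comparison needed to break score ties, so the searcher should only fall back to the stricter in-order variant when the scorer forces it.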
Re: Lucene 2.9 Again
I have a special gift in not being clear. I was just saying "be prepared, your turn is coming ;)" But I haven't done a release myself - we don't release that often, despite discussion every year or so that we should release more often. I did notice, though, that Mike did the release right after joining, and Michael did a release right after joining, and so ... looks like I am next in line, followed by you. I'd be happy to split some of the work if it's possible, though - then perhaps we can both get our feet wet without having the full load of that wiki. I'm up for either way. Looks like we have some time to work it out. - Mark Uwe Schindler wrote: Uwe Schindler wrote: Maybe Mark helps me and I can do it alone the next time, if I have to? :-) Tag team effort? It will be my first release too, so that would be great! Ah ok, I interpreted your mail differently yesterday (but it was 1 or 2 am in Germany...). Uwe -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 15, 2009, at 2:11 PM, Grant Ingersoll wrote: More questions: 1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor? 2. Same for TermVectors. What if the user specifies with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit TF stuff) 3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it. Also, can we add something like an AttributeTermQuery? Seems like it could work similarly to the BoostingTermQuery. So, I think I see #1 covered; how about #2, #3 and the notion of an AttributeTermQuery? Anyone have thoughts on those? I might have some time next week to work up a Query, as it sounds like fun, but don't hold me to it just yet.
RE: Lucene 2.9 Again
> Uwe Schindler wrote: > > Maybe Mark helps me and I can do > > it alone the next time, if I have to? :-) > > > Tag team effort? It will be my first release too, so that would be great! Ah ok, I interpreted your mail differently yesterday (but it was 1 or 2 am in Germany...). Uwe
Re: Lucene 2.9 Again
On Jun 17, 2009, at 4:42 AM, Michael McCandless wrote: I would love to see function queries consolidated between Solr and Lucene! I think it's a prime example of duplicated and then diverged sources between Lucene and Solr... The primary reason it's diverged is it gets a lot of attention on Solr and near zero in Lucene. You rarely see someone on java-user ask about function queries. In Solr, it's a regular solution to many problems. So, just like the analysis problem, it strikes me as one of those areas that if it is going to be done, and maintained, then Solr committers need write access. -Grant
RE: Lucene 2.9 Again
> On Jun 16, 2009, at 7:16 PM, Yonik Seeley wrote: > > > On Tue, Jun 16, 2009 at 6:37 PM, Mark Miller > > wrote: > > There are parts that aren't strictly part of the release process IMO - > > things like maven seem optional. > > -1. Maven support is not optional. > > +1 for more automation. For the record, once setup, Maven (as opposed > to Ant) release (i.e. on Mahout http://cwiki.apache.org/MAHOUT/how-to- > release.html) > consists of far fewer steps. The only manual ones after one-time > setup are the announcements and the copy from staging to release (and > even that, I think, can be done better using Nexus). Note, I'm not > voting to change to Maven, just saying there is room for automation. Please no maven! :( Uwe
Re: Lucene 2.9 Again
Uwe Schindler wrote: > Maybe Mark helps me and I can do it alone the next time, if I have to? :-) Tag team effort? It will be my first release too, so that would be great! -- - Mark http://www.lucidimagination.com
Re: Lucene 2.9 Again
On Jun 16, 2009, at 7:16 PM, Yonik Seeley wrote: On Tue, Jun 16, 2009 at 6:37 PM, Mark Miller wrote: There are parts that aren't strictly part of the release process IMO - things like maven seem optional. -1. Maven support is not optional. +1 for more automation. For the record, once set up, a Maven (as opposed to Ant) release (i.e. on Mahout http://cwiki.apache.org/MAHOUT/how-to-release.html) consists of far fewer steps. The only manual ones after one-time setup are the announcements and the copy from staging to release (and even that, I think, can be done better using Nexus). Note, I'm not voting to change to Maven, just saying there is room for automation. -Grant
RE: Lucene 2.9 Again
I also tend to a little bit later; maybe we need more discussions about NumericField and NumericSortField, especially between the two factions, Mike vs. Yonik :-) After finishing the TokenStream simplification and optimization, I will now again start rewriting the javadocs for trie, and hopefully I can commit in a day-or-two(TM). Maybe start RCs in second quarter of July? Maybe Mark helps me and I can do it alone the next time, if I have to? :-) - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Michael McCandless [mailto:luc...@mikemccandless.com] > Sent: Wednesday, June 17, 2009 2:43 PM > To: java-dev@lucene.apache.org > Subject: Re: Lucene 2.9 Again > > On Tue, Jun 16, 2009 at 6:06 PM, Michael Busch wrote: > > > How soon is soon? Code freeze in 2-3 weeks or so maybe? Then 7-10 days > > testing, so 2.9 should be out mid July? Sounds reasonable? > > This schedule might be tight for me... I'm "on vacation" for the week > starting Jun 29. Hopefully I can get most of my issues done before then, > but that's a week and a half left at this point :) > > Mike
Re: Lucene 2.9 Again
On Tue, Jun 16, 2009 at 6:06 PM, Michael Busch wrote: > How soon is soon? Code freeze in 2-3 weeks or so maybe? Then 7-10 days > testing, so 2.9 should be out mid July? Sounds reasonable? This schedule might be tight for me... I'm "on vacation" for the week starting Jun 29. Hopefully I can get most of my issues done before then, but that's a week and a half left at this point :) Mike
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: -- Attachment: (was: LUCENE-1693.patch) > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. 
> - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if multiple > AttributeImpl instances are not used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even more significantly from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. 
> All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
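The reflection walk that the patch describes for addAttributeImpl() can be sketched in plain Java. Note this is a hedged, self-contained illustration: the Attribute/TokenImpl types below are minimal hypothetical stand-ins, not the real Lucene classes, and for brevity the sketch only inspects each class's direct interfaces rather than the full superinterface closure.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical minimal stand-ins for Lucene's Attribute marker interface
// and an AttributeImpl-style class, to illustrate the reflection walk.
public class AttributeReflectionSketch {
    interface Attribute {}                        // marker interface
    interface TermAttribute extends Attribute {}
    interface OffsetAttribute extends Attribute {}

    // One impl class implementing several Attribute interfaces (like Token).
    static class TokenImpl implements TermAttribute, OffsetAttribute, Cloneable {}

    // Walk up the class hierarchy and collect every directly implemented
    // interface that extends Attribute -- the idea behind addAttributeImpl().
    @SuppressWarnings("unchecked")
    static Set<Class<? extends Attribute>> findAttributeInterfaces(Class<?> clazz) {
        Set<Class<? extends Attribute>> found = new HashSet<>();
        for (Class<?> c = clazz; c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
                    found.add((Class<? extends Attribute>) iface);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Register one shared instance under every Attribute interface it
        // implements, mirroring the interface->instance map in the patch.
        TokenImpl impl = new TokenImpl();
        Map<Class<? extends Attribute>, Object> attributes = new HashMap<>();
        for (Class<? extends Attribute> iface : findAttributeInterfaces(impl.getClass())) {
            attributes.put(iface, impl);  // same instance for every interface
        }
        System.out.println(attributes.size());  // 2: TermAttribute, OffsetAttribute
    }
}
```

Cloneable is skipped because it does not extend the Attribute marker; only interfaces reachable through Attribute end up in the map.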
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: -- Attachment: LUCENE-1693.patch Sorry, small bug in cloning inside next(): the POSToken test was failing again. But now it also works correctly. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
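The captureState()/restoreState() scheme described in this thread can be sketched as a hypothetical miniature: unique attribute impls are chained into a singly linked list (AttributeSource.State in the patch) that supports deep cloning, and restoring copies the captured values back into the live attributes. These classes are simplified stand-ins, not the real Lucene ones; in particular, this sketch restores by walking the two lists in parallel and assumes the state came from the same source, rather than matching impls by class.

```java
import java.util.ArrayList;
import java.util.List;

public class StateSketch {

    // Simplified AttributeImpl: cloneable, with copyTo() as in the patch.
    static abstract class AttributeImpl implements Cloneable {
        @Override public AttributeImpl clone() {
            try {
                return (AttributeImpl) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
        abstract void copyTo(AttributeImpl target);
    }

    static class TermAttributeImpl extends AttributeImpl {
        String term;
        @Override void copyTo(AttributeImpl target) {
            ((TermAttributeImpl) target).term = term;
        }
    }

    // One node per unique attribute impl, like AttributeSource.State.
    static final class State implements Cloneable {
        AttributeImpl attribute;
        State next;
        @Override public State clone() {
            State s = new State();
            s.attribute = attribute.clone();          // deep-clone this node
            if (next != null) s.next = next.clone();  // and the rest of the list
            return s;
        }
    }

    private final List<AttributeImpl> impls = new ArrayList<>();
    private State currentState;  // cached; invalidated when attributes change

    void addAttributeImpl(AttributeImpl impl) {
        impls.add(impl);
        currentState = null;  // state changed, must be recomputed
    }

    private State computeState() {
        State head = null;
        for (int i = impls.size() - 1; i >= 0; i--) {
            State s = new State();
            s.attribute = impls.get(i);  // nodes reference the *live* impls
            s.next = head;
            head = s;
        }
        return head;
    }

    // Clone the cached linked list instead of walking a map of attributes.
    State captureState() {
        if (currentState == null) currentState = computeState();
        return currentState.clone();
    }

    // Walk both lists in parallel, copying captured values into live impls.
    void restoreState(State captured) {
        if (currentState == null) currentState = computeState();
        for (State c = captured, l = currentState; c != null && l != null;
             c = c.next, l = l.next) {
            c.attribute.copyTo(l.attribute);
        }
    }

    public static void main(String[] args) {
        StateSketch source = new StateSketch();
        TermAttributeImpl term = new TermAttributeImpl();
        source.addAttributeImpl(term);
        term.term = "hello";
        State saved = source.captureState();  // deep copy holding "hello"
        term.term = "world";                  // mutate the live attribute
        source.restoreState(saved);           // captured value copied back
        System.out.println(term.term);        // prints "hello"
    }
}
```

The caching is the point of the design: only when attributes are added does the list get rebuilt; otherwise capture is a single clone of a precomputed chain.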
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: -- Attachment: LUCENE-1693.patch Attached is a new patch that implements the last idea: - There is no more copying of Tokens, so the API should have (almost) the same speed as before. - By default, TokenStreams/TokenFilters using the old and new APIs can be mixed freely in a chain (a test that explicitly verifies this is still missing). The drawback is that there is only *one* attribute instance, called TokenWrapper (package private), that manages the exchange of the underlying Token instance. - If the user knows that all tokenizers in his JVM implement incrementToken() and do not fall back to next(), he can increase speed with the static setter setOnlyUseNewAPI(true). In this case, no TokenWrapper is initialized and the code uses the normal Attribute factory to generate the attributes. If some old code is still in the chain, or your consumer calls next(), you will get an UnsupportedOperationException during tokenization. The same happens if you override initialize() and instantiate your attributes manually without super.initialize(). - When the old API is removed, TokenWrapper and large parts inside TokenStream can be removed and incrementToken() made abstract. This is identical to setting onlyUseNewAPI to true. - The API setting can only be static, because the attribute instances are generated during construction of the streams, so a later downgrade to TokenWrapper is not possible. Documentation inside this patch requires that at least all core tokenizers and consumers be conformant, so one must be able to set TokenStream.setOnlyUseNewAPI(true) and then use StandardAnalyzer without any problem. When contrib is transformed, we can extend this to contrib. Because the code wraps the old API completely, all converted streams can be changed to implement only incrementToken() using attributes. Super's TokenStream.next() and next(Token) manage the rest. 
There is no speed degradation from this, and it is safe to remove (and all will be happy)! Uwe > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
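The backward-compatibility idea described in this thread, where the base class serves the old next() API on top of the new incrementToken() so converted streams implement only the new method, can be sketched with tiny hypothetical stand-ins. These are not the real Lucene 2.9 classes, and the TokenWrapper indirection and attribute factory are deliberately omitted.

```java
public class CompatSketch {

    static class Token {
        String term;
    }

    static abstract class TokenStream {
        // Stand-in for the single wrapped Token that TokenWrapper manages.
        final Token sharedToken = new Token();

        // New API: subclass fills sharedToken and reports whether a token exists.
        public abstract boolean incrementToken();

        // Old API, implemented once in the base class on top of the new one,
        // like the patch's TokenStream.next()/next(Token) "manage the rest".
        public final Token next() {
            return incrementToken() ? sharedToken : null;
        }
    }

    static class WhitespaceTokenizer extends TokenStream {
        private final String[] words;
        private int index = 0;

        WhitespaceTokenizer(String text) {
            words = text.split("\\s+");
        }

        @Override
        public boolean incrementToken() {
            if (index >= words.length) return false;
            sharedToken.term = words[index++];  // reuse the shared instance
            return true;
        }
    }

    public static void main(String[] args) {
        // An old-API consumer keeps working against a new-API-only tokenizer.
        TokenStream stream = new WhitespaceTokenizer("hello new api");
        for (Token t = stream.next(); t != null; t = stream.next()) {
            System.out.println(t.term);  // hello, new, api
        }
    }
}
```

Because next() is defined once in the base class, a converted stream carries no old-API code of its own, which is what makes the eventual removal of the old API cheap.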
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720632#action_12720632 ] Shai Erera commented on LUCENE-1630: {quote} You forgot to fill in the "?" in CHANGES I guess you're looking at the previous patch. It already has your name in the latest {quote} Sorry, you're right - there are two sections in CHANGES which I've added text to, and I put your name in the second one only. > Mating Collector and Scorer on doc Id orderness > --- > > Key: LUCENE-1630 > URL: https://issues.apache.org/jira/browse/LUCENE-1630 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch > > > This is a spin off of LUCENE-1593. This issue proposes to expose appropriate > API on Scorer and Collector such that one can create an optimized Collector > based on a given Scorer's doc-id orderness and vice versa. Copied from > LUCENE-1593, here is the list of changes: > # Deprecate Weight and create QueryWeight (abstract class) with a new > scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) > method. QueryWeight implements Weight, while score(reader) calls > score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) > is defined abstract. > #* Also add QueryWeightWrapper to wrap a given Weight implementation. This > one will also be deprecated, as well as package-private. > #* Add to Query variants of createWeight and weight which return QueryWeight. > For now, I prefer to add a default impl which wraps the Weight variant > instead of overriding in all Query extensions, and in 3.0 when we remove the > Weight variants - override in all extending classes. > # Add to Scorer isOutOfOrder with a default to false, and override in BS to > true. 
> # Modify BooleanWeight to extend QueryWeight and implement the new scorer > method to return BS2 or BS based on the number of required scorers and > setAllowOutOfOrder. > # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns > true/false. > #* Use it in IndexSearcher.search methods, that accept a Collector, in order > to create the appropriate Scorer, using the new QueryWeight. > #* Provide a static create method to TFC and TSDC which accept this as an > argument and creates the proper instance. > #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order > Scorer and check on the resulting Scorer isOutOfOrder(), so that we can > create the optimized Collector instance. > # Modify IndexSearcher to use all of the above logic. > The only class I'm worried about, and would like to verify with you, is > Searchable. If we want to deprecate all the search methods on IndexSearcher, > Searcher and Searchable which accept Weight and add new ones which accept > QueryWeight, we must do the following: > * Deprecate Searchable in favor of Searcher. > * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) > break back-compat and add them as abstract (like we've done with the new > Collector method) or (2) add them with a default impl to call the Weight > versions, documenting these will become abstract in 3.0. > * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend > Searcher. That's the part I'm a little bit worried about - Searchable > implements java.rmi.Remote, which means there could be an implementation out > there which implements Searchable and extends something different than > UnicastRemoteObject, like Activeable. I think there is very small chance this > has actually happened, but would like to confirm with you guys first. > * Add a deprecated, package-private, SearchableWrapper which extends Searcher > and delegates all calls to the Searchable member. 
> * Deprecate all uses of Searchable and add Searcher instead, defaulting the > old ones to use SearchableWrapper. > * Make all the necessary changes to IndexSearcher, MultiSearcher etc. > regarding overriding these new methods. > One other optimization that was discussed in LUCENE-1593 is to expose a > topScorer() API (on Weight) which returns a Scorer whose score(Collector) > will be called, and additionally add a start() method to DISI. That will > allow Scorers to initialize either on start() or score(Collector). This was > proposed mainly because of BS and BS2 which check if they are initialized in > every call to next(), skipTo() and score(). Personally I prefer to see that > in a separate issue, following that one (as it might add methods to > QueryWeight). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
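The orderness negotiation this issue proposes can be sketched with simplified stand-ins: the searcher asks the Collector whether it tolerates out-of-order docIDs and requests a matching Scorer, the way the proposed QueryWeight.scorer(reader, scoreDocsInOrder) would. None of these classes are the real Lucene ones; the scorer(boolean) factory is a hypothetical placeholder for BooleanWeight's BS/BS2 choice.

```java
public class OrdernessSketch {

    interface Scorer {
        boolean isOutOfOrder();
    }

    static abstract class Collector {
        abstract void collect(int doc);
        abstract boolean acceptsDocsOutOfOrder();
    }

    static class InOrderScorer implements Scorer {          // like BS2
        public boolean isOutOfOrder() { return false; }
    }

    static class OutOfOrderScorer implements Scorer {       // like BS
        public boolean isOutOfOrder() { return true; }
    }

    // Stand-in for QueryWeight.scorer(reader, scoreDocsInOrder): the weight
    // may return the faster out-of-order scorer only when in-order scoring
    // was not requested.
    static Scorer scorer(boolean scoreDocsInOrder) {
        return scoreDocsInOrder ? new InOrderScorer() : new OutOfOrderScorer();
    }

    // What IndexSearcher.search(collector) would do: ask for in-order docs
    // only when the collector requires them.
    static Scorer scorerFor(Collector c) {
        return scorer(!c.acceptsDocsOutOfOrder());
    }

    public static void main(String[] args) {
        Collector strict = new Collector() {
            void collect(int doc) { /* e.g. a collector relying on docID order */ }
            boolean acceptsDocsOutOfOrder() { return false; }
        };
        System.out.println(scorerFor(strict).isOutOfOrder());  // false
    }
}
```

The check cuts both ways: a collector that tolerates out-of-order docs enables the cheaper scorer, while the searcher can still inspect the resulting scorer's isOutOfOrder() to pick the matching collector variant.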
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720629#action_12720629 ] Shai Erera commented on LUCENE-1630: bq. You forgot to fill in the "?" in CHANGES I guess you're looking at the previous patch. It already has your name in the latest :) bq. How come {{Document doc(int n, FieldSelector fieldSelector) throws CorruptIndexException, IOException}} is added to Searcher.java in your patch? It's leftover from when I first deprecated Searchable - I wanted to move all the methods from Searchable to Searcher so that we don't forget that later. Will remove it. bq. Rethinking fixing Searchable now vs later Ok I will do that. Deprecate the current ones and add new ones. We need to keep the Weight-variant methods in, since someone might call it. If he doesn't extend Searcher or implement Searchable, there's no real break in back-compat for him. bq. As much as I love all the little code cleanups Apologies ... I'll try to restrain myself. That's why I didn't want to make Collector.accepts..() abstract - it would force me to touch more files, which means more code cleanups ;). I'll do my best to stop. > Mating Collector and Scorer on doc Id orderness > --- > > Key: LUCENE-1630 > URL: https://issues.apache.org/jira/browse/LUCENE-1630 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720622#action_12720622 ] Shai Erera commented on LUCENE-1630: It isn't, and that's what I expressed in the javadocs. If you plan to iterate on a Scorer, you should always ask for an in-order one, and that's what IndexSearcher does. Mike suggested above to refine that documentation to say that if you plan to call nextDoc() only, you can still ask for an out-of-order scorer. > Mating Collector and Scorer on doc Id orderness > --- > > Key: LUCENE-1630 > URL: https://issues.apache.org/jira/browse/LUCENE-1630 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720623#action_12720623 ] Michael McCandless commented on LUCENE-1630: Still working through the patch... here's what I found so far: * You forgot to fill in the "?" in CHANGES :) * Can you change the default for BooleanQuery.allowDocsOutOfOrder to true? * How come {{Document doc(int n, FieldSelector fieldSelector) throws CorruptIndexException, IOException}} is added to Searcher.java in your patch? * Rethinking fixing Searchable now vs later: first off, we've already changed the interface in 2.9 (added Collector); second off, in our changes with Fieldable we both changed our policy and the interface, in one release. Maybe we should in fact switch to QueryWeight? (I'm not sure). This patch already breaks back compat of Searcher (there are new abstract methods), anyway. * Instead of saying "there is a chance" in the javadoc in BQ, can you change it to say "BQ will return an out-of-order scorer if requested"? (There's no chance in the matter...). * In fact, DocumentsWriter very much needs for the docs to be scored in order (it breaks out of the loop on the first out-of-bounds doc). Can you put that back? * As much as I love all the little code cleanups, can you resist the temptation, especially in such large patches as this? I think a separate issue that does pure code cleanups would be a great way to fix these, going forward... * "not need anymore" --> "not needed anymore" * We can now make things final in BS2, like countingSumScorer, *Scorers, etc? > Mating Collector and Scorer on doc Id orderness > --- > > Key: LUCENE-1630 > URL: https://issues.apache.org/jira/browse/LUCENE-1630 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720619#action_12720619 ] Earwin Burrfoot commented on LUCENE-1630: - I wasn't following the issue closely, so this question might be silly - how does out-of-order scoring/collection marry with filters? If I remember right, filter/scorer intersection relies on proper orderness. > Mating Collector and Scorer on doc Id orderness > --- > > Key: LUCENE-1630 > URL: https://issues.apache.org/jira/browse/LUCENE-1630 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch > > > This is a spin off of LUCENE-1593. This issue proposes to expose appropriate > API on Scorer and Collector such that one can create an optimized Collector > based on a given Scorer's doc-id orderness and vice versa. Copied from > LUCENE-1593, here is the list of changes: > # Deprecate Weight and create QueryWeight (abstract class) with a new > scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) > method. QueryWeight implements Weight, while score(reader) calls > score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) > is defined abstract. > #* Also add QueryWeightWrapper to wrap a given Weight implementation. This > one will also be deprecated, as well as package-private. > #* Add to Query variants of createWeight and weight which return QueryWeight. > For now, I prefer to add a default impl which wraps the Weight variant > instead of overriding in all Query extensions, and in 3.0 when we remove the > Weight variants - override in all extending classes. > # Add to Scorer isOutOfOrder with a default to false, and override in BS to > true. 
> # Modify BooleanWeight to extend QueryWeight and implement the new scorer > method to return BS2 or BS based on the number of required scorers and > setAllowOutOfOrder. > # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns > true/false. > #* Use it in IndexSearcher.search methods, that accept a Collector, in order > to create the appropriate Scorer, using the new QueryWeight. > #* Provide a static create method to TFC and TSDC which accept this as an > argument and creates the proper instance. > #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order > Scorer and check on the resulting Scorer isOutOfOrder(), so that we can > create the optimized Collector instance. > # Modify IndexSearcher to use all of the above logic. > The only class I'm worried about, and would like to verify with you, is > Searchable. If we want to deprecate all the search methods on IndexSearcher, > Searcher and Searchable which accept Weight and add new ones which accept > QueryWeight, we must do the following: > * Deprecate Searchable in favor of Searcher. > * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) > break back-compat and add them as abstract (like we've done with the new > Collector method) or (2) add them with a default impl to call the Weight > versions, documenting these will become abstract in 3.0. > * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend > Searcher. That's the part I'm a little bit worried about - Searchable > implements java.rmi.Remote, which means there could be an implementation out > there which implements Searchable and extends something different than > UnicastRemoteObject, like Activeable. I think there is very small chance this > has actually happened, but would like to confirm with you guys first. > * Add a deprecated, package-private, SearchableWrapper which extends Searcher > and delegates all calls to the Searchable member. 
> * Deprecate all uses of Searchable and add Searcher instead, defaulting the > old ones to use SearchableWrapper. > * Make all the necessary changes to IndexSearcher, MultiSearcher etc. > regarding overriding these new methods. > One other optimization that was discussed in LUCENE-1593 is to expose a > topScorer() API (on Weight) which returns a Scorer that its score(Collector) > will be called, and additionally add a start() method to DISI. That will > allow Scorers to initialize either on start() or score(Collector). This was > proposed mainly because of BS and BS2 which check if they are initialized in > every call to next(), skipTo() and score(). Personally I prefer to see that > in a separate issue, following that one (as it might add methods to > QueryWeight). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. ---
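The negotiation the proposal describes - the Collector declares its tolerance for out-of-order docs, and the QueryWeight hands back a matching Scorer - can be sketched in plain Java. These are simplified stand-in classes mirroring the names in the proposal, not Lucene's actual API:

```java
// Simplified stand-ins for the classes in the proposal above -- not Lucene's
// real API. The point is the handshake: the searcher asks the Collector
// whether it tolerates out-of-order docs, then requests a Scorer accordingly.
abstract class Collector {
    abstract boolean acceptsDocsOutOfOrder();
}

abstract class Scorer {
    abstract boolean isOutOfOrder();
}

abstract class QueryWeight {
    // scoreDocsInOrder == true forces an in-order scorer;
    // false merely *allows* an out-of-order one.
    abstract Scorer scorer(boolean scoreDocsInOrder);
}

// Mimics BooleanWeight: returns the faster out-of-order scorer when allowed.
class BooleanWeightSketch extends QueryWeight {
    Scorer scorer(boolean scoreDocsInOrder) {
        final boolean outOfOrder = !scoreDocsInOrder;
        return new Scorer() {
            boolean isOutOfOrder() { return outOfOrder; }
        };
    }
}

class SearcherSketch {
    // The dispatch from the list above: ask the collector first, then
    // request a scorer whose orderness matches what it can accept.
    static Scorer scorerFor(QueryWeight weight, Collector collector) {
        return weight.scorer(!collector.acceptsDocsOutOfOrder());
    }
}
```

A collector that cannot accept out-of-order docs forces `scoreDocsInOrder == true`, so the weight never returns its out-of-order variant to it.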
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720612#action_12720612 ] Shai Erera commented on LUCENE-1630: Ok, I will change acceptsDocsOutOfOrder on Collector to abstract and implement it in all core collectors. I've already changed BooleanWeight's impl, as I wrote above: "I fixed BooleanWeight to return true if there is a chance it will return BS (i.e. there are no required clauses and <32 prohibited clauses)". I still don't think scoresOutOfOrder can live on Scorer. IndexSearcher's search methods all eventually call search(QueryWeight, Filter, Collector), which means that by that time you should already have a Collector ready (note that the user may pass in their own Collector). Therefore such a utility will not work for user-provided collectors; specifically, this method creates a Scorer for a given reader, but never a Collector (a Collector is created just once). So if we were to take your approach, the "fast search methods" would deviate from the other search methods: the others would call search(Weight, Filter, Collector), while the "fast" ones would not (since they don't have a Collector yet). This would complicate IndexSearcher's code, IMO unnecessarily. If we want to differentiate the two, I can do that w/o a helper class.
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720607#action_12720607 ] Michael McCandless commented on LUCENE-1630: {quote} bq. Can we make Collector.supportsDocsOutOfOrder abstract? Defaulting to false isn't great (I'd rather subclass think about the question). In general, I tried to avoid it since that would require changing all core Collectors. There aren't many, but still ... This goes for QueryWeight.scoresOutOfOrder - wanted to avoid changing all core Weights to impl the method w/ "return false". I actually think that many Weights/Scorers do score documents in-order, hence the default impl. {quote} OK... thinking more about it, I think having QueryWeight.scoresDocsOutOfOrder default to "false" is reasonable (I think most do in-order scoring). Also, I think the perf gains are relatively small if a QueryWeight returns "true", so, by defaulting to false we're not leaving much performance on the table. But for Collector it's a different story: the gains by allowing BooleanQuery to use its out-of-order scorer are sizable. And, I'd expect many custom Collectors would be fine with out-of-order collection. Since these are brand new classes, we have the chance to do it well. It's very much an expert thing already to make your own Collector... {quote} bq. If a given Scorer.scoresOutOfOrder returns true, does that mean nextDoc is allowed to return docs out of order? When you deal with a Scorer which returns out-of-order, you can only call scorer.score(Collector). If you're going to iterate, you're going to have to create a Scorer in-order, and that's what IndexSearcher does. I'll spell it out clearly in the javadocs. {quote} That may be a bit too strong -- eg BooleanScorer lets you nextDoc() your way through its out-of-order docs (just not advance()). Maybe state just that you can't use advance in the javadocs? {quote} bq. 
Should scoresOutOfOrder() move from QueryWeight --> Scorer? We've discussed it a few posts up. When this information is in Scorer, I should first ask for a Scorer, and only then can I create a Collector. If I use the Scorer immediately, that'll be ok. However, that's not the case in IndexSearcher, and it results in a bug in Spatial, and unless we want to uglify IndexSearcher code, it seemed that this can sit in QueryWeight. But I do think it's a problematic method in QW too, since if it returns false by default, I'll create a Collector which expects docs in-order, but then I'd lose the optimization in BooleanWeight which may return a superior out-of-order Scorer. If I return true, I'll create a Collector which expects out-of-order, and the Scorer (again, an example from BW) may actually be in-order, and I've wasted unnecessary 'if doc > topDoc' cycles. So I don't know what's better: make IndexSearcher code more complicated, or risk losing this optimization? {quote} Could we "invert" the logic in IndexSearcher that makes a collector, eg by creating a utility class that will on-demand provide a collector once told whether the docs will be in order? Basically, "curry" all the other details about the collector (sorting by score vs field, if by field whether to track scores & max score). Then inside doSearch when we finally know if the Scorer will be in-order, we ask that helper class for the collector? The first time the helper class is called, it makes the collector; subsequent times it returns the same one. There is a risk, though, if the Scorer returned for a given segment "changes its mind"... eg the first segment's scorer says the docs will be in order, and then some later segment's scorer says they will not be in order. So... that's risky. Maybe we leave it on QueryWeight, but fix BooleanWeight to return exactly the right thing? (It can be exact, right? Because we know the conditions under which BooleanWeight, if allowed to do so, would choose to return an out-of-order scorer). {quote} bq. Shouldn't Searchable cut over to QueryWeight too? (We are keeping Searchable, but allowing changes to it) I wrote that above too - I don't think we can declare and execute right in 2.9 that Searchable can be changed unexpectedly. So I added a NOTE to its javadocs and thought to do the change post 2.9, when we remove Weight. We'd be forced to change these methods to QueryWeight, and fix RemoteSearchable too. And it will be consistent w/ our back-compat policy (at least the part where we declare on an upcoming change before it happens). But if you think otherwise, I don't mind deprecating and adding new methods (I've got used to it already, I almost do it blindly ). {quote} [Sorry, I'm losing track of all the comments] OK let's defer the changes to Searchable until 3.1. Make sure you open a follow-on issue so we remember ;)
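The "invert the logic" idea in the quote above - curry the collector's configuration and build it lazily once the first scorer's orderness is known - could look roughly like this. This is a hypothetical helper; nothing by this name exists in Lucene:

```java
import java.util.function.Function;

// Hypothetical helper sketching the "curried collector" idea: all the other
// configuration (sort by score vs. field, track max score, ...) is captured
// in the factory function; the collector itself is built on the first call,
// once the scorer's orderness is known, and reused afterwards.
class OnDemandCollector<C> {
    private final Function<Boolean, C> factory; // docsInOrder -> collector
    private C collector;

    OnDemandCollector(Function<Boolean, C> factory) {
        this.factory = factory;
    }

    // Called per segment. The first call fixes the choice -- which is exactly
    // the risk noted above: a later segment's scorer may "change its mind",
    // and by then the collector has already been created.
    C get(boolean docsInOrder) {
        if (collector == null) {
            collector = factory.apply(docsInOrder);
        }
        return collector;
    }
}
```

Note how the caching is what makes the "changes its mind" scenario a real hazard: the second segment's preference is silently ignored.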
[jira] Commented: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720593#action_12720593 ] Michael McCandless commented on LUCENE-1673: bq. Want a convenience method for the user? TrieUtils.createDocumentField(...) , same as the sortField currently works. I don't think this is "convenient" enough. bq. If you'd like to have end-to-end experience for numeric fields, build something schema-like and put it in contribs +1 Long (medium?) term I'd love to get to this point; I think it'd make Lucene quite a bit more consumable. But we shouldn't sacrifice consumability today on the hope for that future nirvana. You already have a nice starting point here... is that something you could donate? {quote} bq. I do agree that retrieving a doc is already "buggy", in that various things are lost from your index time doc (a well known issue at this point!) How on earth is it buggy? You're working with an inverted index, you aren't supposed to get original document from it in the first place. It's like saying a hash function is buggy because it is not reversible. {quote} I completely agree: you're not supposed to get the original doc back. And the fact that Lucene's API now "pretends" you do, is wrong. We all agree to that, and that we need to fix Lucene. But, as things now stand, it's not yet fixed, so until it's fixed, I don't like intentionally making it worse. It'd be great to simply stop returning Document from IndexReader. Wanna make a patch? I don't think the new sheriff'd hold 2.9 for this though ;) {quote} bq. "hey how come I didn't get a NumericField back on my doc? Perhaps a good reason to not add a NumericField. {quote} I think NumericField (when building your doc) is still valuable, even if we can't return NumericField when you retrieve the doc. OK... since adding the bit to the stored fields is controversial, I think for 2.9, we should only add NumericField at indexing (document creation) time. 
So, we don't store a new bit in stored fields file and the index format is unchanged. > Move TrieRange to core > -- > > Key: LUCENE-1673 > URL: https://issues.apache.org/jira/browse/LUCENE-1673 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 2.9 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 2.9 > > Attachments: LUCENE-1673.patch, LUCENE-1673.patch, LUCENE-1673.patch > > > TrieRange was iterated many times and seems stable now (LUCENE-1470, > LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to > its default FieldTypes (SOLR-940) and if possible I want to move it to core > before release of 2.9. > Before this can be done, there are some things to think about: > # There are now classes called LongTrieRangeQuery, IntTrieRangeQuery, how > should they be called in core? I would suggest to leave it as it is. On the > other hand, if this keeps our only numeric query implementation, we could > call it LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below, here > are problems). Same for the TokenStreams and Filters. > # Maybe the pairs of classes for indexing and searching should be moved into > one class: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The > problem here: ctors must be able to pass int, long, double, float as range > parameters. For the end user, mixing these 4 types in one class is hard to > handle. If somebody forgets to add a L to a long, it suddenly instantiates a > int version of range query, hitting no results and so on. Same with other > types. Maybe accept java.lang.Number as parameter (because nullable for > half-open bounds) and one enum for the type. > # TrieUtils move into o.a.l.util? or document or? > # Move TokenStreams into o.a.l.analysis, ShiftAttribute into > o.a.l.analysis.tokenattributes? Somewhere else? > # If we rename the classes, should Solr stay with Trie (because there are > different impls)? 
> # Maybe add a subclass of AbstractField, that automatically creates these > TokenStreams and omits norms/tf per default for easier addition to Document > instances? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
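Point 2 in the list above - a single class whose constructors take java.lang.Number plus an explicit type enum, so a forgotten L suffix cannot silently select the wrong query type - could be sketched like this (illustrative names, not the eventual Lucene API):

```java
// Sketch of the java.lang.Number + enum idea from point 2 above. The enum
// states the intended type explicitly, so passing 5 where 5L was meant can
// no longer pick a different overload; null bounds model half-open ranges.
// All names here are illustrative stand-ins.
enum NumericType { INT, LONG, FLOAT, DOUBLE }

class NumericRange {
    final NumericType type;
    final Number min, max; // null == half-open bound

    private NumericRange(NumericType type, Number min, Number max) {
        this.type = type;
        this.min = min;
        this.max = max;
    }

    static NumericRange newRange(NumericType type, Number min, Number max) {
        // Both 5 and 5L arrive here as java.lang.Number; the enum, not the
        // argument's runtime class, decides how the range is interpreted.
        return new NumericRange(type, min, max);
    }

    boolean contains(Number n) {
        double v = n.doubleValue();
        return (min == null || v >= min.doubleValue())
            && (max == null || v <= max.doubleValue());
    }
}
```

The trade-off the comment raises still applies: mixing four numeric types behind one Number-typed API shifts type errors from compile time to the explicit enum argument.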
[jira] Commented: (LUCENE-1699) Field tokenStream should be usable with stored fields.
[ https://issues.apache.org/jira/browse/LUCENE-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720578#action_12720578 ] Michael McCandless commented on LUCENE-1699: Patch looks good: * Can you make sure CHANGES describes this new behavior (Field is allowed to have both a tokenStream and a String/Reader/binary value)? * The javadoc for readerValue is wrong (copy/paste from stringValue) * Can you spell out more clearly in the javadocs that even when a tokenStream value is set, one of String/Reader/binary may also be set, or, not, and if so, that "other" value is only used for stored fields. Eg, explain why one would use setTokenStream instead of setValue(TokenStream value). > Field tokenStream should be usable with stored fields. > -- > > Key: LUCENE-1699 > URL: https://issues.apache.org/jira/browse/LUCENE-1699 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Yonik Seeley >Assignee: Yonik Seeley >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1699.patch > > > Field.tokenStream should be usable for indexing even for stored values. > Useful for many types of pre-analyzed values (text/numbers, etc) > http://search.lucidimagination.com/search/document/902bad4eae20bdb8/field_tokenstreamvalue -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
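The behavior under review - a field carrying both a pre-analyzed token stream (consumed by the indexer) and a separate plain value (used only as the stored field) - can be illustrated with a small stand-in. StubField is hypothetical; the real class under discussion is Lucene's Field:

```java
import java.util.Iterator;
import java.util.List;

// Stand-in illustrating the reviewed behavior: one field object holds BOTH
// a stored value (written to the stored fields file) and a token stream
// (consumed at indexing time). Setting the token stream must not clobber
// the stored value -- that is the distinction between setTokenStream and
// setValue(TokenStream) mentioned in the review comment.
class StubField {
    private final String name;
    private final String storedValue;     // what stored-field retrieval returns
    private Iterator<String> tokenStream; // what the indexer consumes

    StubField(String name, String storedValue) {
        this.name = name;
        this.storedValue = storedValue;
    }

    // Sets only the analyzed form; the stored value is left untouched.
    void setTokenStream(List<String> tokens) {
        this.tokenStream = tokens.iterator();
    }

    String stringValue() { return storedValue; }
    Iterator<String> tokenStreamValue() { return tokenStream; }
}
```

This is useful for pre-analyzed values such as numbers: the stored form stays human-readable while the indexed tokens are the encoded form.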
[jira] Commented: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720574#action_12720574 ] Michael McCandless commented on LUCENE-1673: Note that LUCENE-1505 is open for cutting over contrib/spatial to NumericUtils. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1505) Remove NumberUtils from spatial contrib
[ https://issues.apache.org/jira/browse/LUCENE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720572#action_12720572 ] Michael McCandless commented on LUCENE-1505: LUCENE-1496 is "won't fix" because trie's NumericUtils subsumes Solr's NumberUtils, ie, we now need to migrate local lucene to NumericUtils. And we want to do this for 2.9, since local lucene is not yet released and we have the freedom to make such an otherwise drastic change to the index format. I'll update this issue to reflect its new goal. > Remove NumberUtils from spatial contrib > --- > > Key: LUCENE-1505 > URL: https://issues.apache.org/jira/browse/LUCENE-1505 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/spatial >Reporter: Ryan McKinley >Assignee: Simon Willnauer > Fix For: 2.9 > > > Currently spatial contrib includes a copy of NumberUtils from solr (otherwise > it would depend on solr) > Once LUCENE-1496 is sorted out, this copy should be removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-1505) Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils
[ https://issues.apache.org/jira/browse/LUCENE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1505: --- Fix Version/s: 2.9 Summary: Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils (was: Remove NumberUtils from spatial contrib) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720570#action_12720570 ] Michael McCandless commented on LUCENE-1692: Robert, you should probably also hold up on API conversion, since the API itself is now changing (LUCENE-1693). > Contrib analyzers need tests > > > Key: LUCENE-1692 > URL: https://issues.apache.org/jira/browse/LUCENE-1692 > Project: Lucene - Java > Issue Type: Test > Components: contrib/analyzers >Reporter: Robert Muir >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1692.txt, LUCENE-1692.txt > > > The analyzers in contrib need tests, preferably ones that test the behavior > of all the Token 'attributes' involved (offsets, type, etc) and not just what > they do with token text. > This way, they can be converted to the new api without breakage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720571#action_12720571 ] Michael Busch commented on LUCENE-1693: --- {quote} I am working on that, I have a meeting now, after that. {quote} Good luck. I'm off to bed... > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. 
This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved by avoiding multiple > AttributeImpl instances in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even more significantly from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. 
> All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
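The two reflection mechanisms the patch description relies on - the default factory that resolves an Attribute interface to a class with the 'Impl' postfix, and addAttributeImpl walking the class hierarchy for every Attribute sub-interface - can be sketched in plain Java. These are simplified stand-ins for Lucene's AttributeSource/AttributeImpl, assuming top-level classes in the same package:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-ins for Lucene's Attribute machinery, compiled as top-level classes
// in the default package so the "<InterfaceName>Impl" name lookup resolves.
interface Attribute {}
interface TermAttribute extends Attribute { String term(); }

class TermAttributeImpl implements TermAttribute {
    public String term() { return "demo"; }
}

class AttributeSourceSketch {
    final Map<Class<?>, Object> attributes = new HashMap<>();

    // (1) Default factory behavior: look for "<InterfaceName>Impl"
    // alongside the interface and instantiate it reflectively.
    @SuppressWarnings("unchecked")
    <A extends Attribute> A addAttribute(Class<A> iface) throws Exception {
        Object existing = attributes.get(iface);
        if (existing != null) return (A) existing;
        Class<?> implClass = Class.forName(iface.getName() + "Impl");
        Object impl = implClass.getDeclaredConstructor().newInstance();
        addAttributeImpl(impl);
        return (A) impl;
    }

    // (2) Walk up the impl's class hierarchy and map every implemented
    // Attribute sub-interface to this single instance.
    void addAttributeImpl(Object impl) {
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> i : c.getInterfaces()) {
                if (Attribute.class.isAssignableFrom(i) && i != Attribute.class) {
                    attributes.put(i, impl);
                }
            }
        }
    }
}
```

Because one impl can be mapped under several interfaces, a single object (like Token, which implements all six attribute interfaces) is cloned once rather than six times in captureState - which is the 20% Tee/Sink speedup described above.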
[jira] Commented: (LUCENE-1516) Integrate IndexReader with IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720569#action_12720569 ] Michael McCandless commented on LUCENE-1516: {quote} Currently we check the info for deletes, however with this patch, I think we need to check the segmentReader which could have deletes that don't show up in the info. {quote} Good catch! Can you open a new issue & attach patch? Though: how would you do this? Right now MergePolicy never receives a SegmentReader, and makes all its decisions based on the SegmentInfo. Each SegmentReader tracks its own pendingDelCount... maybe we add a private pendingDelCount to SegmentInfo, and change SegmentReader to use that instead? That'd be a single source, and then the merge policy could retrieve it... > Integrate IndexReader with IndexWriter > --- > > Key: LUCENE-1516 > URL: https://issues.apache.org/jira/browse/LUCENE-1516 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, > LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, > LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, > LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, > LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, > LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, > LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, > LUCENE-1516.patch, magnetic.png, ssd.png, ssd2.png > > Original Estimate: 672h > Remaining Estimate: 672h > > The current problem is an IndexReader and IndexWriter cannot be open > at the same time and perform updates as they both require a write > lock to the index. 
While methods such as IW.deleteDocuments enable > deleting from IW, methods such as IR.deleteDocument(int doc) and > norms updating are not available from IW. This limits the > ability to update the index dynamically or in > realtime without closing the IW and opening an IR, deleting or > updating norms, flushing, then opening the IW again, a process which > can be detrimental to realtime updates. > This patch will expose an IndexWriter.getReader method that returns > the currently flushed state of the index as a class that implements > IndexReader. The new IR implementation will differ from existing IR > implementations such as MultiSegmentReader in that flushing will > synchronize updates with IW in part by sharing the write lock. All > methods of IR will be usable including reopen and clone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
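The getReader idea described in the issue can be reduced to a toy model: the writer hands out an immutable point-in-time view of its flushed state, so searches and further updates proceed concurrently without a close/reopen cycle. This is a deliberately simplified stand-in, not Lucene's IndexWriter/IndexReader:

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of a writer exposing a point-in-time reader over its
// flushed state. All names and behavior here are simplified assumptions.
public class ToyWriter {
    private final List<String> flushed = new ArrayList<>();
    private final List<String> buffered = new ArrayList<>();

    public synchronized void addDocument(String doc) { buffered.add(doc); }

    public synchronized void flush() { flushed.addAll(buffered); buffered.clear(); }

    /** Snapshot of everything flushed so far; later updates are invisible to it. */
    public synchronized List<String> getReader() {
        flush();                      // getReader implies a flush of buffered docs
        return List.copyOf(flushed);  // immutable point-in-time view
    }
}
```

The essential property, as in the issue, is that the snapshot never changes after it is handed out; only a fresh getReader() call sees newer updates.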
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720568#action_12720568 ] Uwe Schindler commented on LUCENE-1693: --- bq. I think you should try it out and see if you run into problems. This should not be much code to write. I am working on that; I have a meeting now, and will continue after that. bq. You might have to do tricks with Tee/Sink, if the sink is wrapped by a filter with the new API, but the tee wraps a stream with the old API, or vice versa. This is currently working without any problems, but I want to add a test case that explicitly chains some dummy filters in deprecated and non-deprecated form and looks at what's coming out. But it should work. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). 
Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. 
I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to t
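The reflection walk that the description attributes to addAttributeImpl can be sketched in a few lines. This is an illustrative reimplementation under assumed names (the nested Attribute/TermAttribute interfaces here stand in for Lucene's), simplified in that it does not recurse into superinterfaces:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of addAttributeImpl: walk up the class hierarchy of the passed-in
// impl, find every implemented interface that extends Attribute, and map
// each interface to the single instance. Names are assumptions, not Lucene's.
public class AttrRegistry {
    public interface Attribute {}
    public interface TermAttribute extends Attribute { String term(); }
    public interface OffsetAttribute extends Attribute { int startOffset(); }

    private final Map<Class<? extends Attribute>, Object> attributes = new LinkedHashMap<>();

    @SuppressWarnings("unchecked")
    public void addAttributeImpl(Object impl) {
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                // register only sub-interfaces of Attribute, not Attribute itself
                if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
                    attributes.put((Class<? extends Attribute>) iface, impl);
                }
            }
        }
    }

    public <A extends Attribute> A getAttribute(Class<A> iface) {
        return iface.cast(attributes.get(iface));
    }

    /** One impl serving two attribute interfaces, the way Token serves all six. */
    public static class TokenLike implements TermAttribute, OffsetAttribute {
        public String term() { return "lucene"; }
        public int startOffset() { return 0; }
    }
}
```

After registering a single TokenLike, lookups through either interface return the same instance, which is exactly the sharing that makes per-token cloning cheaper.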
Re: Lucene 2.9 Again
On Wed, Jun 17, 2009 at 10:42 AM, Michael McCandless wrote: > I would love to see function queries consolidated between Solr and > Lucene! I think it's a prime example of duplicated and then diverged > sources between Lucene and Solr... > > And it's fabulous that you are "volunteering", Simon ;) We have > precious few volunteers that stride both communities well enough, and > have the itch, to do this. > > So I'd love to see progress made towards this but I also think > it's a little too big to hold up 2.9 for. Yeah I agree! > > The back compat requirement is certainly important, but I would assume > workable, ie it should not hold up this consolidation... I think this is a step-by-step task and it should be done with back compat in mind. I think it is not crucial to have it in 2.9, as Solr might be keen to get 1.5 Lucene releases integrated too. So it's not a big deal if it gets integrated with 3.* releases. > > Mike > > On Wed, Jun 17, 2009 at 4:27 AM, Simon > Willnauer wrote: >> On Tue, Jun 16, 2009 at 11:47 PM, Yonik >> Seeley wrote: >>> On Tue, Jun 16, 2009 at 5:38 PM, Simon >>> Willnauer wrote: I was thinking of adding a patch for https://issues.apache.org/jira/browse/LUCENE-1085 >>> >>> That's *way* too big of an issue and it breaks back compat in Solr (to >>> change from Solr's to Lucene's version - I know many people who have >>> implemented and plugged in their own functions.) >> Do you have a pointer to back compat policy in solr or is it the same >> as in Lucene?! >> >> simon >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >> >> - >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-dev-h...@lucene.apache.org >> >> > - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720562#action_12720562 ] Michael Busch commented on LUCENE-1693: --- For caching: I guess you would have to implement the wrapper's clone() method such that it returns what delegate.clone() returns. This would put a clone of the original Token (or subclass) into the cache, instead of a clone of the wrapper, which is good. Then the second clone also clones the original Token again and puts it into a second wrapper that the CachingTokenStream owns. Hmm, complicated, but it should work. Need to think more about whether all mixes of old and new TokenStreams would work... and whether this approach affects performance in any way or changes runtime behavior in corner cases... Gosh, this is like running a huge backwards-compatibility JUnit test suite in my head every time we consider a different approach. :) I think you should try it out and see if you run into problems. This should not be much code to write. You might have to do tricks with Tee/Sink, if the sink is wrapped by a filter with the new API, but the tee wraps a stream with the old API, or vice versa. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java
Re: Lucene 2.9 Again
On Tue, Jun 16, 2009 at 7:16 PM, Yonik Seeley wrote: > On Tue, Jun 16, 2009 at 6:37 PM, Mark Miller wrote: >> I've looked at the release todo wiki and I am still having nightmares. > > Indeed - it's gotten 5 times longer since the last time I did Lucene or Solr. > There are parts that aren't strictly part of the release process IMO - > things like maven seem optional. For better or worse, it gets bigger whenever someone (recently, me!) makes a silly mistake and then goes and updates the release todo ;) I do think it could use some consolidating, though... Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
On Tue, Jun 16, 2009 at 6:06 PM, Michael Busch wrote: > Cool, seems like Mark is volunteering to be the 2.9 release manager ;) Yay! Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
I would love to see function queries consolidated between Solr and Lucene! I think it's a prime example of duplicated and then diverged sources between Lucene and Solr... And it's fabulous that you are "volunteering", Simon ;) We have precious few volunteers that stride both communities well enough, and have the itch, to do this. So I'd love to see progress made towards this but I also think it's a little too big to hold up 2.9 for. The back compat requirement is certainly important, but I would assume workable, ie it should not hold up this consolidation... Mike On Wed, Jun 17, 2009 at 4:27 AM, Simon Willnauer wrote: > On Tue, Jun 16, 2009 at 11:47 PM, Yonik > Seeley wrote: >> On Tue, Jun 16, 2009 at 5:38 PM, Simon >> Willnauer wrote: >>> I was thinking of adding a patch for >>> https://issues.apache.org/jira/browse/LUCENE-1085 >> >> That's *way* too big of an issue and it breaks back compat in Solr (to >> change from Solr's to Lucene's version - I know many people who have >> implemented and plugged in their own functions.) > Do you have a pointer to back compat policy in solr or is it the same > as in Lucene?! > > simon >> >> -Yonik >> http://www.lucidimagination.com >> > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: madvise(ptr, len, MADV_SEQUENTIAL)
I think readahead would be less interesting to Lucene; while we definitely want a certain amount of readahead (to "amortize" the seeking), too much readahead means evicting things from the IO cache. OSs already do a fair job (I think) of some amount of readahead, though if we do gain posix_fadvise in Java and we use it to advise to not IO-cache those reads, I wonder how that impacts the OS's readahead... Some serious empirical testing is needed. Let the machines tell us how they work ;) Mike On Tue, Jun 16, 2009 at 11:20 PM, Jason Rutherglen wrote: > Sorry, not portable, but POSIX_FADV_WILLNEED is, which can be used with > posix_fadvise. > > On Tue, Jun 16, 2009 at 8:12 PM, Jason Rutherglen > wrote: >> >> Perhaps we'd also like to request readahead be included in JDK7? >> >> http://linux.die.net/man/2/readahead >> >> On Tue, Jun 16, 2009 at 9:03 AM, Michael McCandless >> wrote: >>> >>> Hmm... posix_fadvise lets you do this with a file descriptor; this >>> would be better for Lucene (per descriptor not per mapped region of >>> RAM) since we could "advise" independent of which FSDir impl is in >>> use... >>> >>> Mike >>> >>> On Tue, Jun 16, 2009 at 10:32 AM, Uwe Schindler wrote: >>> > But to use it, we should change MMapDirectory to also use the mapping >>> > when >>> > writing to files. I thought about it, it is very simple to implement >>> > (just >>> > copy the IndexInput and change all gets() to sets()) >>> > >>> > - >>> > Uwe Schindler >>> > H.-H.-Meier-Allee 63, D-28213 Bremen >>> > http://www.thetaphi.de >>> > eMail: u...@thetaphi.de >>> > >>> >> -Original Message- >>> >> From: Michael McCandless [mailto:luc...@mikemccandless.com] >>> >> Sent: Tuesday, June 16, 2009 4:22 PM >>> >> To: java-dev@lucene.apache.org >>> >> Cc: Alan Bateman; nio-disc...@openjdk.java.net >>> >> Subject: Re: madvise(ptr, len, MADV_SEQUENTIAL) >>> >> >>> >> Lucene could really make use of this method. 
When a segment merge >>> >> takes place, we can read & write many GB of data, which without >>> >> madvise on many OSs would effectively flush the IO cache (thus hurting >>> >> our search performance). >>> >> >>> >> Mike >>> >> >>> >> On Mon, Jun 15, 2009 at 6:01 PM, Jason >>> >> Rutherglen wrote: >>> >> > Thanks Alan. >>> >> > >>> >> > I cross posted this to the Lucene dev list where we are discussing >>> >> > using >>> >> > madvise for minimizing unnecessary IO cache usage when merging >>> >> > segments >>> >> > (where we really want the newly merged segments in the IO cache >>> >> > rather >>> >> than >>> >> > the old segment files). >>> >> > >>> >> > How would the advise method work? Would there need to be a hint in >>> >> > the >>> >> > FileChannel.map method? >>> >> > >>> >> > -J >>> >> > >>> >> > On Mon, Jun 15, 2009 at 12:36 AM, Alan Bateman >>> >> > >>> >> wrote: >>> >> >> >>> >> >> Jason Rutherglen wrote: >>> >> >>> >>> >> >>> Is there going to be a way to do this in the new Java IO APIs? >>> >> >> >>> >> >> Good question, as it has come up a few times and is needed for some >>> >> >> important use-cases. A while back I looked into adding a >>> >> >> MappedByteBuffer#advise method to allow the application provide >>> >> >> hints >>> >> on the >>> >> >> expected usage but didn't complete it. We should probably look at >>> >> >> this >>> >> again >>> >> >> for jdk7. >>> >> >> >>> >> >> -Alan. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
On Tue, Jun 16, 2009 at 11:47 PM, Yonik Seeley wrote: > On Tue, Jun 16, 2009 at 5:38 PM, Simon > Willnauer wrote: >> I was thinking of adding a patch for >> https://issues.apache.org/jira/browse/LUCENE-1085 > > That's *way* too big of an issue and it breaks back compat in Solr (to > change from Solr's to Lucene's version - I know many people who have > implemented and plugged in their own functions.) Do you have a pointer to back compat policy in solr or is it the same as in Lucene?! simon > > -Yonik > http://www.lucidimagination.com > - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720550#action_12720550 ] Uwe Schindler commented on LUCENE-1693: --- OK, I have a solution: I write a wrapper class (a reference) that implements all token attribute interfaces but passes the calls down to the wrapped Token/subclass-of-Token. Instead of cloning the token when wrapping the return value of next(), I could simply put it into the wrapper. The instance stays the same; only the delegate is different. Outside users or TokenStreams using the new API will only see one instance that implements all interfaces. (In principle the same as your backwards-compatibility thing in the DocInverter.) Would this be an idea? > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java
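The wrapper-as-reference idea proposed in the comment above can be sketched with two tiny stand-in classes. These are illustrative names, not Lucene's: the point is only that new-API consumers hold one stable instance while the delegate underneath swaps per token:

```java
// Sketch of the proposed wrapper: one stable object implements the attribute
// interface and forwards to whatever Token the old next() just returned.
// All class names here are assumptions for illustration.
public class TokenWrapperSketch {
    public interface TermAttribute { String term(); }

    /** Stand-in for the old-API Token (or a subclass of it). */
    public static class OldToken implements TermAttribute {
        private final String term;
        public OldToken(String term) { this.term = term; }
        public String term() { return term; }
    }

    /** The identity new-API code holds on to; only the delegate swaps per token. */
    public static class TokenWrapper implements TermAttribute {
        private OldToken delegate;
        public void setDelegate(OldToken t) { delegate = t; }
        public String term() { return delegate.term(); }
    }
}
```

Because the wrapper's identity never changes, downstream filters that cached the attribute reference keep working, which is exactly the property the comment is after.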
[jira] Updated: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1630: --- Attachment: LUCENE-1630.patch Fixed most of your comments Mike. I also noticed I did not document Collector.acceptsDocsOutOfOrder, so fixed that too. The remaining things we should agree on are: * deprecated Weight and add QueryWeight variants to Searchable. I prefer to do it post 2.9. * move scoresDocsOutOfOrder to Scorer instead of Weight. I fixed BooleanWeight to return true if there is a chance it will return BS (i.e. there are no required clauses and <32 prohibited clauses). I guess we'll need to discuss that one more. * Make Collector.acceptsDocsOutOfOrder and QueryWeight.scoresDocsOutOfOrder abstract - I think the default impl makes sense for most of the imps out there and the ones in core, but I don't have a strong feeling against making it abstract. All tests pass, and javadocs are good as well. > Mating Collector and Scorer on doc Id orderness > --- > > Key: LUCENE-1630 > URL: https://issues.apache.org/jira/browse/LUCENE-1630 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch > > > This is a spin off of LUCENE-1593. This issue proposes to expose appropriate > API on Scorer and Collector such that one can create an optimized Collector > based on a given Scorer's doc-id orderness and vice versa. Copied from > LUCENE-1593, here is the list of changes: > # Deprecate Weight and create QueryWeight (abstract class) with a new > scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) > method. QueryWeight implements Weight, while score(reader) calls > score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) > is defined abstract. > #* Also add QueryWeightWrapper to wrap a given Weight implementation. 
This > one will also be deprecated, as well as package-private. > #* Add to Query variants of createWeight and weight which return QueryWeight. > For now, I prefer to add a default impl which wraps the Weight variant > instead of overriding in all Query extensions, and in 3.0 when we remove the > Weight variants - override in all extending classes. > # Add to Scorer isOutOfOrder with a default to false, and override in BS to > true. > # Modify BooleanWeight to extend QueryWeight and implement the new scorer > method to return BS2 or BS based on the number of required scorers and > setAllowOutOfOrder. > # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns > true/false. > #* Use it in IndexSearcher.search methods, that accept a Collector, in order > to create the appropriate Scorer, using the new QueryWeight. > #* Provide a static create method to TFC and TSDC which accept this as an > argument and creates the proper instance. > #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order > Scorer and check on the resulting Scorer isOutOfOrder(), so that we can > create the optimized Collector instance. > # Modify IndexSearcher to use all of the above logic. > The only class I'm worried about, and would like to verify with you, is > Searchable. If we want to deprecate all the search methods on IndexSearcher, > Searcher and Searchable which accept Weight and add new ones which accept > QueryWeight, we must do the following: > * Deprecate Searchable in favor of Searcher. > * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) > break back-compat and add them as abstract (like we've done with the new > Collector method) or (2) add them with a default impl to call the Weight > versions, documenting these will become abstract in 3.0. > * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend > Searcher. 
That's the part I'm a little bit worried about - Searchable > implements java.rmi.Remote, which means there could be an implementation out > there which implements Searchable and extends something different than > UnicastRemoteObject, like Activeable. I think there is very small chance this > has actually happened, but would like to confirm with you guys first. > * Add a deprecated, package-private, SearchableWrapper which extends Searcher > and delegates all calls to the Searchable member. > * Deprecate all uses of Searchable and add Searcher instead, defaulting the > old ones to use SearchableWrapper. > * Make all the necessary changes to IndexSearcher, MultiSearcher etc. > regarding overriding these new methods. > One other optimization that was discussed in LUCENE-1593 is to expose a > topScorer() API (on Weight) which returns a Scorer that its score(Collector
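The orderness handshake described in the issue (Collector.acceptsDocsOutOfOrder deciding which scorer to build) can be sketched as follows. The names follow the issue text, but this is a simplified stand-in, not the patch's actual code:

```java
// Sketch of the searcher/collector handshake: ask the collector whether it
// tolerates out-of-order doc IDs, then pick the scorer mode accordingly.
// BooleanScorer (out-of-order, faster) vs. BooleanScorer2 (in-order) follows
// the issue's discussion; the classes here are illustrative only.
public class OrdernessSketch {
    public interface Collector {
        void collect(int doc);
        boolean acceptsDocsOutOfOrder();
    }

    /** E.g. a plain hit counter does not care about doc order. */
    public static class CountingCollector implements Collector {
        public int hits;
        public void collect(int doc) { hits++; }
        public boolean acceptsDocsOutOfOrder() { return true; }
    }

    /** The searcher side: prefer the faster out-of-order scorer when allowed. */
    public static String chooseScorer(Collector c) {
        return c.acceptsDocsOutOfOrder() ? "BooleanScorer" : "BooleanScorer2";
    }
}
```

A collector that tracks the top-N in insertion order would return false and force the in-order scorer; the default-vs-abstract debate in the comment is about whether implementors must make that choice explicitly.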
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720538#action_12720538 ] Michael Busch commented on LUCENE-1693: --- OK, what about this sentence in Token.java: {code:java} When caching a reusable token, clone it. When injecting a cached token into a stream that can be reset, clone it again. {code} This double-cloning is exactly what CachingTokenFilter and Tee/Sink do, so they preserve the actual Token class type. You can easily construct an example similar to the tool I attached that uses these streams. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. 
It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. 
for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@luce
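The double-cloning advice quoted from Token.java above can be illustrated with a minimal, self-contained sketch. The `MiniToken` and `MiniCachingFilter` classes below are hypothetical stand-ins, not the actual Lucene `Token`/`CachingTokenFilter`; they only model the pattern: clone once when caching (because the producer will reuse the instance), and clone again on each replay (because consumers may mutate what they receive).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's Token; not the real class.
class MiniToken implements Cloneable {
    String term;
    MiniToken(String term) { this.term = term; }
    @Override public MiniToken clone() { return new MiniToken(term); }
}

// Sketch of the caching pattern: clone when caching, clone again when replaying.
class MiniCachingFilter {
    private final List<MiniToken> cache = new ArrayList<>();

    void consume(MiniToken reusable) {
        // First clone: the producer will keep reusing 'reusable'.
        cache.add(reusable.clone());
    }

    List<MiniToken> replay() {
        // Second clone: consumers may freely mutate their copies.
        List<MiniToken> out = new ArrayList<>();
        for (MiniToken t : cache) out.add(t.clone());
        return out;
    }
}

public class DoubleCloneDemo {
    public static void main(String[] args) {
        MiniCachingFilter f = new MiniCachingFilter();
        MiniToken reusable = new MiniToken("foo");
        f.consume(reusable);
        reusable.term = "bar";                      // producer reuses the instance
        MiniToken replayed = f.replay().get(0);
        replayed.term = "baz";                      // consumer mutates its copy
        System.out.println(f.replay().get(0).term); // cache still holds "foo"
    }
}
```

Without either clone, the producer's reuse or the consumer's mutation would silently corrupt the cache, which is exactly what the quoted documentation warns about.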
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720534#action_12720534 ] Uwe Schindler commented on LUCENE-1693: --- Hi Michael, in principle your test is invalid: it has other token filters in the chain over which the user has no control. With the two filters mentioned it may work, because they do not change the reusableToken instance. But the API clearly states that the reusableToken must not be relied on and that another instance may be returned, so this is really unsupported behaviour. If you remove the filters in between, it works correctly. And this could even fail with 2.4 if you put other token filters in your chain. In my opinion, the advantages of token reuse clearly outweigh the small problems with (unsupported) usage. The API does exactly what is mentioned in the API docs for 2.4.1. The main advantage is that you can mix old and new filter instances and you lose nothing...
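Uwe's point about the 2.4 `next(Token reusableToken)` contract can be sketched with a small self-contained example. The `SketchToken`/`SketchFilter` classes below are hypothetical, not real Lucene classes; they only demonstrate that a filter is free to ignore the passed-in reusable instance and return its own, so any code that keys on the identity of the reusable token is unsupported.

```java
// Hypothetical stand-in types illustrating the Lucene 2.4 next(Token) contract.
class SketchToken {
    String term;
    SketchToken(String term) { this.term = term; }
}

class SketchFilter {
    private int i = 0;
    private final String[] terms = {"quick", "brown"};

    // Contract: the caller passes a reusable token, but the filter MAY
    // return a different instance; callers must not assume identity.
    SketchToken next(SketchToken reusable) {
        if (i >= terms.length) return null;
        // This (legal) implementation ignores 'reusable' entirely:
        return new SketchToken(terms[i++]);
    }
}

public class ReusableContractDemo {
    public static void main(String[] args) {
        SketchFilter f = new SketchFilter();
        SketchToken reusable = new SketchToken("");
        SketchToken returned = f.next(reusable);
        // Code that depends on 'returned == reusable' is unsupported:
        System.out.println(returned == reusable); // false here, and that's allowed
        System.out.println(returned.term);        // "quick" — always read the returned token
    }
}
```

A consumer that caches `reusable` instead of the returned token works only by accident with filters that happen to reuse the instance, which is why adding other filters to the chain can break it even on 2.4.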
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720534#action_12720534 ] Uwe Schindler edited comment on LUCENE-1693 at 6/17/09 12:39 AM: - Hi Michael, in principle your test is invalid: it has other token filters in the chain, over which the user has no control. With the two filters mentioned it may work, because they do not change the reusableToken instance. But the API clearly states that the reusableToken must not be relied on and that another instance may be returned, so this is really unsupported behaviour. If you remove the filters in between, it works correctly. And this could even fail with 2.4 if you put other token filters in your chain. In my opinion, the advantages of token reuse clearly outweigh the small problems with (unsupported) usage. The API does exactly what is mentioned in the API docs for 2.4.1. The main advantage is that you can mix old and new filter instances and you lose nothing...
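For reference, the captureState/restoreState cloning mechanism summarized in the patch description earlier in this thread can be sketched as follows. The `Mini*` classes are hypothetical stand-ins, not the actual Lucene implementation: capture deep-clones each unique attribute instance into a linked list of states, and restore copies the saved values back into the live attributes via `copyTo()`.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical minimal model of AttributeImpl with clone()/copyTo().
abstract class MiniAttr implements Cloneable {
    abstract void copyTo(MiniAttr target);
    @Override public MiniAttr clone() {
        try { return (MiniAttr) super.clone(); }
        catch (CloneNotSupportedException e) { throw new AssertionError(e); }
    }
}

class TermAttr extends MiniAttr {
    String term = "";
    @Override void copyTo(MiniAttr target) { ((TermAttr) target).term = term; }
}

// Linked list of cloned attribute values, as in AttributeSource.State.
class MiniState {
    MiniAttr attribute;
    MiniState next;
}

class MiniSource {
    final Set<MiniAttr> attrs = new LinkedHashSet<>(); // unique instances only

    MiniState captureState() {            // deep-clone each unique instance
        MiniState head = null, tail = null;
        for (MiniAttr a : attrs) {
            MiniState s = new MiniState();
            s.attribute = a.clone();
            if (head == null) head = s; else tail.next = s;
            tail = s;
        }
        return head;
    }

    void restoreState(MiniState state) {  // copy saved values back into live attrs
        for (MiniAttr a : attrs) {        // simplification: pairs by insertion order
            state.attribute.copyTo(a);
            state = state.next;
        }
    }
}

public class CaptureStateDemo {
    public static void main(String[] args) {
        MiniSource src = new MiniSource();
        TermAttr term = new TermAttr();
        src.attrs.add(term);
        term.term = "foo";
        MiniState saved = src.captureState(); // snapshot
        term.term = "bar";                    // stream moves on
        src.restoreState(saved);              // replay the snapshot
        System.out.println(term.term);        // back to "foo"
    }
}
```

The sketch pairs states with attributes by insertion order for brevity; the key idea is that only the unique `AttributeImpl` instances are cloned, so a single object implementing several attribute interfaces is cloned once, which is where the reported speedup comes from.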
Re: New Token API was Re: Payloads and TrieRangeQuery
On 6/15/09 10:10 AM, Grant Ingersoll wrote: But, as Michael M reminded me, it is complex, so please accept my apologies. No worries, Grant! I was not really offended, but rather confused... Thanks for clarifying. Michael
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720530#action_12720530 ] Michael Busch commented on LUCENE-1693: --- But I'll definitely buy Uwe a beer if he comes up with a solution that is more elegant and doesn't have the mentioned disadvantages! :)
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720529#action_12720529 ] Michael Busch commented on LUCENE-1693: --- I don't think we really mention subclassing of Token in the documentation. We also certainly don't prevent it. The tool I wrote works fine with 2.4; if you add other filters to the chain, it might not work anymore. But since we don't promise that subclassing of Token works everywhere, that's probably fine. We're deprecating the old API anyway, so we shouldn't have to introduce new stuff to fully support subclassing Token. My point here is just that this is a very complex API (even though it looks pretty simple). When I wrote the new TokenStream API patch at the end of last year, I thought about all these possibilities for making backwards compatibility more elegant. But I wanted to be certain not to break any runtime behavior or affect performance negatively. Therefore I decided not to mess with the old API, but rather to put the burden of implementing both APIs on the committers during the transition phase. I know this is somewhat annoying; on the other hand, how often do we really add new TokenFilters to the core? Often implementing incrementToken() takes 10 minutes if you already have next() implemented: just copy & paste and change a few things.
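The copy-and-adapt conversion Michael describes can be modelled with a small self-contained sketch. The classes below are hypothetical, not real Lucene filters: the old style returns a (possibly new) value per call with null at end-of-stream, while the new style mutates a shared attribute in place and returns a boolean. The loop body is essentially the same; only where the result lands changes.

```java
// Hypothetical self-contained model of the old next() vs. new incrementToken()
// styles; not the actual Lucene classes.
class CharTermAttr {
    String term = "";
}

// Old style: each call returns a result object, null at end of stream.
class OldStyleLowercaser {
    private final String[] input;
    private int i = 0;
    OldStyleLowercaser(String[] input) { this.input = input; }
    String next() {
        if (i >= input.length) return null;
        return input[i++].toLowerCase();
    }
}

// New style: one shared attribute is mutated in place; the boolean return
// signals whether a token was produced. Same loop body as next(), but the
// result lands in the attribute instead of a return value.
class NewStyleLowercaser {
    final CharTermAttr termAtt = new CharTermAttr(); // shared, reused attribute
    private final String[] input;
    private int i = 0;
    NewStyleLowercaser(String[] input) { this.input = input; }
    boolean incrementToken() {
        if (i >= input.length) return false;
        termAtt.term = input[i++].toLowerCase();
        return true;
    }
}

public class ConversionDemo {
    public static void main(String[] args) {
        NewStyleLowercaser f = new NewStyleLowercaser(new String[] {"Quick", "BROWN"});
        while (f.incrementToken()) {
            System.out.println(f.termAtt.term); // quick, then brown
        }
    }
}
```

This side-by-side shape is why the conversion is usually quick: the per-token logic is copied unchanged, and only the input/output plumbing is rewritten against the attribute.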