[jira] Updated: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1692:
--------------------------------
    Attachment: LUCENE-1692.txt

Adds tests for ThaiAnalyzer token offsets and types, both of which have bugs! Tests for the correct behavior are included but commented out.

> Contrib analyzers need tests
> ----------------------------
>
>                 Key: LUCENE-1692
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1692
>             Project: Lucene - Java
>          Issue Type: Test
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Michael McCandless
>             Fix For: 2.9
>
>         Attachments: LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt
>
>
> The analyzers in contrib need tests, preferably ones that test the behavior
> of all the Token 'attributes' involved (offsets, type, etc.) and not just
> what they do with token text. This way, they can be converted to the new
> API without breakage.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1693:
----------------------------------
    Attachment: LUCENE-1693.patch

Sorry, the last patch was invalid (it did not compile); I forgot to revert some changes before posting. The attached patch still has problems in TeeTokenStream, SinkTokenizer and CachingTokenFilter (see before), but fixes:
- double cloning of payloads
- the first of your tests now works correctly, even if I remove next() from StopFilter and/or LowercaseFilter

> AttributeSource/TokenStream API improvements
> --------------------------------------------
>
>                 Key: LUCENE-1693
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1693
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch,
> LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java,
> TestCompatibility.java
>
>
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
>
> - removes the set/getUseNewAPI() methods (including the standard ones).
>   Instead, by default incrementToken() throws a subclass of
>   UnsupportedOperationException. The indexer initially tries to call
>   incrementToken() once to see if the exception is thrown; if so, it
>   falls back to the old API.
>
> - introduces interfaces for all Attributes. The corresponding
>   implementations have the postfix 'Impl', e.g. TermAttribute and
>   TermAttributeImpl. AttributeSource now has a factory for creating the
>   Attribute instances; the default implementation looks for implementing
>   classes with the postfix 'Impl'. Token now implements all 6
>   TokenAttribute interfaces.
>
> - adds a new method to AttributeSource: addAttributeImpl(AttributeImpl).
>   Using reflection, it walks up the class hierarchy of the passed-in
>   object and finds all interfaces that the class or its superclasses
>   implement and that extend the Attribute interface. It then adds the
>   interface->instance mappings to the attribute map for each of the
>   found interfaces.
>
> - AttributeImpl now has a default implementation of toString() that uses
>   reflection to print out the values of the attributes in a default
>   formatting. This makes it a bit easier to implement AttributeImpl,
>   because toString() was declared abstract before.
>
> - Cloning is now done much more efficiently in captureState. The method
>   figures out which unique AttributeImpl instances are contained as
>   values in the attributes map, because those are the ones that need to
>   be cloned. It creates a singly linked list that supports deep cloning
>   (in the inner class AttributeSource.State). AttributeSource keeps
>   track of when this state changes, i.e. whenever new attributes are
>   added to the AttributeSource. Only in that case will captureState
>   recompute the state; otherwise it will simply clone the precomputed
>   state and return the clone. restoreState(AttributeSource.State) walks
>   the linked list and uses the copyTo() method of AttributeImpl to copy
>   all values over into the attribute that the source stream
>   (e.g. SinkTokenizer) uses.
>
> Cloning performance can be greatly improved if multiple AttributeImpl
> instances are not used in one TokenStream. A user can e.g. simply add a
> Token instance to the stream instead of the individual attributes, or
> implement a subclass of AttributeImpl that implements exactly the
> Attribute interfaces needed. I think addAttributeImpl should be
> considered an expert API, as this manual optimization is only needed if
> cloning performance is crucial. I ran some quick performance tests using
> Tee/Sink tokenizers (which do cloning) and the performance was roughly
> 20% faster with the new API. I'll run some more performance tests and
> post more numbers then.
>
> Note also that when we add serialization to the Attributes, e.g. for
> supporting storing serialized TokenStreams in the index, serialization
> should benefit even more significantly from the new API than cloning
> does.
>
> Also, the TokenStream API does not change, except for the removal of the
> set/getUseNewAPI methods, so the patches in LUCENE-1460 should still
> work.
>
> All core tests pass; however, I need to update all the documentation and
> also add some unit tests for the new AttributeSource functionality. So
> this patch is not ready to commit yet, but I wanted to post it already
> for some feedback.
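The reflection walk that addAttributeImpl performs can be sketched in plain Java, independent of Lucene. Everything below (the Attribute marker interface, TokenImpl, the method name) is an illustrative stand-in, not Lucene's actual types:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class AttributeReflectionDemo {
    // Marker interface, standing in for Lucene's Attribute interface
    interface Attribute {}

    interface TermAttribute extends Attribute { String term(); }
    interface OffsetAttribute extends Attribute { int startOffset(); }

    // One impl class may implement several Attribute interfaces at once,
    // the way Token implements all six token attributes
    static class TokenImpl implements TermAttribute, OffsetAttribute {
        public String term() { return "foo"; }
        public int startOffset() { return 0; }
    }

    /** Walk up the class hierarchy of the passed-in object and collect
     *  every interface that extends the Attribute marker interface. */
    static Set<Class<?>> findAttributeInterfaces(Object impl) {
        Set<Class<?>> found = new LinkedHashSet<>();
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
                    found.add(iface);   // one interface->instance mapping per hit
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // A single impl instance yields two interface->instance mappings
        System.out.println(findAttributeInterfaces(new TokenImpl()));
    }
}
```

A real implementation would also recurse into superinterfaces; this sketch only inspects each class's directly declared interfaces, which is enough to show the interface->instance mapping idea.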
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1693:
----------------------------------
    Attachment: (was: LUCENE-1693.patch)
Synchronizing Lucene indexes across 2 application servers
I have a web application that uses Lucene for its search functionality. Lucene search requests are served by web services sitting on 2 application servers (IIS 7). The 2 application servers are load balanced using NetScaler. Both servers run a nightly batch job that updates the search indexes on the respective server.

I need to synchronize the search indexes on these 2 servers so that at any point in time both servers have up-to-date indexes. I was wondering what the best architecture/design strategy would be to do so, given that either of the 2 application servers could be serving a search request depending on its availability. Any inputs please? Thanks for reading!

--
View this message in context: http://www.nabble.com/Synchronizing-Lucene-indexes-across-2-application-servers-tp24086961p24086961.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
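One common pattern (a sketch, not a recommendation specific to this setup) is to build the index on one node and replicate the segment files to the other, exploiting the fact that Lucene index files are write-once: a file with the same name and size can usually be assumed identical. A minimal incremental copy might look like the following; the directory layout and the size-based staleness check are simplifying assumptions:

```java
import java.io.IOException;
import java.nio.file.*;

public class IndexSyncDemo {
    /** Copy files from src that are missing or differ in size in dst.
     *  Deletion of files that no longer exist in src is omitted here. */
    static int syncDirectory(Path src, Path dst) throws IOException {
        Files.createDirectories(dst);
        int copied = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(src)) {
            for (Path f : files) {
                if (!Files.isRegularFile(f)) continue;
                Path target = dst.resolve(f.getFileName().toString());
                boolean stale = !Files.exists(target)
                        || Files.size(target) != Files.size(f);
                if (stale) {
                    Files.copy(f, target, StandardCopyOption.REPLACE_EXISTING);
                    copied++;
                }
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("idx-src");
        Path dst = Files.createTempDirectory("idx-dst");
        Files.write(src.resolve("_0.cfs"), new byte[] {1, 2, 3});
        Files.write(src.resolve("segments_2"), new byte[] {9});
        System.out.println(syncDirectory(src, dst));   // first pass copies both files
        System.out.println(syncDirectory(src, dst));   // second pass copies nothing
    }
}
```

In practice you would copy the segments file last, make sure no commit is in progress during the copy (e.g. by snapshotting the index), and reopen the searcher on the target server after the sync completes.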
[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721078#action_12721078 ]

Shai Erera commented on LUCENE-1595:
------------------------------------

bq. I still want to run some tests with the wikipedia stuff

I added readContentSource.alg just for that purpose and ran it over the Wikipedia dump. All documents were read successfully.

bq. Removed modification to core Document class

Nice! I don't know how I missed that getFields().clear() option.

> Split DocMaker into ContentSource and DocMaker
> ----------------------------------------------
>
>                 Key: LUCENE-1595
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1595
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>            Assignee: Mark Miller
>             Fix For: 2.9
>
>         Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch,
> LUCENE-1595.patch, LUCENE-1595.patch
>
>
> This issue proposes some refactoring of the benchmark package. Today,
> DocMaker has two roles: collecting documents from a collection and
> preparing a Document object. These two should actually be split up into
> ContentSource and DocMaker, where DocMaker will use a ContentSource
> instance.
>
> ContentSource will implement all the methods of DocMaker, like
> getNextDocData, raw size-in-bytes tracking etc. This can actually fit well
> with LUCENE-1591, by having a basic ContentSource that offers input stream
> services and wraps a file (for example) with bzip or gzip streams etc.
> DocMaker will implement the makeDocument methods, reusing DocState etc.
>
> The idea is that collecting the Enwiki documents, for example, should be
> the same whether I create documents using DocState, add payloads or index
> additional metadata. The same goes for the Trec and Reuters collections,
> as well as LineDocMaker.
>
> In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are
> 99% the same and 1% different. Most of their differences lie in the way
> they read the data, while most of the similarity lies in the way they
> create documents (using DocState). That led to a somewhat bizarre
> extension of LineDocMaker by EnwikiDocMaker (just for the reuse of
> DocState). Also, other DocMakers do not use that DocState today,
> something they could have gotten for free with the refactoring proposed
> here.
>
> So by having an EnwikiContentSource, ReutersContentSource and others
> (TREC, Line, Simple), I can write several DocMakers, such as
> DocStateMaker, ConfigurableDocMaker (one which accepts all kinds of
> config options) and custom DocMakers (payload, facets, sorting), passing
> them a ContentSource instance, and reuse the same DocMaking algorithm
> with many content sources, as well as the same ContentSource algorithm
> with many DocMaker implementations.
>
> This will also give us the opportunity to perf-test content sources alone
> (i.e., compare bzip, gzip and regular input streams), without the
> overhead of creating a Document object.
>
> I've already done so in my code environment (I extend the benchmark
> package for my application's purposes) and I like the flexibility I have.
> I think this can be a nice contribution to the benchmark package, which
> can result in some code cleanup as well.
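The proposed split can be sketched as two cooperating types. The names follow the issue's proposal, but the method signatures and the string "document" are illustrative guesses, not the final benchmark API:

```java
public class ContentSourceDemo {
    /** Raw content record, independent of how documents are built from it. */
    static class DocData {
        final String name, body;
        DocData(String name, String body) { this.name = name; this.body = body; }
    }

    /** Knows only how to fetch raw content (Enwiki, TREC, Reuters, a line
     *  file...). Each collection gets its own implementation. */
    interface ContentSource {
        DocData getNextDocData();
    }

    /** Knows only how to turn DocData into an indexable document; the same
     *  DocMaker can be reused with any ContentSource implementation. */
    static class DocMaker {
        private final ContentSource source;
        DocMaker(ContentSource source) { this.source = source; }
        String makeDocument() {
            DocData d = source.getNextDocData();
            return d.name + ": " + d.body;   // stands in for building a Lucene Document
        }
    }

    public static void main(String[] args) {
        // A trivial "line file" source; swapping in an Enwiki-style source
        // would not change the DocMaker at all.
        ContentSource line = () -> new DocData("doc1", "hello");
        System.out.println(new DocMaker(line).makeDocument()); // doc1: hello
    }
}
```

The point of the split is visible even at this scale: the document-building code never mentions where the content came from.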
[jira] Commented: (LUCENE-1628) Persian Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721061#action_12721061 ]

Mark Miller commented on LUCENE-1628:
-------------------------------------

Okay, fair enough. I figured you'd know better than me; I just wanted to check. Certainly if we have other code that way, there's no reason to change it here. And of course it makes sense that you would still run into issues with the comments - garbage at best. I only ever use apply to/from clipboard, so I have luckily never seen that issue :)

We should be good to put this in then - I'll wait till we get squared away with the new token API patch, then commit.

> Persian Analyzer
> ----------------
>
>                 Key: LUCENE-1628
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1628
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1628.patch, LUCENE-1628.patch
>
>
> A simple persian analyzer.
> I measured TREC scores with the benchmark package below against
> http://ece.ut.ac.ir/DBRG/Hamshahri/ :
>
> SimpleAnalyzer:
> SUMMARY
>   Search Seconds:    0.012
>   DocName Seconds:   0.020
>   Num Points:        981.015
>   Num Good Points:   33.738
>   Max Good Points:   36.185
>   Average Precision: 0.374
>   MRR:               0.667
>   Recall:            0.905
>   Precision At 1:    0.585
>   Precision At 2:    0.531
>   Precision At 3:    0.513
>   Precision At 4:    0.496
>   Precision At 5:    0.486
>   Precision At 6:    0.487
>   Precision At 7:    0.479
>   Precision At 8:    0.465
>   Precision At 9:    0.458
>   Precision At 10:   0.460
>   Precision At 11:   0.453
>   Precision At 12:   0.453
>   Precision At 13:   0.445
>   Precision At 14:   0.438
>   Precision At 15:   0.438
>   Precision At 16:   0.438
>   Precision At 17:   0.429
>   Precision At 18:   0.429
>   Precision At 19:   0.419
>   Precision At 20:   0.415
>
> PersianAnalyzer:
> SUMMARY
>   Search Seconds:    0.004
>   DocName Seconds:   0.011
>   Num Points:        987.692
>   Num Good Points:   36.123
>   Max Good Points:   36.185
>   Average Precision: 0.481
>   MRR:               0.833
>   Recall:            0.998
>   Precision At 1:    0.754
>   Precision At 2:    0.715
>   Precision At 3:    0.646
>   Precision At 4:    0.646
>   Precision At 5:    0.631
>   Precision At 6:    0.621
>   Precision At 7:    0.593
>   Precision At 8:    0.577
>   Precision At 9:    0.573
>   Precision At 10:   0.566
>   Precision At 11:   0.572
>   Precision At 12:   0.562
>   Precision At 13:   0.554
>   Precision At 14:   0.549
>   Precision At 15:   0.542
>   Precision At 16:   0.538
>   Precision At 17:   0.533
>   Precision At 18:   0.527
>   Precision At 19:   0.525
>   Precision At 20:   0.518
[jira] Commented: (LUCENE-1628) Persian Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721057#action_12721057 ]

Robert Muir commented on LUCENE-1628:
-------------------------------------

Mark: thanks for the followup on the licenses!

Wrt the non-English text, I will say that if you set the encoding to UTF-8 (such as in Eclipse under Project > Properties > Text Encoding) then things are fine. The ant build also does the right thing, and there are definitely other analyzers that behave like this too, and will break if things aren't set right.

Also, if you do not set the encoding to UTF-8, most editors (such as Eclipse) will not be able to save the file, and will error out with encoding issues... even if the text is inside a comment! Not really (ok, a little) trying to talk you out of this, but I'm just not sure it would really help anything...

That being said... (my) Eclipse still jacks up if you Team > Apply Patch from a file. If you open the patch in Notepad, Ctrl-A, Ctrl-C, and then Team > Apply Patch from the clipboard, it works fine... very annoying!
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721054#action_12721054 ]

Mark Miller commented on LUCENE-1696:
-------------------------------------

Patch looks good! I'll just hold off till the token API improvement patch is finished, just in case we need to make an adjustment here.

> Added New Token API impl for ASCIIFoldingFilter
> -----------------------------------------------
>
>                 Key: LUCENE-1696
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1696
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Mark Miller
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter._newTokenAPI.patch,
> TestGermanCollation.java
>
>
> I added an implementation of incrementToken to ASCIIFoldingFilter.java and
> extended the existing test case for it. I will attach the patch shortly.
>
> Besides this improvement, I would like to start a small discussion about
> this filter. ASCIIFoldingFilter is meant to be a replacement for
> ISOLatin1AccentFilter, which is quite nice as it covers a superset of the
> latter. I have used this filter quite often, but never on an as-is basis.
> In most cases this filter does the correct thing (replaces a special char
> with its ASCII equivalent), but in some cases, like the German umlauts, it
> does not return the expected result. A German umlaut like 'ä' should not
> translate to 'a' but rather to 'ae'. I would like to change this, but I'm
> not 100% sure that is expected by all users of this filter. Another way of
> doing it would be to make it configurable with a flag. This would not
> affect performance, as we only check whether such an umlaut char is found.
>
> Further, it would be really helpful if this filter could "inject" the
> original/unmodified token with the same position increment into the token
> stream on demand. I think it's a valid use-case to index both the modified
> and unmodified token. For instance, the German word "süd" would be folded
> to "sud". In a query q:(süd) the filter would also fold to sud and
> therefore find sud, which has a totally different meaning. Folding works
> quite well, but for these special cases we could add options to make
> users' lives easier. The latter could be done in a subclass, while the
> umlaut problem should be fixed in the base class.
>
> simon
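The umlaut behavior Simon describes can be sketched outside of Lucene. This is a toy fold function with the proposed configurable flag, plus the "inject the original token" idea; in a real TokenFilter the injected original would carry positionIncrement 0, and the class name and mappings below are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class UmlautFoldingDemo {
    // German-style expansions; plain accent-stripping maps these to one letter
    static final Map<Character, String> GERMAN = Map.of(
            'ä', "ae", 'ö', "oe", 'ü', "ue", 'ß', "ss");

    static String fold(String term, boolean germanExpansion) {
        StringBuilder sb = new StringBuilder(term.length());
        for (char c : term.toCharArray()) {
            String exp = germanExpansion ? GERMAN.get(c) : null;
            if (exp != null) sb.append(exp);        // ä -> ae, ü -> ue, ...
            else if (c == 'ä') sb.append('a');      // plain folding fallback
            else if (c == 'ö') sb.append('o');
            else if (c == 'ü') sb.append('u');
            else sb.append(c);
        }
        return sb.toString();
    }

    /** Emit the folded term and, when it differs, the original term too. */
    static List<String> foldAndInject(String term, boolean germanExpansion) {
        List<String> out = new ArrayList<>();
        String folded = fold(term, germanExpansion);
        out.add(folded);
        if (!folded.equals(term)) out.add(term);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fold("süd", false));          // sud
        System.out.println(fold("süd", true));           // sued
        System.out.println(foldAndInject("süd", true));  // [sued, süd]
    }
}
```

The flag makes the behavior change opt-in, which matches the concern in the issue: existing users of the plain 'ä' -> 'a' folding keep their current results.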
[jira] Commented: (LUCENE-1599) SpanRegexQuery and SpanNearQuery is not working with MultiSearcher
[ https://issues.apache.org/jira/browse/LUCENE-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721052#action_12721052 ]

Mark Miller commented on LUCENE-1599:
-------------------------------------

Well, yuck. SpanNearQuery does a clone() call in its rewrite method, but there is no clone() impl, so it looks like it returns a SpanNearQuery with the same clauses instance. This then appears to get tangled up with the real query, and the real query gets modified into the rewritten form during the rewrite on searchable2. I think, anyway.

I wanted to test a fix to see if that was right, but SpanNearQuery can contain any span queries, so I guess all of them might need clone() impls and we may have to clone the whole chain? A little tired to think about it at the moment ;) It looks like the issue is with the cloning in SpanNearQuery, though.

> SpanRegexQuery and SpanNearQuery is not working with MultiSearcher
> ------------------------------------------------------------------
>
>                 Key: LUCENE-1599
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1599
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>    Affects Versions: 2.4.1
>         Environment: lucene-core 2.4.1, lucene-regex 2.4.1
>            Reporter: Billow Gao
>            Assignee: Mark Miller
>             Fix For: 2.9
>
>         Attachments: TestSpanRegexBug.java
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> MultiSearcher uses:
>   queries[i] = searchables[i].rewrite(original);
> to rewrite the query and then uses combine to combine them.
> But SpanRegexQuery's rewrite is different from the others: after you call
> it on the same query, it always returns the same rewritten queries. As a
> result, only the search on the first IndexSearcher works; all the others
> use the first IndexSearcher's rewritten queries, so many terms are missing
> and unexpected results are returned.
>
> Billow
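The suspected bug, a rewrite that shares its clauses with the original query, is easy to reproduce in miniature. NearQuery below is a stand-in for SpanNearQuery, not Lucene's class:

```java
import java.util.ArrayList;
import java.util.List;

public class CloneBugDemo {
    static class NearQuery implements Cloneable {
        List<String> clauses = new ArrayList<>();

        /** Broken: Object.clone() copies the field reference, so the
         *  "clone" and the original share one clauses list. */
        NearQuery shallowClone() {
            try { return (NearQuery) super.clone(); }
            catch (CloneNotSupportedException e) { throw new AssertionError(e); }
        }

        /** Fixed: clone the container too (and, in the real case,
         *  each contained span query as well). */
        NearQuery deepClone() {
            NearQuery copy = shallowClone();
            copy.clauses = new ArrayList<>(clauses);
            return copy;
        }
    }

    public static void main(String[] args) {
        NearQuery original = new NearQuery();
        original.clauses.add("term:a");

        NearQuery rewrittenShallow = original.shallowClone();
        rewrittenShallow.clauses.set(0, "rewritten:a");   // mutates the original too!
        System.out.println(original.clauses);             // [rewritten:a]

        original.clauses.set(0, "term:a");                // reset
        NearQuery rewrittenDeep = original.deepClone();
        rewrittenDeep.clauses.set(0, "rewritten:a");
        System.out.println(original.clauses);             // [term:a]
    }
}
```

This is why the second searcher in a MultiSearcher can end up seeing the first searcher's rewritten form: the "copy" it rewrites is not actually independent.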
[jira] Commented: (LUCENE-1628) Persian Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721048#action_12721048 ]

Mark Miller commented on LUCENE-1628:
-------------------------------------

Looks pretty good. I'm not sure if we should update to the new token API here, or just commit and hit it with the other issue. I guess we might as well get it here first.

Is it better to put the raw text in there like that (in the tests), or do you think it would be better to use escaped UTF-8 codes, with maybe the raw text in a comment? I'm just remembering running into issues with such things in a past life as I moved source code around.
[jira] Commented: (LUCENE-1628) Persian Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721044#action_12721044 ]

Mark Miller commented on LUCENE-1628:
-------------------------------------

bq. mark, on the same topic: if possible, at some time it would be great to know which licenses are OK, and which ones are not.

Found it. No problem:
* Apache License 2.0
* ASL 1.1
* BSD
* MIT/X11
* NCSA
* W3C Software License
* X.Net
* zlib/libpng

With some hassle:
* CDDL 1.0
* CPL 1.0
* EPL 1.0
* IPL 1.0
* MPL 1.0 and MPL 1.1
* SPL 1.0

http://www.apache.org/legal/3party.html
[jira] Updated: (LUCENE-1695) Update the Highlighter to use the new TokenStream API
[ https://issues.apache.org/jira/browse/LUCENE-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1695:
--------------------------------
    Attachment: LUCENE-1695.patch

Pretty much done; all tests pass. It breaks back compat, but frankly, straddling doesn't seem worth the effort here, or even very possible. You can't really give new methods to use in place of the deprecated ones, and deprecating by class would be a real nuisance, as we would lose class names I'd rather keep. We have no back compat policy, and I think it's worth just pushing this to the new API. I was also thinking about breaking back compat by changing the Highlighter to use the SpanScorer, so doing it all in one shot would be nice.

The overall migration should be fairly simple once you understand the new TokenFilter API. I'll handle it for Solr. This still needs either its own changes file to explain it, or it could go in the contrib common changes file. There is a change to MemoryIndex to get around issues with the new/old API and CachingTokenFilters. I'll have to see how the new TokenFilter API improvements issue works out before doing a final patch for this.

> Update the Highlighter to use the new TokenStream API
> -----------------------------------------------------
>
>                 Key: LUCENE-1695
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1695
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>             Fix For: 2.9
>
>         Attachments: LUCENE-1695.patch, LUCENE-1695.patch
[jira] Updated: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1595: Attachment: LUCENE-1595.patch
- Added to changes a bit
- Removed modification to core Document class
- Updated deletepercent.alg to new alg changes
- Fixed a couple comment typos
- Set to use content.source.forever rather than doc.maker.forever in ExtractWikipedia#main(String[] args)
The sort algs don't work :( unrelated to this patch and related to our deprecation of the auto sort field - Ryan just hit that over in solr-land too. I still want to run some tests with the wikipedia stuff, but still waiting for that mondo file to download :) Looks pretty nice overall, thanks Shai! > Split DocMaker into ContentSource and DocMaker > -- > > Key: LUCENE-1595 > URL: https://issues.apache.org/jira/browse/LUCENE-1595 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Reporter: Shai Erera >Assignee: Mark Miller > Fix For: 2.9 > > Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch, > LUCENE-1595.patch, LUCENE-1595.patch > > > This issue proposes some refactoring to the benchmark package. Today, > DocMaker has two roles: collecting documents from a collection and preparing > a Document object. These two should actually be split up into ContentSource and > DocMaker, which will use a ContentSource instance. > ContentSource will implement all the methods of DocMaker, like > getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ > 1591, by having a basic ContentSource that offers input stream services, and > wraps a file (for example) with bzip or gzip streams etc. > DocMaker will implement the makeDocument methods, reusing DocState etc. > The idea is that collecting the Enwiki documents, for example, should be the > same whether I create documents using DocState, add payloads or index > additional metadata. 
Same goes for Trec and Reuters collections, as well as > LineDocMaker. > In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are > 99% the same and 99% different. Most of their differences lie in the way they > read the data, while most of the similarity lies in the way they create > documents (using DocState). > That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker > (just the reuse of DocState). Also, other DocMakers do not use that DocState > today, something they could have gotten for free with this proposed > refactoring. > So by having an EnwikiContentSource, ReutersContentSource and others (TREC, > Line, Simple), I can write several DocMakers, such as DocStateMaker, > ConfigurableDocMaker (one which accepts all kinds of config options) and > custom DocMakers (payload, facets, sorting), passing to them a ContentSource > instance and reuse the same DocMaking algorithm with many content sources, as > well as the same ContentSource algorithm with many DocMaker implementations. > This will also give us the opportunity to perf test content sources alone > (i.e., compare bzip, gzip and regular input streams), w/o the overhead of > creating a Document object. > I've already done so in my code environment (I extend the benchmark package > for my application's purposes) and I like the flexibility I have. I think > this can be a nice contribution to the benchmark package, which can result in > some code cleanup as well.
[jira] Commented: (LUCENE-1700) LogMergePolicy.findMergesToExpungeDeletes need to get deletes from the SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721024#action_12721024 ] Jason Rutherglen commented on LUCENE-1700: -- Taking a step back, maybe we can solve the package protected SegmentInfo issue here by creating a new class with the necessary attributes? Here's what LUCENE-1313 does: {code} SegmentReader sr = writer.readerPool.getIfExists(info); if (info.hasDeletions() || (sr != null && sr.hasDeletions())) { {code} Because SegmentInfo is package protected it seems ok to access a package protected method (or in this case variable) in IndexWriter. > LogMergePolicy.findMergesToExpungeDeletes need to get deletes from the > SegmentReader > > > Key: LUCENE-1700 > URL: https://issues.apache.org/jira/browse/LUCENE-1700 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.4.1 >Reporter: Jason Rutherglen >Priority: Trivial > Fix For: 2.9 > > Original Estimate: 48h > Remaining Estimate: 48h > > With LUCENE-1516, deletes are carried over in the SegmentReaders > which means implementations of > MergePolicy.findMergesToExpungeDeletes (such as LogMergePolicy) > need to obtain deletion info from the SR (instead of from the > SegmentInfo which won't have the information).
[jira] Created: (LUCENE-1700) LogMergePolicy.findMergesToExpungeDeletes need to get deletes from the SegmentReader
LogMergePolicy.findMergesToExpungeDeletes need to get deletes from the SegmentReader Key: LUCENE-1700 URL: https://issues.apache.org/jira/browse/LUCENE-1700 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Trivial Fix For: 2.9 With LUCENE-1516, deletes are carried over in the SegmentReaders which means implementations of MergePolicy.findMergesToExpungeDeletes (such as LogMergePolicy) need to obtain deletion info from the SR (instead of from the SegmentInfo which won't have the information).
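The check described above can be sketched in plain Java. This is only an illustration of the logic from the LUCENE-1313 snippet, not Lucene's real classes: `SegmentInfo`, `SegmentReader`, and the field/method names here are minimal hypothetical stand-ins, showing why a merge policy must also consult the pooled reader once deletes live there rather than in the SegmentInfo.

```java
// Hypothetical stand-ins for Lucene's package-private classes; the real
// SegmentInfo/SegmentReader APIs differ. This only models the decision logic.
class SegmentInfo {
    boolean hasDeletions; // deletes already flushed to disk
}

class SegmentReader {
    boolean hasDeletions; // deletes buffered in the pooled reader only
}

public class ExpungeCheck {
    // Mirrors the quoted LUCENE-1313 condition: a segment is a candidate for
    // expunging if EITHER the on-disk info OR the pooled reader has deletes.
    static boolean needsExpunge(SegmentInfo info, SegmentReader pooledReader) {
        return info.hasDeletions
            || (pooledReader != null && pooledReader.hasDeletions);
    }

    public static void main(String[] args) {
        SegmentInfo info = new SegmentInfo();   // nothing recorded on disk
        SegmentReader sr = new SegmentReader();
        sr.hasDeletions = true;                 // deletes only in the reader
        // Checking SegmentInfo alone would miss this segment:
        System.out.println(needsExpunge(info, sr));   // true
        System.out.println(needsExpunge(info, null)); // false
    }
}
```

The point of the `sr != null` guard is that a reader may not be pooled for every segment; when it is absent, only the on-disk information is available.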
[jira] Updated: (LUCENE-1313) Near Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1313: - Attachment: LUCENE-1313.patch The patch is cleaned up. A static variable IndexWriter.GLOBALNRT is added, which allows all the tests to be run with flushToRAM=true. I reran the tests which hopefully still work as intended. Tests that looked for specific file names were changed to work with NRT. Some of the tests are skipped entirely and need to be written specifically for flushToRAM. * TestIndexWriterMergePolicy,TestBackwardsCompatibility failures are expected * TestIndexWriterRAMDir.testFSDirectory fails (will be fixed) * TestThreadedOptimize ensureContiguousMerge fails. This one is a bit mysterious, perhaps the correct assertion will show where it's going wrong. I need to go through and mark the tests that can be converted to be NRT specific. > Near Realtime Search > > > Key: LUCENE-1313 > URL: https://issues.apache.org/jira/browse/LUCENE-1313 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Affects Versions: 2.4.1 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, > LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, > LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, > LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, > lucene-1313.patch, lucene-1313.patch, lucene-1313.patch > > > Enable near realtime search in Lucene without external > dependencies. When RAM NRT is enabled, the implementation adds a > RAMDirectory to IndexWriter. Flushes go to the ramdir unless > there is no available space. Merges are completed in the ram > dir until there is no more available ram. > IW.optimize and IW.commit flush the ramdir to the primary > directory, all other operations try to keep segments in ram > until there is no more space. 
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720984#action_12720984 ] Michael Busch commented on LUCENE-1693: --- Go to bed, I'll review later... in meetings now... > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, > LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, > TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. 
This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. 
> All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback.
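The captureState/restoreState scheme described above can be sketched in a few dozen lines. This is a hedged, self-contained model, not Lucene's actual implementation: the class names echo the description (AttributeImpl, AttributeSource.State), but the bodies are simplified; in particular, this sketch matches attributes back by position, where the real code is more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the cloning scheme described above: each unique
// attribute instance is snapshotted into a singly linked State list, and
// restoreState copies the captured values back via copyTo().
abstract class AttributeImpl implements Cloneable {
    public abstract void copyTo(AttributeImpl target);
    @Override public AttributeImpl clone() {
        try { return (AttributeImpl) super.clone(); }
        catch (CloneNotSupportedException e) { throw new AssertionError(e); }
    }
}

class TermAttributeImpl extends AttributeImpl {
    String term;
    @Override public void copyTo(AttributeImpl target) {
        ((TermAttributeImpl) target).term = term;
    }
}

class AttributeSource {
    // One linked-list node per unique AttributeImpl instance.
    static final class State {
        AttributeImpl attribute;
        State next;
    }

    final List<AttributeImpl> attributes = new ArrayList<>();

    // Snapshot: clone every attribute into a fresh linked list.
    State captureState() {
        State head = null, tail = null;
        for (AttributeImpl a : attributes) {
            State node = new State();
            node.attribute = a.clone();
            if (head == null) head = node; else tail.next = node;
            tail = node;
        }
        return head;
    }

    // Restore: walk the list and copy values back into the live attributes
    // (matched by position here for simplicity).
    void restoreState(State state) {
        int i = 0;
        for (State s = state; s != null; s = s.next) {
            s.attribute.copyTo(attributes.get(i++));
        }
    }
}

public class CaptureStateDemo {
    public static void main(String[] args) {
        AttributeSource src = new AttributeSource();
        TermAttributeImpl term = new TermAttributeImpl();
        src.attributes.add(term);

        term.term = "lucene";
        AttributeSource.State snapshot = src.captureState();

        term.term = "changed";      // mutate the live attribute
        src.restoreState(snapshot); // roll back from the snapshot
        System.out.println(term.term); // prints "lucene"
    }
}
```

This also makes the performance remark above concrete: if several attribute interfaces are backed by one AttributeImpl instance (e.g. a single Token), the capture loop clones one object instead of six.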
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: -- Attachment: LUCENE-1693.patch Small updates, before I go to sleep. This patch removes the incrementToken API from the three caching classes. It also fixes the double cloning of the payload in next() when the token is cloned directly. There is still one small problem: your test -- I hate it... :-( -- fails again if I remove next(Token) from StopFilter or LowerCaseFilter. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, > LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, > TestCompatibility.java > >
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720952#action_12720952 ] Uwe Schindler commented on LUCENE-1693: --- The second test does not work, because it always uses incrementToken by default. By the way, the API docs and behaviour changed with these three classes, TeeTokenFilter, SinkTokenizer and CachingTokenFilter: e.g. getTokens() does not return what is documented. For backwards compatibility we should deprecate the current versions of these classes [and only let them implement next(Token)]. They can then be used even together with the new API, but they always work on Token instances. When I remove incrementToken from them, your test passes completely. For the new API there should be new classes that use AttributeSource and restoreState to cache and so on. But for current backwards compatibility (you mentioned somebody has written a similar thing): if the user's class only uses next(Token), it will work as before. The problem is mixed implementations of the old/new API and different cache contents. This is not a problem of my proposal! Again: we should remove the double implementations everywhere. In these special cases with caches, where the cache should contain a specific class (Tokens or AttributeSource.State), two classes are needed, one deprecated. But: what do you think about my latest patch in general? 
> AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, > lucene-1693.patch, TestCompatibility.java, TestCompatibility.java > >
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720926#action_12720926 ] Uwe Schindler commented on LUCENE-1693: --- Exactly: the problem is in SinkTokenizer. When calling next(Token), the result is cast to Token, which does not work (the iterator contains either Tokens or States, depending on what was added). As SinkTokenizer and TeeTokenFilter may use different APIs, it crashes. The problem with the test is that, depending on chaining with old/new APIs, the iterator may contain the wrong type. This can be fixed by removing next(Token) (preferred) or incrementToken(). The problem is that, depending on chaining, it is not clear which method is called, and the new/old API should not share the same state information. Because the problem is related to the new/old API, we should simply remove the old API from both filters, so they share the same instances in all cases! Then we do not need the UOE. I will look into it and check why the Token in the second test is not preserved. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, > lucene-1693.patch, TestCompatibility.java, TestCompatibility.java > >
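The cast failure described in the comment above can be shown with a toy example. This is a hypothetical illustration, not SinkTokenizer's real code: the `Token` and `State` classes here are bare stand-ins, modeling one cache filled through both the old and the new API while a consumer assumes every entry is a Token.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-ins for the two cache entry types described above:
// old-API callers cache Tokens, new-API callers cache captured States.
class Token { final String text; Token(String t) { text = t; } }
class State { final Object snapshot; State(Object s) { snapshot = s; } }

public class MixedCacheDemo {
    public static void main(String[] args) {
        List<Object> cache = new ArrayList<>();
        cache.add(new Token("filled via next(Token)"));    // old API producer
        cache.add(new State("filled via incrementToken")); // new API producer

        // An old-API consumer blindly doing (Token) entry would throw a
        // ClassCastException on the second element:
        for (Object entry : cache) {
            if (!(entry instanceof Token)) {
                System.out.println("would crash on: "
                    + entry.getClass().getSimpleName());
            }
        }
    }
}
```

Keeping the filters on a single API, as proposed above, makes the cache homogeneous and removes the failure mode entirely.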
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720919#action_12720919 ] Michael Busch commented on LUCENE-1693: --- Btw: SinkTokenizer in my patch has a small bug too. I need to throw a UOE in incrementToken() if it was filled using the old API. It should probably also throw a UOE when someone tries to fill it with both old and new API streams. And the javadocs must make clear that this is not allowed. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, > lucene-1693.patch, TestCompatibility.java, TestCompatibility.java > >
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720913#action_12720913 ] Michael Busch commented on LUCENE-1693: --- You can probably fix CachingTokenFilter and tee/sink to behave correctly. But please remember that a user might have their own implementations of something like a CachingTokenFilter or tee/sink, which must keep working.
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-1693: -- Attachment: TestCompatibility.java Slightly changed tool; it yields on 2.4, and identically on trunk + my patch:
{noformat}
new tokenstream --> proper noun api
new tokenstream --> proper noun api
new tokenstream api
{noformat}
On trunk + your latest patch:
{noformat}
new tokenstream --> proper noun api
new tokenstream api
Exception in thread "main" java.lang.ClassCastException: org.apache.lucene.util.AttributeSource$State
	at org.apache.lucene.analysis.SinkTokenizer.next(SinkTokenizer.java:97)
	at org.apache.lucene.analysis.TestCompatibility.consumeStream(TestCompatibility.java:97)
	at org.apache.lucene.analysis.TestCompatibility.main(TestCompatibility.java:90)
{noformat}
It runs three tests. The first is good with your patch; the second doesn't seem to preserve the right Token subclass; the third throws a ClassCastException. I haven't debugged why...
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: -- Attachment: LUCENE-1693.patch Here is my solution: the three default methods are now optimized to take the shortest path to the iteration method implemented by the subclass. The implemented iteration methods are determined by reflection in initialize(). Cloning is now only done if next() is called directly by a consumer; in all other cases the reusableToken is used for passing the attributes around. The new TokenStream also checks in initialize() that one of the "abstract" methods is overridden. Because of this, TestIndexWriter and the inverter singleton state were updated to at least have an empty incrementToken(). Because of this check, nobody can create a TokenStream that loops indefinitely after calling next() because no pseudo-abstract method was overridden. As incrementToken() will be abstract in the future, it must always be implemented, and this is what I have done.
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720887#action_12720887 ] Michael Busch commented on LUCENE-1693: --- I'm not convinced yet that we will be able to remove the implementations of next() and next(Token). Mark, I'm not familiar with the changes you need to make to the highlighter, but you should not yet rely on next() and next(Token) no longer having to be implemented.
Re: Lucene 2.9 Again
Michael Busch wrote: Everyone who is unhappy with the release TODO's, go back in your mail archive to the 2.2 release and check how many tedious little changes we made to improve the release quality. And besides the maven stuff, there is not really more to do compared to pre-2.2, it's just documented in a more verbose (=RM-friendly) way. I didn't mean to imply anything untoward :) I'm grateful for the work you guys have put into making it all more friendly. I know I have seen many of Mike M's wiki updates on this page too, and I've always been sure it's for the better. Even still, when I look at the process, I remember why I clung to Windows for so long :) Now I'm happily on Ubuntu and can still usually avoid such "fun" work :) I'll happily soldier on though. I just wish it was all in Java :) -- - Mark http://www.lucidimagination.com
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720854#action_12720854 ] Uwe Schindler commented on LUCENE-1693: --- bq. Should I wait to put in the Highlighter update till you guys are done here? You can start with the highlighter. If this patch goes through, we can remove the next() methods from all tokenizers. For consumers like the highlighter, there will no longer be any need to switch between the old and new API. Just use the new API; it will also work with old tokenizers.
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720849#action_12720849 ] Mark Miller commented on LUCENE-1693: - Should I wait to put in the Highlighter update till you guys are done here?
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720846#action_12720846 ] Uwe Schindler commented on LUCENE-1693: --- I have a solution to build in some shortcuts: in initialize() I use reflection (see the earlier patch) to find out which of the three methods is implemented (check whether this.getClass().getMethod(name, params).getDeclaringClass() == TokenStream.class; when this is true, the method was *not* overridden). incrementToken() then checks whether next(Token) or next() is implemented and calls it directly. The same happens in the other classes. Ideally, next() should then never be called. I will post a patch later.
Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. 
I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
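Uwe's reflection trick for detecting whether a subclass actually overrides one of the three methods can be sketched as follows. The `Base` class and method names here are simplified stand-ins, not Lucene's actual `TokenStream` code:

```java
import java.lang.reflect.Method;

// Stand-in for TokenStream: provides a default incrementToken().
abstract class Base {
    public boolean incrementToken() { return false; }
}

class Overriding extends Base {
    @Override public boolean incrementToken() { return true; }
}

class Inheriting extends Base { }

public class OverrideCheck {
    /** True if clazz (or a class between it and Base) overrides incrementToken(). */
    static boolean overridesIncrementToken(Class<? extends Base> clazz) {
        try {
            Method m = clazz.getMethod("incrementToken");
            // If the declaring class is still Base, the method was NOT overridden.
            return m.getDeclaringClass() != Base.class;
        } catch (NoSuchMethodException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(overridesIncrementToken(Overriding.class));  // true
        System.out.println(overridesIncrementToken(Inheriting.class));  // false
    }
}
```

Doing this check once in an initializer is cheap; the per-token calls then dispatch directly without repeating the reflection.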
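The addAttributeImpl() reflection walk described in the issue summary might look roughly like this. The interface and class names are simplified stand-ins for the real Lucene types:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Simplified stand-ins for the org.apache.lucene.util types.
interface Attribute { }
interface TermAttribute extends Attribute { }
interface OffsetAttribute extends Attribute { }

// An impl contributing several Attribute interfaces at once, like Token does.
class MultiAttributeImpl implements TermAttribute, OffsetAttribute { }

public class AttributeWalk {
    /** Walks up the class hierarchy, collecting all implemented Attribute sub-interfaces. */
    static Set<Class<?>> attributeInterfaces(Object impl) {
        Set<Class<?>> found = new LinkedHashSet<>();
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
                    found.add(iface); // real code would map iface -> impl in the attribute map
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<Class<?>> ifaces = attributeInterfaces(new MultiAttributeImpl());
        System.out.println(ifaces.contains(TermAttribute.class));   // true
        System.out.println(ifaces.contains(OffsetAttribute.class)); // true
    }
}
```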
[jira] Updated: (LUCENE-1625) openReaderPassed not populated in CheckIndex.Status.SegmentInfoStatus
[ https://issues.apache.org/jira/browse/LUCENE-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-1625: -- Attachment: CheckIndex.patch Attached patch for exposing all collected stats (created with svn diff > CheckIndex.patch; please correct me if this is not the right way, this is my first patch). This patch breaks out the testing of field norms, terms, stored fields, and term vectors into their own methods. It also creates a status object for each of these tests to make the results transparent. This status object exposes: * stats previously only available from the infoStream * the exception thrown if the test fails (null if the test was successful) Each SegmentInfoStatus will have these status objects attached. NOTE: With this patch, if one of the above tests fails, CheckIndex will attempt to keep testing (to find all failures); any failure will still result in the overall segment being rejected. > openReaderPassed not populated in CheckIndex.Status.SegmentInfoStatus > - > > Key: LUCENE-1625 > URL: https://issues.apache.org/jira/browse/LUCENE-1625 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.4 >Reporter: Tim Smith > Attachments: CheckIndex.patch > > > When using CheckIndex programmatically, the openReaderPassed flag on the > SegmentInfoStatus is never populated (so it always comes back false). > Looking at the code, it's clear that openReaderPassed is defined but never > used. > Furthermore, it appears that not all information that is propagated to the > "InfoStream" is available via SegmentInfoStatus. > All of the following information should be gatherable from public > properties on the SegmentInfoStatus: > test: open reader.OK > test: fields, norms...OK [2 fields] > test: terms, freq, prox...OK [101 terms; 133 terms/docs pairs; 133 tokens] > test: stored fields...OK [100 total field count; avg 1 fields per doc] > test: term vectorsOK [0 total vector count; avg 0 term/freq vector > fields 
per doc] -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
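The pattern Tim describes (one status object per sub-test, recording both stats and any failure instead of aborting the whole check) can be sketched like this; the class and field names are illustrative, not the actual patch:

```java
// Illustrative per-test status holder: stats plus the failure, if any.
class SubTestStatus {
    long itemCount;      // e.g. terms or fields seen, previously only in infoStream
    Throwable error;     // null means the sub-test passed

    boolean passed() { return error == null; }
}

public class SegmentCheckSketch {
    // Each sub-test catches its own failure so the remaining tests still run.
    static SubTestStatus testTerms() {
        SubTestStatus status = new SubTestStatus();
        try {
            status.itemCount = 101;  // stand-in for real term enumeration work
        } catch (Throwable t) {
            status.error = t;        // record, don't rethrow: keep checking the segment
        }
        return status;
    }

    public static void main(String[] args) {
        SubTestStatus terms = testTerms();
        // The segment is still rejected overall if any sub-test recorded a failure.
        System.out.println(terms.passed() + " [" + terms.itemCount + " terms]");
    }
}
```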
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720822#action_12720822 ] Uwe Schindler commented on LUCENE-1693: --- I could change the calling chain: incrementToken() calls next(), which calls next(Token). Would this be better? next(Token) would by default set the delegate to the reusable token. Hmm, thinking about it: where is the degradation then? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720820#action_12720820 ] Michael Busch commented on LUCENE-1693: --- {quote} Ah I understand the problem: As I told, if a consumer (like a filter() calls next(Token) on the underlying filter), which does not implement this or implements the new API, he will get a performance decrease because of cloning. I think, we should simply test this with the benchmarker. Mixing old and new API is always a performance decrease. {quote} Yes, that's what I mean. But I think this will be almost the most common use case: I would think most users have chains that mix core streams/filters with custom filters. Also, I assume most users who need high performance switched from next() to next(Token) by now. These users will see a performance degradation, which I predict will be similar to or worse than going back to using next(), unless they implement the new API in their filters right away. So those users will see a performance hit if they just do a drop-in replacement of the lucene jar.
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720811#action_12720811 ] Uwe Schindler commented on LUCENE-1693: --- Ah, I understand the problem: as I said, if a consumer (like a filter) calls next(Token) on the underlying stream, and that stream does not implement it or implements only the new API, it will get a performance decrease because of cloning. I think we should simply test this with the benchmarker. Mixing the old and new API is always a performance decrease. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720809#action_12720809 ] Uwe Schindler commented on LUCENE-1693: --- The code is almost identical to before; the old code also copied the token to make it a full private copy. There are three modes of operation:
- if incrementToken() is implemented, the docinverter will use it (the code always calls incrementToken(), so no indirection)
- if next(Token) is implemented, the docinverter will call incrementToken(), which is forwarded to next(Token), which is cheap
- if only next() is implemented, the docinverter will call incrementToken(), which forwards to next(Token), and this forwards to next(). But this is identical to before, only one indirection more: the old code saw useNewAPI == false and called next(Token), which forwarded to next().
So for indexing using the normal indexing components (docinverter), the code never clones more than with your code. There is one other case: if you have an old consumer calling next(Token) while the tokenizer only implements incrementToken(), then you will get a performance degradation. But this is not the indexing case; it is e.g. reusing the tokenizer in a very old QueryParser. I did not find a good way to delegate this special case directly to incrementToken(). The problem is also that incrementToken() uses the internal buffer and not the supplied buffer.
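The delegation Uwe outlines can be sketched as a chain of default implementations, where only the lowest overridden method does real work. All names here are simplified stand-ins for the actual TokenStream classes:

```java
// Simplified stand-in for the Token class.
class Token {
    String term;
    Token() { }
    Token(String term) { this.term = term; }
}

// Stand-in base class wiring the old and new APIs together.
abstract class LegacyBridgeStream {
    // New API: by default forwards down to the old reusable-token API.
    public boolean incrementToken() {
        return next(new Token()) != null;
    }
    // Old reusable-token API: by default forwards to the oldest API.
    public Token next(Token reusableToken) {
        return next();
    }
    // Oldest API: by default signals end-of-stream.
    public Token next() {
        return null;
    }
}

// A stream that only overrides the oldest next(); newer entry points still work.
class OldStyleStream extends LegacyBridgeStream {
    private int emitted = 0;
    @Override public Token next() {
        return emitted++ < 2 ? new Token("t" + emitted) : null;
    }
}

public class ChainDemo {
    public static void main(String[] args) {
        OldStyleStream s = new OldStyleStream();
        System.out.println(s.incrementToken()); // true  (forwarded down to next())
        System.out.println(s.incrementToken()); // true
        System.out.println(s.incrementToken()); // false (end of stream)
    }
}
```

With the reflection shortcut from the earlier comment, the base class could skip the intermediate hops and call the one overridden method directly.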
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720794#action_12720794 ] Michael Busch commented on LUCENE-1693: --- I'm looking at TokenStream.next(): {code:java} public Token next(final Token reusableToken) throws IOException { // We don't actually use reusableToken, but still add this assert assert reusableToken != null; checkTokenWrapper(); return next(); } /** Returns the next token in the stream, or null at EOS. * @deprecated The returned Token is a "full private copy" (not * re-used across calls to next()) but will be slower * than calling {@link #next(Token)} instead. */ public Token next() throws IOException { checkTokenWrapper(); if (incrementToken()) { final Token token = (Token) tokenWrapper.delegate.clone(); Payload p = token.getPayload(); if (p != null) { token.setPayload((Payload) p.clone()); } return token; } return null; } {code} This seems like a big performance hit for users of the old API, no? Now every single Token will be cloned, even if they implement next(Token), as soon as they have one filter in the chain that doesn't implement the new API yet.
Re: Lucene 2.9 Again
+1 Michael On 6/17/09 10:32 AM, Mark Miller wrote: Michael Busch wrote: We should just not put more items in the 2.9 list anymore (except bug fixes of course) and then fix the 30 issues and don't rush them too much. If it takes until end of July I think that's acceptable. A good quality of the release should be highest priority in my opinion. Michael I agree. Our approach so far has not been to rush the issues that are outstanding, but to pressure a move to 3.1 if you don't think you can finish it reasonably soon. I'd expect the committers to stick with their normal standards for committing code, and I plan to as well. On the other hand, it's also probably not a great idea for a bunch of huge changes to hit trunk right before release with no time to go through dev use. So I still think that, unless it's an important issue for 2.9 specifically, if you can't finish it by fairly early July-ish, you should push to 3.1. - Mark - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
Michael Busch wrote: We should just not put more items in the 2.9 list anymore (except bug fixes of course) and then fix the 30 issues and don't rush them too much. If it takes until end of July I think that's acceptable. A good quality of the release should be highest priority in my opinion. Michael I agree. Our approach so far has not been to rush the issues that are outstanding, but to pressure a move to 3.1 if you don't think you can finish it reasonably soon. I'd expect the committers to stick with their normal standards for committing code, and I plan to as well. On the other hand, it's also probably not a great idea for a bunch of huge changes to hit trunk right before release with no time to go through dev use. So I still think that, unless it's an important issue for 2.9 specifically, if you can't finish it by fairly early July-ish, you should push to 3.1. - Mark - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
That means the release frequency should not exceed the new-committer frequency. :) On 6/17/09 10:09 AM, Mark Miller wrote: Michael Busch wrote: One?!? I did 2.2, 2.3, 2.3.1, 2.3.2! What can you do ... there was no new guy to relieve you :) - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
We should just not put more items in the 2.9 list anymore (except bug fixes of course) and then fix the 30 issues and don't rush them too much. If it takes until end of July I think that's acceptable. A good quality of the release should be highest priority in my opinion. Michael On 6/17/09 10:09 AM, Mark Miller wrote: Michael Busch wrote: wanted to get 2.9 out really really soon. really, really is probably not totally accurate. I just know how things can get drawn out. Even still, we have 30-some issues to resolve. If we don't make a drive though, when will 2.9 come out? Next fall at the earliest? Later? So much goodness to give to the users out there already. And Java 1.5 waiting for us. And removing all of these deprecations. We don't have to release tomorrow, but let's get this out there! - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
Michael Busch wrote: wanted to get 2.9 out really really soon. really, really is probably not totally accurate. I just know how things can get drawn out. Even still, we have 30-some issues to resolve. If we don't make a drive though, when will 2.9 come out? Next fall at the earliest? Later? So much goodness to give to the users out there already. And Java 1.5 waiting for us. And removing all of these deprecations. We don't have to release tomorrow, but let's get this out there! -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
Michael Busch wrote: One?!? I did 2.2, 2.3, 2.3.1, 2.3.2! What can you do ... there was no new guy to relieve you :) -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
On 6/17/09 6:23 AM, Mark Miller wrote: I have a special gift in not being clear. I was just saying "be prepared, your turn is coming ;) " But I haven't done a release myself - we don't release that often despite discussion that we should release more often every year or so. I did notice though, that Mike did the release right after joining, and Michael did a release right after joining, and so ... looks like I am next in line followed by you. One?!? I did 2.2, 2.3, 2.3.1, 2.3.2! Everyone who is unhappy with the release TODOs, go back in your mail archive to the 2.2 release and check how many tedious little changes we made to improve the release quality. And besides the maven stuff, there is not really more to do compared to pre-2.2, it's just documented in a more verbose (=RM-friendly) way. The maven stuff is also pretty simple... just for signing the artifacts I hacked a tool, because that gets tedious otherwise. When we're at that point I can try to dig it up... I think Mike has such a tool too. Michael - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
I'm happy to hear that :) I suggested 2-3 weeks to prevent the schedule from being even tighter, as it sounded like you guys wanted to get 2.9 out really really soon. I'm really busy the rest of June and will have much more time for Lucene in July. So if we could wait until the end of July before we do the code freeze, and get 2.9 out early August, that'd mean much less sleep deprivation for me! And the likelihood that I'll get all my stuff in would be much higher... Michael On 6/17/09 5:43 AM, Michael McCandless wrote: On Tue, Jun 16, 2009 at 6:06 PM, Michael Busch wrote: How soon is soon? Code freeze in 2-3 weeks or so maybe? Then 7-10 days testing, so 2.9 should be out mid July? Sounds reasonable? This schedule might be tight for me... I'm "on vacation" for the week starting Jun 29. Hopefully I can get most of my issues done before then, but that's a week and a half left at this point :) Mike
[jira] Commented: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720696#action_12720696 ] Robert Muir commented on LUCENE-1692: - Michael, ok. I know additional tests here (against the old api) might be more code to convert, but I think it will actually make the process easier, whenever that is or whatever is involved. I have some time this evening to try to improve the coverage here (against the old api). > Contrib analyzers need tests > > > Key: LUCENE-1692 > URL: https://issues.apache.org/jira/browse/LUCENE-1692 > Project: Lucene - Java > Issue Type: Test > Components: contrib/analyzers >Reporter: Robert Muir >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1692.txt, LUCENE-1692.txt > > > The analyzers in contrib need tests, preferably ones that test the behavior > of all the Token 'attributes' involved (offsets, type, etc) and not just what > they do with token text. > This way, they can be converted to the new api without breakage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720692#action_12720692 ] Shai Erera commented on LUCENE-1693: You can run tokenize.alg which invokes the ReadTokenTask, which iterates on a TokenStream. You'll probably need to modify the .alg file to create a different analyzer/token stream each time, and I think this can be done by the "rounds" syntax in benchmark. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. 
> - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. 
> Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback.
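The reflection walk that addAttributeImpl is described as doing above can be sketched in plain Java without any Lucene dependency. The Attribute marker interface and the TokenLike class below are simplified, hypothetical stand-ins for the real API, and unlike the real implementation this sketch does not also recurse into super-interfaces of the interfaces it finds.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AttributeWalkDemo {
    // Simplified stand-ins for Lucene's Attribute marker interface and
    // two of the token attribute interfaces (hypothetical, for illustration).
    interface Attribute {}
    interface TermAttribute extends Attribute {}
    interface OffsetAttribute extends Attribute {}

    // A Token-like class implementing several attribute interfaces at once.
    static class TokenLike implements TermAttribute, OffsetAttribute {}

    // Walk up the class hierarchy of the passed-in object and map every
    // implemented interface extending Attribute to the single instance,
    // mirroring the interface->instance mapping described in the issue.
    static Map<Class<?>, Object> mapInterfaces(Object impl) {
        Map<Class<?>, Object> attributes = new LinkedHashMap<>();
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
                    attributes.put(iface, impl);
                }
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        Map<Class<?>, Object> m = mapInterfaces(new TokenLike());
        System.out.println(m.size());
        // One shared instance now backs both attribute interfaces:
        System.out.println(m.get(TermAttribute.class) == m.get(OffsetAttribute.class));
    }
}
```

A consumer asking for either interface then gets the same shared instance, which is why registering a single Token-like implementation keeps captureState down to one clone instead of one per attribute.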
Re: Lucene 2.9 Again
Let's not forget Nutch... Also, for that matter, Mahout uses Lucene's Analysis and Core (in fact, I just committed MAHOUT-126, which allows one to create Vectors from a Lucene index!), though only as consumers; I doubt there is a need for Mahout committers to change Lucene. On Jun 17, 2009, at 10:04 AM, Michael McCandless wrote: I agree. I'm picturing some hopefully-not-that-distant future when we have a queries "module" and analysis "module" that live quite separately from Lucene & Solr's "core", and committers from both Solr and Lucene would work on it. Mike On Wed, Jun 17, 2009 at 9:01 AM, Grant Ingersoll wrote: On Jun 17, 2009, at 4:42 AM, Michael McCandless wrote: I would love to see function queries consolidated between Solr and Lucene! I think it's a prime example of duplicated and then diverged sources between Lucene and Solr... The primary reason it's diverged is that it gets a lot of attention in Solr and near zero in Lucene. You rarely see someone on java-user ask about function queries. In Solr, it's a regular solution to many problems. So, just like the analysis problem, it strikes me as one of those areas where, if it is going to be done and maintained, Solr committers need write access. -Grant -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: Lucene 2.9 Again
On Jun 17, 2009, at 10:11 AM, Yonik Seeley wrote: On Wed, Jun 17, 2009 at 8:57 AM, Grant Ingersoll wrote: On Jun 16, 2009, at 7:16 PM, Yonik Seeley wrote: There are parts that aren't strictly part of the release process IMO - things like maven seem optional. -1. Maven support is not optional. I can't always follow Lucene closely, but I'm pretty sure it never became mandatory in Solr, and it's never been a part of any kind of ASF release requirements. It's nice if the release manager feels like doing it... but it also seems like it can be done after the fact (for maven or other release mechanisms) by those who care more about those. It's pretty much the only way I consume Lucene and Solr anymore... So, yeah, I'll make sure it happens. In Solr and Lucene, generating the artifacts is automatic anyway. The only manual part is copying them up to the server. I think people can handle doing an scp.
[jira] Updated: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1673: -- Attachment: LUCENE-1673.patch Here is some intermediate update... > Move TrieRange to core > -- > > Key: LUCENE-1673 > URL: https://issues.apache.org/jira/browse/LUCENE-1673 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 2.9 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 2.9 > > Attachments: LUCENE-1673.patch, LUCENE-1673.patch, LUCENE-1673.patch, > LUCENE-1673.patch > > > TrieRange was iterated many times and seems stable now (LUCENE-1470, > LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to > its default FieldTypes (SOLR-940) and if possible I want to move it to core > before release of 2.9. > Before this can be done, there are some things to think about: > # There are now classes called LongTrieRangeQuery, IntTrieRangeQuery; how > should they be called in core? I would suggest leaving it as it is. On the > other hand, if this remains our only numeric query implementation, we could > call it LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below, here > are problems). Same for the TokenStreams and Filters. > # Maybe the pairs of classes for indexing and searching should be moved into > one class: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The > problem here: ctors must be able to pass int, long, double, float as range > parameters. For the end user, mixing these 4 types in one class is hard to > handle. If somebody forgets to add an L to a long, it suddenly instantiates an > int version of range query, hitting no results and so on. Same with other > types. Maybe accept java.lang.Number as parameter (because nullable for > half-open bounds) and one enum for the type. > # TrieUtils move into o.a.l.util? or document or? 
> # Move TokenStreams into o.a.l.analysis, ShiftAttribute into > o.a.l.analysis.tokenattributes? Somewhere else? > # If we rename the classes, should Solr stay with Trie (because there are > different impls)? > # Maybe add a subclass of AbstractField, that automatically creates these > TokenStreams and omits norms/tf per default for easier addition to Document > instances?
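The pitfall Uwe describes in point 2 above, where a forgotten L suffix silently selects the int variant, is ordinary Java overload resolution. The overloaded newRange factory below is a hypothetical stand-in (not a real Lucene method) used only to show the mechanism.

```java
public class OverloadPitfallDemo {
    // Hypothetical overloaded factory standing in for a combined numeric
    // range query class with constructors for each primitive type.
    static String newRange(int lower, int upper)   { return "int range";  }
    static String newRange(long lower, long upper) { return "long range"; }

    public static void main(String[] args) {
        // The caller intends a long range but forgets the L suffix,
        // so overload resolution silently picks the int variant:
        System.out.println(newRange(1000000, 2000000));   // prints "int range"
        // With the suffix, the intended overload is chosen:
        System.out.println(newRange(1000000L, 2000000L)); // prints "long range"
    }
}
```

Because an int-typed query would then be run against terms indexed with the long trie encoding, the mismatch hits no results without any compile-time warning, which is why the issue suggests accepting java.lang.Number plus an explicit type enum instead.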
Re: Lucene 2.9 Again
On Wed, Jun 17, 2009 at 8:57 AM, Grant Ingersoll wrote: > On Jun 16, 2009, at 7:16 PM, Yonik Seeley wrote: >> There are parts that aren't strictly part of the release process IMO - >> things like maven seem optional. > > -1. Maven support is not optional. I can't always follow Lucene closely, but I'm pretty sure it never became mandatory in Solr, and it's never been a part of any kind of ASF release requirements. It's nice if the release manager feels like doing it... but it also seems like it can be done after the fact (for maven or other release mechanisms) by those who care more about those. -Yonik http://www.lucidimagination.com
Re: Lucene 2.9 Again
I agree. I'm picturing some hopefully-not-that-distant future when we have a queries "module" and analysis "module" that live quite separately from Lucene & Solr's "core", and committers from both Solr and Lucene would work on it. Mike On Wed, Jun 17, 2009 at 9:01 AM, Grant Ingersoll wrote: > > On Jun 17, 2009, at 4:42 AM, Michael McCandless wrote: > >> I would love to see function queries consolidated between Solr and >> Lucene! I think it's a prime example of duplicated and then diverged >> sources between Lucene and Solr... > > The primary reason it's diverged is it gets a lot of attention on Solr and > near zero in Lucene. You rarely see someone on java-user ask about function > queries. In Solr, it's a regular solution to many problems. So, just like > the analysis problem, it strikes me as one of those areas that if it is > going to be done, and maintained, then Solr committers need write access. > > -Grant
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720676#action_12720676 ] Uwe Schindler commented on LUCENE-1693: --- Hi Michael, I did not do any performance tests until now; I think you have the better knowledge about measuring tokenization performance. Important would be to compare the performance of: - Old API with useNewAPI=true - Old API with useNewAPI=false - My impl with defaults (onlyUseNewAPI=false) - My impl with onlyUseNewAPI=true For all tests, you should only use conformant streams (e.g. from core). A good additional test would be to create a chain that has completely implemented incrementToken() and one only supplying next() for some chain entries. Is this hard to do?
[jira] Updated: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1630: --- Attachment: LUCENE-1630.patch * Collector's acceptsDocsOutOfOrder is abstract - this was a really good change, since I completely forgot to override it in all home-brewed Collectors to return true where applicable. I was also surprised to see that <5 collectors actually should return false (most of them in tests). * I added QueryWeight variants to Searchable and implemented them in RemoteSearchable. * Mike - I'm afraid I did some more code cleanup (not much though) - that was before I saw your last comment, sorry. * Handled the rest of the latest comments. All tests pass. > Mating Collector and Scorer on doc Id orderness > --- > > Key: LUCENE-1630 > URL: https://issues.apache.org/jira/browse/LUCENE-1630 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch, > LUCENE-1630.patch > > > This is a spin off of LUCENE-1593. This issue proposes to expose appropriate > API on Scorer and Collector such that one can create an optimized Collector > based on a given Scorer's doc-id orderness and vice versa. Copied from > LUCENE-1593, here is the list of changes: > # Deprecate Weight and create QueryWeight (abstract class) with a new > scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) > method. QueryWeight implements Weight, while score(reader) calls > score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) > is defined abstract. > #* Also add QueryWeightWrapper to wrap a given Weight implementation. This > one will also be deprecated, as well as package-private. > #* Add to Query variants of createWeight and weight which return QueryWeight. 
> For now, I prefer to add a default impl which wraps the Weight variant > instead of overriding in all Query extensions, and in 3.0 when we remove the > Weight variants - override in all extending classes. > # Add to Scorer isOutOfOrder with a default to false, and override in BS to > true. > # Modify BooleanWeight to extend QueryWeight and implement the new scorer > method to return BS2 or BS based on the number of required scorers and > setAllowOutOfOrder. > # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns > true/false. > #* Use it in IndexSearcher.search methods, that accept a Collector, in order > to create the appropriate Scorer, using the new QueryWeight. > #* Provide a static create method to TFC and TSDC which accept this as an > argument and creates the proper instance. > #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order > Scorer and check on the resulting Scorer isOutOfOrder(), so that we can > create the optimized Collector instance. > # Modify IndexSearcher to use all of the above logic. > The only class I'm worried about, and would like to verify with you, is > Searchable. If we want to deprecate all the search methods on IndexSearcher, > Searcher and Searchable which accept Weight and add new ones which accept > QueryWeight, we must do the following: > * Deprecate Searchable in favor of Searcher. > * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) > break back-compat and add them as abstract (like we've done with the new > Collector method) or (2) add them with a default impl to call the Weight > versions, documenting these will become abstract in 3.0. > * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend > Searcher. 
That's the part I'm a little bit worried about - Searchable > implements java.rmi.Remote, which means there could be an implementation out > there which implements Searchable and extends something different than > UnicastRemoteObject, like Activatable. I think there is a very small chance this > has actually happened, but would like to confirm with you guys first. > * Add a deprecated, package-private, SearchableWrapper which extends Searcher > and delegates all calls to the Searchable member. > * Deprecate all uses of Searchable and add Searcher instead, defaulting the > old ones to use SearchableWrapper. > * Make all the necessary changes to IndexSearcher, MultiSearcher etc. > regarding overriding these new methods. > One other optimization that was discussed in LUCENE-1593 is to expose a > topScorer() API (on Weight) which returns a Scorer whose score(Collector) > will be called, and additionally add a start() method to DISI. That will > allow Scorers to initialize either on start() or score(Collector). This was > proposed mainly because of BS and BS2, which check if they are initialized
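The matching step the issue describes (ask for an out-of-order scorer, then pick the collector variant based on what the scorer reports) can be sketched with simplified stand-in types. These interfaces and classes are illustrations of the proposal, not the actual Lucene classes.

```java
public class OrdernessDemo {
    // Simplified stand-ins for the proposed Scorer/Collector orderness hooks.
    interface Scorer { boolean isOutOfOrder(); }
    interface Collector { boolean acceptsDocsOutOfOrder(); }

    // Two collector variants, as with the in-order and out-of-order
    // flavors of TFC/TSDC that the issue proposes static create methods for.
    static class InOrderCollector implements Collector {
        public boolean acceptsDocsOutOfOrder() { return false; }
    }
    static class OutOfOrderCollector implements Collector {
        public boolean acceptsDocsOutOfOrder() { return true; }
    }

    // IndexSearcher-side logic: having asked for an out-of-order scorer,
    // create the optimized collector the resulting scorer can actually feed.
    static Collector createCollector(Scorer scorer) {
        return scorer.isOutOfOrder() ? new OutOfOrderCollector() : new InOrderCollector();
    }

    public static void main(String[] args) {
        // e.g. BooleanScorer delivers docs out of order, BooleanScorer2 in order:
        System.out.println(createCollector(() -> true).getClass().getSimpleName());
        System.out.println(createCollector(() -> false).getClass().getSimpleName());
    }
}
```

The point of the pairing is that an out-of-order-tolerant collector can skip the docID comparison needed to break score ties, so the searcher should only fall back to the stricter in-order variant when the scorer forces it.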
Re: Lucene 2.9 Again
I have a special gift in not being clear. I was just saying "be prepared, your turn is coming ;)" But I haven't done a release myself - we don't release that often, despite discussion every year or so that we should release more often. I did notice, though, that Mike did the release right after joining, and Michael did a release right after joining, and so ... looks like I am next in line, followed by you. I'd be happy to split some of the work if it's possible, though - then perhaps we can both get our feet wet without having the full load of that wiki. I'm up for either way. Looks like we have some time to work it out. - Mark Uwe Schindler wrote: Uwe Schindler wrote: Maybe Mark helps me and I can do it alone the next time, if I have to? :-) Tag team effort? It will be my first release too, so that would be great! Ah ok, I interpreted your mail differently yesterday (but it was 1 or 2 am in Germany...). Uwe -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 15, 2009, at 2:11 PM, Grant Ingersoll wrote: More questions: 1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor? 2. Same for TermVectors. What if the user specifies with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit TF stuff) 3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it. Also, can we add something like an AttributeTermQuery? Seems like it could work similarly to the BoostingTermQuery. So, I think I see #1 covered; how about #2, #3 and the notion of an AttributeTermQuery? Anyone have thoughts on those? I might have some time next week to work up a Query, as it sounds like fun, but don't hold me to it just yet.
RE: Lucene 2.9 Again
> Uwe Schindler wrote: > > Maybe Mark helps me and I can do > > it alone the next time, if I have to? :-) > > > Tag team effort? It will be my first release too, so that would be great! Ah ok, I interpreted your mail differently yesterday (but it was 1 or 2 am in Germany...). Uwe
Re: Lucene 2.9 Again
On Jun 17, 2009, at 4:42 AM, Michael McCandless wrote: I would love to see function queries consolidated between Solr and Lucene! I think it's a prime example of duplicated and then diverged sources between Lucene and Solr... The primary reason it's diverged is it gets a lot of attention on Solr and near zero in Lucene. You rarely see someone on java-user ask about function queries. In Solr, it's a regular solution to many problems. So, just like the analysis problem, it strikes me as one of those areas that if it is going to be done, and maintained, then Solr committers need write access. -Grant
RE: Lucene 2.9 Again
> On Jun 16, 2009, at 7:16 PM, Yonik Seeley wrote: > > > On Tue, Jun 16, 2009 at 6:37 PM, Mark Miller > > wrote: > > There are parts that aren't strictly part of the release process IMO - > > things like maven seem optional. > > -1. Maven support is not optional. > > +1 for more automation. For the record, once setup, Maven (as opposed > to Ant) release (i.e. on Mahout http://cwiki.apache.org/MAHOUT/how-to- > release.html) > consists of far fewer steps. The only manual ones after one-time > setup are the announcements and the copy from staging to release (and > even that, I think, can be done better using Nexus). Note, I'm not > voting to change to Maven, just saying there is room for automation. Please no maven! :( Uwe
Re: Lucene 2.9 Again
Uwe Schindler wrote: > Maybe Mark helps me and I can do it alone the next time, if I have to? :-) Tag team effort? It will be my first release too, so that would be great! -- - Mark http://www.lucidimagination.com
Re: Lucene 2.9 Again
On Jun 16, 2009, at 7:16 PM, Yonik Seeley wrote: On Tue, Jun 16, 2009 at 6:37 PM, Mark Miller wrote: There are parts that aren't strictly part of the release process IMO - things like maven seem optional. -1. Maven support is not optional. +1 for more automation. For the record, once set up, a Maven (as opposed to Ant) release (i.e. on Mahout http://cwiki.apache.org/MAHOUT/how-to-release.html) consists of far fewer steps. The only manual ones after one-time setup are the announcements and the copy from staging to release (and even that, I think, can be done better using Nexus). Note, I'm not voting to change to Maven, just saying there is room for automation. -Grant
RE: Lucene 2.9 Again
I also tend to a little bit later; maybe we need more discussions about NumericField and NumericSortField, especially between the two factions, Mike vs. Yonik :-) After finishing the TokenStream simplification and optimization, I will now again start rewriting the javadocs for trie, and hopefully I can commit in a day-or-two(TM). Maybe start RCs in second quarter of July? Maybe Mark helps me and I can do it alone the next time, if I have to? :-) - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Michael McCandless [mailto:luc...@mikemccandless.com] > Sent: Wednesday, June 17, 2009 2:43 PM > To: java-dev@lucene.apache.org > Subject: Re: Lucene 2.9 Again > > On Tue, Jun 16, 2009 at 6:06 PM, Michael Busch wrote: > > > How soon is soon? Code freeze in 2-3 weeks or so maybe? Then 7-10 days > > testing, so 2.9 should be out mid July? Sounds reasonable? > > This schedule might be tight for me... I'm "on vacation" for the week > starting Jun 29. Hopefully I can get most of my issues done before then, > but that's a week and a half left at this point :) > > Mike
Re: Lucene 2.9 Again
On Tue, Jun 16, 2009 at 6:06 PM, Michael Busch wrote: > How soon is soon? Code freeze in 2-3 weeks or so maybe? Then 7-10 days > testing, so 2.9 should be out mid July? Sounds reasonable? This schedule might be tight for me... I'm "on vacation" for the week starting Jun 29. Hopefully I can get most of my issues done before then, but that's a week and a half left at this point :) Mike
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: -- Attachment: (was: LUCENE-1693.patch) > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. 
> - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if multiple > AttributeImpl instances are not used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even more significantly from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. 
> All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
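The reflection walk that the patch describes for addAttributeImpl() can be sketched in plain Java. Note this is a hedged, self-contained illustration: the Attribute/TokenImpl types below are minimal hypothetical stand-ins, not the real Lucene classes, and for brevity the sketch only inspects each class's direct interfaces rather than the full superinterface closure.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical minimal stand-ins for Lucene's Attribute marker interface
// and an AttributeImpl-style class, to illustrate the reflection walk.
public class AttributeReflectionSketch {
    interface Attribute {}                        // marker interface
    interface TermAttribute extends Attribute {}
    interface OffsetAttribute extends Attribute {}

    // One impl class implementing several Attribute interfaces (like Token).
    static class TokenImpl implements TermAttribute, OffsetAttribute, Cloneable {}

    // Walk up the class hierarchy and collect every directly implemented
    // interface that extends Attribute -- the idea behind addAttributeImpl().
    @SuppressWarnings("unchecked")
    static Set<Class<? extends Attribute>> findAttributeInterfaces(Class<?> clazz) {
        Set<Class<? extends Attribute>> found = new HashSet<>();
        for (Class<?> c = clazz; c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
                    found.add((Class<? extends Attribute>) iface);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Register one shared instance under every Attribute interface it
        // implements, mirroring the interface->instance map in the patch.
        TokenImpl impl = new TokenImpl();
        Map<Class<? extends Attribute>, Object> attributes = new HashMap<>();
        for (Class<? extends Attribute> iface : findAttributeInterfaces(impl.getClass())) {
            attributes.put(iface, impl);  // same instance for every interface
        }
        System.out.println(attributes.size());  // 2: TermAttribute, OffsetAttribute
    }
}
```

Cloneable is skipped because it does not extend the Attribute marker; only interfaces reachable through Attribute end up in the map.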
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: -- Attachment: LUCENE-1693.patch Sorry, small bug in cloning inside next(): the POSToken test was failing again. But now it also works correctly. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
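The captureState()/restoreState() scheme described in this thread can be sketched as a hypothetical miniature: unique attribute impls are chained into a singly linked list (AttributeSource.State in the patch) that supports deep cloning, and restoring copies the captured values back into the live attributes. These classes are simplified stand-ins, not the real Lucene ones; in particular, this sketch restores by walking the two lists in parallel and assumes the state came from the same source, rather than matching impls by class.

```java
import java.util.ArrayList;
import java.util.List;

public class StateSketch {

    // Simplified AttributeImpl: cloneable, with copyTo() as in the patch.
    static abstract class AttributeImpl implements Cloneable {
        @Override public AttributeImpl clone() {
            try {
                return (AttributeImpl) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
        abstract void copyTo(AttributeImpl target);
    }

    static class TermAttributeImpl extends AttributeImpl {
        String term;
        @Override void copyTo(AttributeImpl target) {
            ((TermAttributeImpl) target).term = term;
        }
    }

    // One node per unique attribute impl, like AttributeSource.State.
    static final class State implements Cloneable {
        AttributeImpl attribute;
        State next;
        @Override public State clone() {
            State s = new State();
            s.attribute = attribute.clone();          // deep-clone this node
            if (next != null) s.next = next.clone();  // and the rest of the list
            return s;
        }
    }

    private final List<AttributeImpl> impls = new ArrayList<>();
    private State currentState;  // cached; invalidated when attributes change

    void addAttributeImpl(AttributeImpl impl) {
        impls.add(impl);
        currentState = null;  // state changed, must be recomputed
    }

    private State computeState() {
        State head = null;
        for (int i = impls.size() - 1; i >= 0; i--) {
            State s = new State();
            s.attribute = impls.get(i);  // nodes reference the *live* impls
            s.next = head;
            head = s;
        }
        return head;
    }

    // Clone the cached linked list instead of walking a map of attributes.
    State captureState() {
        if (currentState == null) currentState = computeState();
        return currentState.clone();
    }

    // Walk both lists in parallel, copying captured values into live impls.
    void restoreState(State captured) {
        if (currentState == null) currentState = computeState();
        for (State c = captured, l = currentState; c != null && l != null;
             c = c.next, l = l.next) {
            c.attribute.copyTo(l.attribute);
        }
    }

    public static void main(String[] args) {
        StateSketch source = new StateSketch();
        TermAttributeImpl term = new TermAttributeImpl();
        source.addAttributeImpl(term);
        term.term = "hello";
        State saved = source.captureState();  // deep copy holding "hello"
        term.term = "world";                  // mutate the live attribute
        source.restoreState(saved);           // captured value copied back
        System.out.println(term.term);        // prints "hello"
    }
}
```

The caching is the point of the design: only when attributes are added does the list get rebuilt; otherwise capture is a single clone of a precomputed chain.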
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: -- Attachment: LUCENE-1693.patch Attached is a new patch that implements the last idea: - There is no more copying of Tokens, so the API should have (almost) the same speed as before. - By default, TokenStreams/TokenFilters using the old and new APIs can be mixed freely in a chain (a test that explicitly verifies this is still missing). The drawback is that there is only *one* attribute instance, called TokenWrapper (package private), that manages the exchange of the underlying Token instance. - If the user knows that all tokenizers in his JVM implement incrementToken() and do not fall back to next(), he can increase speed with the static setter setOnlyUseNewAPI(true). In this case, no TokenWrapper is initialized and the code uses the normal Attribute factory to generate the attributes. If some old code is still in the chain, or your consumer calls next(), you will get an UnsupportedOperationException during tokenization. The same happens if you override initialize() and instantiate your attributes manually without super.initialize(). - When the old API is removed, TokenWrapper and large parts inside TokenStream can be removed and incrementToken() made abstract. This is identical to setting onlyUseNewAPI to true. - The API setting can only be static, because the attribute instances are generated during construction of the streams, so a later downgrade to TokenWrapper is not possible. Documentation inside this patch requires that at least all core tokenizers and consumers be conformant, so one must be able to set TokenStream.setOnlyUseNewAPI(true) and then use StandardAnalyzer without any problem. When contrib is transformed, we can extend this to contrib. Because the code wraps the old API completely, all converted streams can be changed to implement only incrementToken() using attributes. Super's TokenStream.next() and next(Token) manage the rest. 
There is no speed degradation from this, and it is safe to remove (and all will be happy)! Uwe > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
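The backward-compatibility idea described in this thread, where the base class serves the old next() API on top of the new incrementToken() so converted streams implement only the new method, can be sketched with tiny hypothetical stand-ins. These are not the real Lucene 2.9 classes, and the TokenWrapper indirection and attribute factory are deliberately omitted.

```java
public class CompatSketch {

    static class Token {
        String term;
    }

    static abstract class TokenStream {
        // Stand-in for the single wrapped Token that TokenWrapper manages.
        final Token sharedToken = new Token();

        // New API: subclass fills sharedToken and reports whether a token exists.
        public abstract boolean incrementToken();

        // Old API, implemented once in the base class on top of the new one,
        // like the patch's TokenStream.next()/next(Token) "manage the rest".
        public final Token next() {
            return incrementToken() ? sharedToken : null;
        }
    }

    static class WhitespaceTokenizer extends TokenStream {
        private final String[] words;
        private int index = 0;

        WhitespaceTokenizer(String text) {
            words = text.split("\\s+");
        }

        @Override
        public boolean incrementToken() {
            if (index >= words.length) return false;
            sharedToken.term = words[index++];  // reuse the shared instance
            return true;
        }
    }

    public static void main(String[] args) {
        // An old-API consumer keeps working against a new-API-only tokenizer.
        TokenStream stream = new WhitespaceTokenizer("hello new api");
        for (Token t = stream.next(); t != null; t = stream.next()) {
            System.out.println(t.term);  // hello, new, api
        }
    }
}
```

Because next() is defined once in the base class, a converted stream carries no old-API code of its own, which is what makes the eventual removal of the old API cheap.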
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720632#action_12720632 ] Shai Erera commented on LUCENE-1630: {quote} You forgot to fill in the "?" in CHANGES I guess you're looking at the previous patch. It already has your name in the latest {quote} Sorry, you're right - there are two sections in CHANGES which I've added text to, and I put your name in the second one only. > Mating Collector and Scorer on doc Id orderness > --- > > Key: LUCENE-1630 > URL: https://issues.apache.org/jira/browse/LUCENE-1630 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch > > > This is a spin off of LUCENE-1593. This issue proposes to expose appropriate > API on Scorer and Collector such that one can create an optimized Collector > based on a given Scorer's doc-id orderness and vice versa. Copied from > LUCENE-1593, here is the list of changes: > # Deprecate Weight and create QueryWeight (abstract class) with a new > scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) > method. QueryWeight implements Weight, while score(reader) calls > score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) > is defined abstract. > #* Also add QueryWeightWrapper to wrap a given Weight implementation. This > one will also be deprecated, as well as package-private. > #* Add to Query variants of createWeight and weight which return QueryWeight. > For now, I prefer to add a default impl which wraps the Weight variant > instead of overriding in all Query extensions, and in 3.0 when we remove the > Weight variants - override in all extending classes. > # Add to Scorer isOutOfOrder with a default to false, and override in BS to > true. 
> # Modify BooleanWeight to extend QueryWeight and implement the new scorer > method to return BS2 or BS based on the number of required scorers and > setAllowOutOfOrder. > # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns > true/false. > #* Use it in IndexSearcher.search methods, that accept a Collector, in order > to create the appropriate Scorer, using the new QueryWeight. > #* Provide a static create method to TFC and TSDC which accept this as an > argument and creates the proper instance. > #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order > Scorer and check on the resulting Scorer isOutOfOrder(), so that we can > create the optimized Collector instance. > # Modify IndexSearcher to use all of the above logic. > The only class I'm worried about, and would like to verify with you, is > Searchable. If we want to deprecate all the search methods on IndexSearcher, > Searcher and Searchable which accept Weight and add new ones which accept > QueryWeight, we must do the following: > * Deprecate Searchable in favor of Searcher. > * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) > break back-compat and add them as abstract (like we've done with the new > Collector method) or (2) add them with a default impl to call the Weight > versions, documenting these will become abstract in 3.0. > * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend > Searcher. That's the part I'm a little bit worried about - Searchable > implements java.rmi.Remote, which means there could be an implementation out > there which implements Searchable and extends something different than > UnicastRemoteObject, like Activeable. I think there is very small chance this > has actually happened, but would like to confirm with you guys first. > * Add a deprecated, package-private, SearchableWrapper which extends Searcher > and delegates all calls to the Searchable member. 
> * Deprecate all uses of Searchable and add Searcher instead, defaulting the > old ones to use SearchableWrapper. > * Make all the necessary changes to IndexSearcher, MultiSearcher etc. > regarding overriding these new methods. > One other optimization that was discussed in LUCENE-1593 is to expose a > topScorer() API (on Weight) which returns a Scorer whose score(Collector) > will be called, and additionally add a start() method to DISI. That will > allow Scorers to initialize either on start() or score(Collector). This was > proposed mainly because of BS and BS2 which check if they are initialized in > every call to next(), skipTo() and score(). Personally I prefer to see that > in a separate issue, following that one (as it might add methods to > QueryWeight). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
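The orderness negotiation this issue proposes can be sketched with simplified stand-ins: the searcher asks the Collector whether it tolerates out-of-order docIDs and requests a matching Scorer, the way the proposed QueryWeight.scorer(reader, scoreDocsInOrder) would. None of these classes are the real Lucene ones; the scorer(boolean) factory is a hypothetical placeholder for BooleanWeight's BS/BS2 choice.

```java
public class OrdernessSketch {

    interface Scorer {
        boolean isOutOfOrder();
    }

    static abstract class Collector {
        abstract void collect(int doc);
        abstract boolean acceptsDocsOutOfOrder();
    }

    static class InOrderScorer implements Scorer {          // like BS2
        public boolean isOutOfOrder() { return false; }
    }

    static class OutOfOrderScorer implements Scorer {       // like BS
        public boolean isOutOfOrder() { return true; }
    }

    // Stand-in for QueryWeight.scorer(reader, scoreDocsInOrder): the weight
    // may return the faster out-of-order scorer only when in-order scoring
    // was not requested.
    static Scorer scorer(boolean scoreDocsInOrder) {
        return scoreDocsInOrder ? new InOrderScorer() : new OutOfOrderScorer();
    }

    // What IndexSearcher.search(collector) would do: ask for in-order docs
    // only when the collector requires them.
    static Scorer scorerFor(Collector c) {
        return scorer(!c.acceptsDocsOutOfOrder());
    }

    public static void main(String[] args) {
        Collector strict = new Collector() {
            void collect(int doc) { /* e.g. a collector relying on docID order */ }
            boolean acceptsDocsOutOfOrder() { return false; }
        };
        System.out.println(scorerFor(strict).isOutOfOrder());  // false
    }
}
```

The check cuts both ways: a collector that tolerates out-of-order docs enables the cheaper scorer, while the searcher can still inspect the resulting scorer's isOutOfOrder() to pick the matching collector variant.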
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720629#action_12720629 ] Shai Erera commented on LUCENE-1630: bq. You forgot to fill in the "?" in CHANGES I guess you're looking at the previous patch. It already has your name in the latest :) bq. How come {{Document doc(int n, FieldSelector fieldSelector) throws CorruptIndexException, IOException}} is added to Searcher.java in your patch? It's leftover from when I first deprecated Searchable - I wanted to move all the methods from Searchable to Searcher so that we don't forget that later. Will remove it. bq. Rethinking fixing Searchable now vs later Ok I will do that. Deprecate the current ones and add new ones. We need to keep the Weight-variant methods in, since someone might call it. If he doesn't extend Searcher or implement Searchable, there's no real break in back-compat for him. bq. As much as I love all the little code cleanups Apologies ... I'll try to restrain myself. That's why I didn't want to make Collector.accepts..() abstract - it would force me to touch more files, which means more code cleanups ;). I'll do my best to stop. > Mating Collector and Scorer on doc Id orderness > --- > > Key: LUCENE-1630 > URL: https://issues.apache.org/jira/browse/LUCENE-1630 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720622#action_12720622 ] Shai Erera commented on LUCENE-1630: It isn't, and that's what I expressed in the javadocs. If you plan to iterate on a Scorer, you should always ask for an in-order one, and that's what IndexSearcher does. Mike suggested above to refine that documentation to say that if you plan to call nextDoc() only, you can still ask for an out-of-order scorer. > Mating Collector and Scorer on doc Id orderness > --- > > Key: LUCENE-1630 > URL: https://issues.apache.org/jira/browse/LUCENE-1630 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720623#action_12720623 ] Michael McCandless commented on LUCENE-1630: Still working through the patch... here's what I found so far: * You forgot to fill in the "?" in CHANGES :) * Can you change the default for BooleanQuery.allowDocsOutOfOrder to true? * How come {{Document doc(int n, FieldSelector fieldSelector) throws CorruptIndexException, IOException}} is added to Searcher.java in your patch? * Rethinking fixing Searchable now vs later: first off, we've already changed the interface in 2.9 (added Collector); second off, in our changes with Fieldable we both changed our policy and the interface, in one release. Maybe we should in fact switch to QueryWeight? (I'm not sure). This patch already breaks back compat of Searcher (there are new abstract methods), anyway. * Instead of saying "there is a chance" in the javadoc in BQ, can you change it to say "BQ will return an out-of-order scorer if requested"? (There's no chance in the matter...). * In fact, DocumentsWriter very much needs for the docs to be scored in order (it breaks out of the loop on the first out-of-bounds doc). Can you put that back? * As much as I love all the little code cleanups, can you resist the temptation, especially in such large patches as this? I think a separate issue that does pure code cleanups would be a great way to fix these, going forward... * "not need anymore" --> "not needed anymore" * We can now make things final in BS2, like countingSumScorer, *Scorers, etc? > Mating Collector and Scorer on doc Id orderness > --- > > Key: LUCENE-1630 > URL: https://issues.apache.org/jira/browse/LUCENE-1630 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720619#action_12720619 ] Earwin Burrfoot commented on LUCENE-1630: - I wasn't following the issue closely, so this question might be silly - how does out-of-order scoring/collection marry with filters? If I remember right, filter/scorer intersection relies on proper orderness. > Mating Collector and Scorer on doc Id orderness > --- > > Key: LUCENE-1630 > URL: https://issues.apache.org/jira/browse/LUCENE-1630 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch > > > This is a spin off of LUCENE-1593. This issue proposes to expose appropriate > API on Scorer and Collector such that one can create an optimized Collector > based on a given Scorer's doc-id orderness and vice versa. Copied from > LUCENE-1593, here is the list of changes: > # Deprecate Weight and create QueryWeight (abstract class) with a new > scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) > method. QueryWeight implements Weight, while score(reader) calls > score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) > is defined abstract. > #* Also add QueryWeightWrapper to wrap a given Weight implementation. This > one will also be deprecated, as well as package-private. > #* Add to Query variants of createWeight and weight which return QueryWeight. > For now, I prefer to add a default impl which wraps the Weight variant > instead of overriding in all Query extensions, and in 3.0 when we remove the > Weight variants - override in all extending classes. > # Add to Scorer isOutOfOrder with a default to false, and override in BS to > true. 
> # Modify BooleanWeight to extend QueryWeight and implement the new scorer > method to return BS2 or BS based on the number of required scorers and > setAllowOutOfOrder. > # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns > true/false. > #* Use it in IndexSearcher.search methods, that accept a Collector, in order > to create the appropriate Scorer, using the new QueryWeight. > #* Provide a static create method to TFC and TSDC which accept this as an > argument and creates the proper instance. > #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order > Scorer and check on the resulting Scorer isOutOfOrder(), so that we can > create the optimized Collector instance. > # Modify IndexSearcher to use all of the above logic. > The only class I'm worried about, and would like to verify with you, is > Searchable. If we want to deprecate all the search methods on IndexSearcher, > Searcher and Searchable which accept Weight and add new ones which accept > QueryWeight, we must do the following: > * Deprecate Searchable in favor of Searcher. > * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) > break back-compat and add them as abstract (like we've done with the new > Collector method) or (2) add them with a default impl to call the Weight > versions, documenting these will become abstract in 3.0. > * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend > Searcher. That's the part I'm a little bit worried about - Searchable > implements java.rmi.Remote, which means there could be an implementation out > there which implements Searchable and extends something different than > UnicastRemoteObject, like Activeable. I think there is very small chance this > has actually happened, but would like to confirm with you guys first. > * Add a deprecated, package-private, SearchableWrapper which extends Searcher > and delegates all calls to the Searchable member. 
> * Deprecate all uses of Searchable and add Searcher instead, defaulting the > old ones to use SearchableWrapper. > * Make all the necessary changes to IndexSearcher, MultiSearcher etc. > regarding overriding these new methods. > One other optimization that was discussed in LUCENE-1593 is to expose a > topScorer() API (on Weight) which returns a Scorer that its score(Collector) > will be called, and additionally add a start() method to DISI. That will > allow Scorers to initialize either on start() or score(Collector). This was > proposed mainly because of BS and BS2 which check if they are initialized in > every call to next(), skipTo() and score(). Personally I prefer to see that > in a separate issue, following that one (as it might add methods to > QueryWeight). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. ---
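The negotiation the proposal describes - the Collector declares its tolerance for out-of-order docs, and the QueryWeight hands back a matching Scorer - can be sketched in plain Java. These are simplified stand-in classes mirroring the names in the proposal, not Lucene's actual API:

```java
// Simplified stand-ins for the classes in the proposal above -- not Lucene's
// real API. The point is the handshake: the searcher asks the Collector
// whether it tolerates out-of-order docs, then requests a Scorer accordingly.
abstract class Collector {
    abstract boolean acceptsDocsOutOfOrder();
}

abstract class Scorer {
    abstract boolean isOutOfOrder();
}

abstract class QueryWeight {
    // scoreDocsInOrder == true forces an in-order scorer;
    // false merely *allows* an out-of-order one.
    abstract Scorer scorer(boolean scoreDocsInOrder);
}

// Mimics BooleanWeight: returns the faster out-of-order scorer when allowed.
class BooleanWeightSketch extends QueryWeight {
    Scorer scorer(boolean scoreDocsInOrder) {
        final boolean outOfOrder = !scoreDocsInOrder;
        return new Scorer() {
            boolean isOutOfOrder() { return outOfOrder; }
        };
    }
}

class SearcherSketch {
    // The dispatch from the list above: ask the collector first, then
    // request a scorer whose orderness matches what it can accept.
    static Scorer scorerFor(QueryWeight weight, Collector collector) {
        return weight.scorer(!collector.acceptsDocsOutOfOrder());
    }
}
```

A collector that cannot accept out-of-order docs forces `scoreDocsInOrder == true`, so the weight never returns its out-of-order variant to it.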
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720612#action_12720612 ] Shai Erera commented on LUCENE-1630: Ok, I will change acceptsDocsOutOfOrder on Collector to abstract and implement it in all core collectors. I've already changed BooleanWeight's impl, as I wrote above: "I fixed BooleanWeight to return true if there is a chance it will return BS (i.e. there are no required clauses and <32 prohibited clauses)". I still don't think scoresOutOfOrder can live on Scorer. IndexSearcher's search methods all eventually call search(QueryWeight, Filter, Collector), which means that by that time you should already have a Collector ready (note that the user may pass in their own Collector). Therefore such a utility will not work for user-provided collectors; specifically, this method creates a Scorer for a given reader, but never a Collector (a Collector is created just once). So if we were to take your approach, the "fast search methods" would deviate from the other search methods: the others would call search(Weight, Filter, Collector), while the "fast" ones would not (since they don't have a Collector yet). This would complicate IndexSearcher's code, IMO unnecessarily. If we want to differentiate the two, I can do that w/o a helper class.
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720607#action_12720607 ] Michael McCandless commented on LUCENE-1630: {quote} bq. Can we make Collector.supportsDocsOutOfOrder abstract? Defaulting to false isn't great (I'd rather subclass think about the question). In general, I tried to avoid it since that would require changing all core Collectors. There aren't many, but still ... This goes for QueryWeight.scoresOutOfOrder - wanted to avoid changing all core Weights to impl the method w/ "return false". I actually think that many Weights/Scorers do score documents in-order, hence the default impl. {quote} OK... thinking more about it, I think having QueryWeight.scoresDocsOutOfOrder default to "false" is reasonable (I think most do in-order scoring). Also, I think the perf gains are relatively small if a QueryWeight returns "true", so, by defaulting to false we're not leaving much performance on the table. But for Collector it's a different story: the gains by allowing BooleanQuery to use its out-of-order scorer are sizable. And, I'd expect many custom Collectors would be fine with out-of-order collection. Since these are brand new classes, we have the chance to do it well. It's very much an expert thing already to make your own Collector... {quote} bq. If a given Scorer.scoresOutOfOrder returns true, does that mean nextDoc is allowed to return docs out of order? When you deal with a Scorer which returns out-of-order, you can only call scorer.score(Collector). If you're going to iterate, you're going to have to create a Scorer in-order, and that's what IndexSearcher does. I'll spell it out clearly in the javadocs. {quote} That may be a bit too strong -- eg BooleanScorer lets you nextDoc() your way through its out-of-order docs (just not advance()). Maybe state just that you can't use advance in the javadocs? {quote} bq. 
Should scoresOutOfOrder() move from QueryWeight --> Scorer? We've discussed it a few posts up. When this information is in Scorer, I should first ask for a Scorer, and only then can I create a Collector. If I use the Scorer immediately, that'll be ok. However, that's not the case in IndexSearcher, and it results in a bug in Spatial, and unless we want to uglify IndexSearcher code, it seemed that this can sit in QueryWeight. But I do think it's a problematic method in QW too, since if it returns false by default, I'll create a Collector which expects docs in-order, but then I'd lose the optimization in BooleanWeight which may return a superior out-of-order Scorer. If I return true, I'll create a Collector which expects out-of-order, and the Scorer (again, an example from BW) may actually be in-order, and I've wasted unnecessary 'if doc > topDoc' cycles. So I don't know what's better: make IndexSearcher code more complicated, or risk losing this optimization? {quote} Could we "invert" the logic in IndexSearcher that makes a collector, eg by creating a utility class that will on-demand provide a collector once told whether the docs will be in order? Basically, "curry" all the other details about the collector (sorting by score vs field, if by field whether to track scores & max score). Then inside doSearch when we finally know if the Scorer will be in-order, we ask that helper class for the collector? The first time the helper class is called, it makes the collector; subsequent times it returns the same one. There is a risk, though, if the Scorer returned for a given segment "changes its mind"... eg the first segment's scorer says the docs will be in order, and then some later segment's scorer says they will not be in order. So... that's risky. Maybe we leave it on QueryWeight, but fix BooleanWeight to return exactly the right thing? (It can be exact, right? Because we know the conditions under which BooleanWeight, if allowed to do so, would choose to return an out-of-order scorer). {quote} bq. Shouldn't Searchable cut over to QueryWeight too? (We are keeping Searchable, but allowing changes to it) I wrote that above too - I don't think we can declare and execute right in 2.9 that Searchable can be changed unexpectedly. So I added a NOTE to its javadocs and thought to do the change post 2.9, when we remove Weight. We'd be forced to change these methods to QueryWeight, and fix RemoteSearchable too. And it will be consistent w/ our back-compat policy (at least the part where we declare on an upcoming change before it happens). But if you think otherwise, I don't mind deprecating and adding new methods (I've got used to it already, I almost do it blindly ). {quote} [Sorry, I'm losing track of all the comments] OK let's defer the changes to Searchable until 3.1. Make sure you open a follow-on issue so we remember ;)
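The "invert the logic" idea in the quote above - curry the collector's configuration and build it lazily once the first scorer's orderness is known - could look roughly like this. This is a hypothetical helper; nothing by this name exists in Lucene:

```java
import java.util.function.Function;

// Hypothetical helper sketching the "curried collector" idea: all the other
// configuration (sort by score vs. field, track max score, ...) is captured
// in the factory function; the collector itself is built on the first call,
// once the scorer's orderness is known, and reused afterwards.
class OnDemandCollector<C> {
    private final Function<Boolean, C> factory; // docsInOrder -> collector
    private C collector;

    OnDemandCollector(Function<Boolean, C> factory) {
        this.factory = factory;
    }

    // Called per segment. The first call fixes the choice -- which is exactly
    // the risk noted above: a later segment's scorer may "change its mind",
    // and by then the collector has already been created.
    C get(boolean docsInOrder) {
        if (collector == null) {
            collector = factory.apply(docsInOrder);
        }
        return collector;
    }
}
```

Note how the caching is what makes the "changes its mind" scenario a real hazard: the second segment's preference is silently ignored.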
[jira] Commented: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720593#action_12720593 ] Michael McCandless commented on LUCENE-1673: bq. Want a convenience method for the user? TrieUtils.createDocumentField(...) , same as the sortField currently works. I don't think this is "convenient" enough. bq. If you'd like to have end-to-end experience for numeric fields, build something schema-like and put it in contribs +1 Long (medium?) term I'd love to get to this point; I think it'd make Lucene quite a bit more consumable. But we shouldn't sacrifice consumability today on the hope for that future nirvana. You already have a nice starting point here... is that something you could donate? {quote} bq. I do agree that retrieving a doc is already "buggy", in that various things are lost from your index time doc (a well known issue at this point!) How on earth is it buggy? You're working with an inverted index, you aren't supposed to get original document from it in the first place. It's like saying a hash function is buggy because it is not reversible. {quote} I completely agree: you're not supposed to get the original doc back. And the fact that Lucene's API now "pretends" you do, is wrong. We all agree to that, and that we need to fix Lucene. But, as things now stand, it's not yet fixed, so until it's fixed, I don't like intentionally making it worse. It'd be great to simply stop returning Document from IndexReader. Wanna make a patch? I don't think the new sheriff'd hold 2.9 for this though ;) {quote} bq. "hey how come I didn't get a NumericField back on my doc? Perhaps a good reason to not add a NumericField. {quote} I think NumericField (when building your doc) is still valuable, even if we can't return NumericField when you retrieve the doc. OK... since adding the bit to the stored fields is controversial, I think for 2.9, we should only add NumericField at indexing (document creation) time. 
So, we don't store a new bit in stored fields file and the index format is unchanged. > Move TrieRange to core > -- > > Key: LUCENE-1673 > URL: https://issues.apache.org/jira/browse/LUCENE-1673 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 2.9 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 2.9 > > Attachments: LUCENE-1673.patch, LUCENE-1673.patch, LUCENE-1673.patch > > > TrieRange was iterated many times and seems stable now (LUCENE-1470, > LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to > its default FieldTypes (SOLR-940) and if possible I want to move it to core > before release of 2.9. > Before this can be done, there are some things to think about: > # There are now classes called LongTrieRangeQuery, IntTrieRangeQuery, how > should they be called in core? I would suggest to leave it as it is. On the > other hand, if this keeps our only numeric query implementation, we could > call it LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below, here > are problems). Same for the TokenStreams and Filters. > # Maybe the pairs of classes for indexing and searching should be moved into > one class: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The > problem here: ctors must be able to pass int, long, double, float as range > parameters. For the end user, mixing these 4 types in one class is hard to > handle. If somebody forgets to add a L to a long, it suddenly instantiates a > int version of range query, hitting no results and so on. Same with other > types. Maybe accept java.lang.Number as parameter (because nullable for > half-open bounds) and one enum for the type. > # TrieUtils move into o.a.l.util? or document or? > # Move TokenStreams into o.a.l.analysis, ShiftAttribute into > o.a.l.analysis.tokenattributes? Somewhere else? > # If we rename the classes, should Solr stay with Trie (because there are > different impls)? 
> # Maybe add a subclass of AbstractField, that automatically creates these > TokenStreams and omits norms/tf per default for easier addition to Document > instances? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
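Point 2 in the list above - a single class whose constructors take java.lang.Number plus an explicit type enum, so a forgotten L suffix cannot silently select the wrong query type - could be sketched like this (illustrative names, not the eventual Lucene API):

```java
// Sketch of the java.lang.Number + enum idea from point 2 above. The enum
// states the intended type explicitly, so passing 5 where 5L was meant can
// no longer pick a different overload; null bounds model half-open ranges.
// All names here are illustrative stand-ins.
enum NumericType { INT, LONG, FLOAT, DOUBLE }

class NumericRange {
    final NumericType type;
    final Number min, max; // null == half-open bound

    private NumericRange(NumericType type, Number min, Number max) {
        this.type = type;
        this.min = min;
        this.max = max;
    }

    static NumericRange newRange(NumericType type, Number min, Number max) {
        // Both 5 and 5L arrive here as java.lang.Number; the enum, not the
        // argument's runtime class, decides how the range is interpreted.
        return new NumericRange(type, min, max);
    }

    boolean contains(Number n) {
        double v = n.doubleValue();
        return (min == null || v >= min.doubleValue())
            && (max == null || v <= max.doubleValue());
    }
}
```

The trade-off the comment raises still applies: mixing four numeric types behind one Number-typed API shifts type errors from compile time to the explicit enum argument.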
[jira] Commented: (LUCENE-1699) Field tokenStream should be usable with stored fields.
[ https://issues.apache.org/jira/browse/LUCENE-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720578#action_12720578 ] Michael McCandless commented on LUCENE-1699: Patch looks good: * Can you make sure CHANGES describes this new behavior (Field is allowed to have both a tokenStream and a String/Reader/binary value)? * The javadoc for readerValue is wrong (copy/paste from stringValue) * Can you spell out more clearly in the javadocs that even when a tokenStream value is set, one of String/Reader/binary may also be set, or, not, and if so, that "other" value is only used for stored fields. Eg, explain why one would use setTokenStream instead of setValue(TokenStream value). > Field tokenStream should be usable with stored fields. > -- > > Key: LUCENE-1699 > URL: https://issues.apache.org/jira/browse/LUCENE-1699 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Yonik Seeley >Assignee: Yonik Seeley >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1699.patch > > > Field.tokenStream should be usable for indexing even for stored values. > Useful for many types of pre-analyzed values (text/numbers, etc) > http://search.lucidimagination.com/search/document/902bad4eae20bdb8/field_tokenstreamvalue -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
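The behavior under review - a field carrying both a pre-analyzed token stream (consumed by the indexer) and a separate plain value (used only as the stored field) - can be illustrated with a small stand-in. StubField is hypothetical; the real class under discussion is Lucene's Field:

```java
import java.util.Iterator;
import java.util.List;

// Stand-in illustrating the reviewed behavior: one field object holds BOTH
// a stored value (written to the stored fields file) and a token stream
// (consumed at indexing time). Setting the token stream must not clobber
// the stored value -- that is the distinction between setTokenStream and
// setValue(TokenStream) mentioned in the review comment.
class StubField {
    private final String name;
    private final String storedValue;     // what stored-field retrieval returns
    private Iterator<String> tokenStream; // what the indexer consumes

    StubField(String name, String storedValue) {
        this.name = name;
        this.storedValue = storedValue;
    }

    // Sets only the analyzed form; the stored value is left untouched.
    void setTokenStream(List<String> tokens) {
        this.tokenStream = tokens.iterator();
    }

    String stringValue() { return storedValue; }
    Iterator<String> tokenStreamValue() { return tokenStream; }
}
```

This is useful for pre-analyzed values such as numbers: the stored form stays human-readable while the indexed tokens are the encoded form.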
[jira] Commented: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720574#action_12720574 ] Michael McCandless commented on LUCENE-1673: Note that LUCENE-1505 is open for cutting over contrib/spatial to NumericUtils. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1505) Remove NumberUtils from spatial contrib
[ https://issues.apache.org/jira/browse/LUCENE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720572#action_12720572 ] Michael McCandless commented on LUCENE-1505: LUCENE-1496 is "won't fix" because trie's NumericUtils subsumes Solr's NumberUtils, ie, we now need to migrate local lucene to NumericUtils. And we want to do this for 2.9, since local lucene is not yet released and we have the freedom to make such an otherwise drastic change to the index format. I'll update this issue to reflect its new goal. > Remove NumberUtils from spatial contrib > --- > > Key: LUCENE-1505 > URL: https://issues.apache.org/jira/browse/LUCENE-1505 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/spatial >Reporter: Ryan McKinley >Assignee: Simon Willnauer > Fix For: 2.9 > > > Currently spatial contrib includes a copy of NumberUtils from solr (otherwise > it would depend on solr) > Once LUCENE-1496 is sorted out, this copy should be removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-1505) Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils
[ https://issues.apache.org/jira/browse/LUCENE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1505: --- Fix Version/s: 2.9 Summary: Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils (was: Remove NumberUtils from spatial contrib) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1692) Contrib analyzers need tests
[ https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720570#action_12720570 ] Michael McCandless commented on LUCENE-1692: Robert, you should probably also hold up on API conversion, since the API itself is now changing (LUCENE-1693). > Contrib analyzers need tests > > > Key: LUCENE-1692 > URL: https://issues.apache.org/jira/browse/LUCENE-1692 > Project: Lucene - Java > Issue Type: Test > Components: contrib/analyzers >Reporter: Robert Muir >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1692.txt, LUCENE-1692.txt > > > The analyzers in contrib need tests, preferably ones that test the behavior > of all the Token 'attributes' involved (offsets, type, etc) and not just what > they do with token text. > This way, they can be converted to the new api without breakage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720571#action_12720571 ] Michael Busch commented on LUCENE-1693: --- {quote} I am working on that, I have a meeting now, after that. {quote} Good luck. I'm off to bed... > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. 
This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved by avoiding multiple > AttributeImpl instances in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even more significantly from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. 
> All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
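The two reflection mechanisms the patch description relies on - the default factory that resolves an Attribute interface to a class with the 'Impl' postfix, and addAttributeImpl walking the class hierarchy for every Attribute sub-interface - can be sketched in plain Java. These are simplified stand-ins for Lucene's AttributeSource/AttributeImpl, assuming top-level classes in the same package:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-ins for Lucene's Attribute machinery, compiled as top-level classes
// in the default package so the "<InterfaceName>Impl" name lookup resolves.
interface Attribute {}
interface TermAttribute extends Attribute { String term(); }

class TermAttributeImpl implements TermAttribute {
    public String term() { return "demo"; }
}

class AttributeSourceSketch {
    final Map<Class<?>, Object> attributes = new HashMap<>();

    // (1) Default factory behavior: look for "<InterfaceName>Impl"
    // alongside the interface and instantiate it reflectively.
    @SuppressWarnings("unchecked")
    <A extends Attribute> A addAttribute(Class<A> iface) throws Exception {
        Object existing = attributes.get(iface);
        if (existing != null) return (A) existing;
        Class<?> implClass = Class.forName(iface.getName() + "Impl");
        Object impl = implClass.getDeclaredConstructor().newInstance();
        addAttributeImpl(impl);
        return (A) impl;
    }

    // (2) Walk up the impl's class hierarchy and map every implemented
    // Attribute sub-interface to this single instance.
    void addAttributeImpl(Object impl) {
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> i : c.getInterfaces()) {
                if (Attribute.class.isAssignableFrom(i) && i != Attribute.class) {
                    attributes.put(i, impl);
                }
            }
        }
    }
}
```

Because one impl can be mapped under several interfaces, a single object (like Token, which implements all six attribute interfaces) is cloned once rather than six times in captureState - which is the 20% Tee/Sink speedup described above.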
[jira] Commented: (LUCENE-1516) Integrate IndexReader with IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720569#action_12720569 ] Michael McCandless commented on LUCENE-1516: {quote} Currently we check the info for deletes, however with this patch, I think we need to check the segmentReader which could have deletes that don't show up in the info. {quote} Good catch! Can you open a new issue & attach patch? Though: how would you do this? Right now MergePolicy never receives a SegmentReader, and makes all its decisions based on the SegmentInfo. Each SegmentReader tracks its own pendingDelCount... maybe we add a private pendingDelCount to SegmentInfo, and change SegmentReader to use that instead? That'd be a single source, and then the merge policy could retrieve it... > Integrate IndexReader with IndexWriter > --- > > Key: LUCENE-1516 > URL: https://issues.apache.org/jira/browse/LUCENE-1516 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, > LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, > LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, > LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, > LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, > LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, > LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, > LUCENE-1516.patch, magnetic.png, ssd.png, ssd2.png > > Original Estimate: 672h > Remaining Estimate: 672h > > The current problem is an IndexReader and IndexWriter cannot be open > at the same time and perform updates as they both require a write > lock to the index. 
While methods such as IW.deleteDocuments enable > deleting from IW, methods such as IR.deleteDocument(int doc) and > norms updating are not available from IW. This limits the > ability to update the index dynamically or in > realtime without closing the IW and opening an IR, deleting or > updating norms, flushing, then opening the IW again, a process which > can be detrimental to realtime updates. > This patch will expose an IndexWriter.getReader method that returns > the currently flushed state of the index as a class that implements > IndexReader. The new IR implementation will differ from existing IR > implementations such as MultiSegmentReader in that flushing will > synchronize updates with IW in part by sharing the write lock. All > methods of IR will be usable including reopen and clone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
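The getReader idea described in the issue can be reduced to a toy model: the writer hands out an immutable point-in-time view of its flushed state, so searches and further updates proceed concurrently without a close/reopen cycle. This is a deliberately simplified stand-in, not Lucene's IndexWriter/IndexReader:

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of a writer exposing a point-in-time reader over its
// flushed state. All names and behavior here are simplified assumptions.
public class ToyWriter {
    private final List<String> flushed = new ArrayList<>();
    private final List<String> buffered = new ArrayList<>();

    public synchronized void addDocument(String doc) { buffered.add(doc); }

    public synchronized void flush() { flushed.addAll(buffered); buffered.clear(); }

    /** Snapshot of everything flushed so far; later updates are invisible to it. */
    public synchronized List<String> getReader() {
        flush();                      // getReader implies a flush of buffered docs
        return List.copyOf(flushed);  // immutable point-in-time view
    }
}
```

The essential property, as in the issue, is that the snapshot never changes after it is handed out; only a fresh getReader() call sees newer updates.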
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720568#action_12720568 ] Uwe Schindler commented on LUCENE-1693: --- bq. I think you should try it out and see if you run into problems. This should not be much code to write. I am working on that; I have a meeting now, and will continue after that. bq. You might have to do tricks with Tee/Sink, if the sink is wrapped by a filter with the new API, but the tee wraps a stream with the old API, or vice versa. This is currently working without any problems, but I want to add a test case that explicitly chains some dummy filters in deprecated and non-deprecated form and looks at what's coming out. But it should work. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). 
Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. 
I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to t
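The reflection walk that the description attributes to addAttributeImpl can be sketched in a few lines. This is an illustrative reimplementation under assumed names (the nested Attribute/TermAttribute interfaces here stand in for Lucene's), simplified in that it does not recurse into superinterfaces:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of addAttributeImpl: walk up the class hierarchy of the passed-in
// impl, find every implemented interface that extends Attribute, and map
// each interface to the single instance. Names are assumptions, not Lucene's.
public class AttrRegistry {
    public interface Attribute {}
    public interface TermAttribute extends Attribute { String term(); }
    public interface OffsetAttribute extends Attribute { int startOffset(); }

    private final Map<Class<? extends Attribute>, Object> attributes = new LinkedHashMap<>();

    @SuppressWarnings("unchecked")
    public void addAttributeImpl(Object impl) {
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                // register only sub-interfaces of Attribute, not Attribute itself
                if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
                    attributes.put((Class<? extends Attribute>) iface, impl);
                }
            }
        }
    }

    public <A extends Attribute> A getAttribute(Class<A> iface) {
        return iface.cast(attributes.get(iface));
    }

    /** One impl serving two attribute interfaces, the way Token serves all six. */
    public static class TokenLike implements TermAttribute, OffsetAttribute {
        public String term() { return "lucene"; }
        public int startOffset() { return 0; }
    }
}
```

After registering a single TokenLike, lookups through either interface return the same instance, which is exactly the sharing that makes per-token cloning cheaper.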
Re: Lucene 2.9 Again
On Wed, Jun 17, 2009 at 10:42 AM, Michael McCandless wrote: > I would love to see function queries consolidated between Solr and > Lucene! I think it's a prime example of duplicated and then diverged > sources between Lucene and Solr... > > And it's fabulous that you are "volunteering", Simon ;) We have > precious few volunteers that stride both communities well enough, and > have the itch, to do this. > > So I'd love to see progress made towards this but I also think > it's a little too big to hold up 2.9 for. Yeah I agree! > > The back compat requirement is certainly important, but I would assume > workable, ie it should not hold up this consolidation... I think this is a step-by-step task and it should be done with back compat in mind. I think it is not crucial to have it in 2.9, as Solr might be keen to get 1.5 Lucene releases integrated too. So it's not a big deal if it gets integrated with 3.* releases. > > Mike > > On Wed, Jun 17, 2009 at 4:27 AM, Simon > Willnauer wrote: >> On Tue, Jun 16, 2009 at 11:47 PM, Yonik >> Seeley wrote: >>> On Tue, Jun 16, 2009 at 5:38 PM, Simon >>> Willnauer wrote: I was thinking of adding a patch for https://issues.apache.org/jira/browse/LUCENE-1085 >>> >>> That's *way* too big of an issue and it breaks back compat in Solr (to >>> change from Solr's to Lucene's version - I know many people who have >>> implemented and plugged in their own functions.) >> Do you have a pointer to back compat policy in solr or is it the same >> as in Lucene?! >> >> simon >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >> >> - >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-dev-h...@lucene.apache.org >> >> > - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720562#action_12720562 ] Michael Busch commented on LUCENE-1693: --- For caching: I guess you would have to implement the wrapper's clone() method such that it returns what delegate.clone() returns. This would put a clone of the original Token (or subclass) into the cache, instead of a clone of the wrapper, which is good. Then the second clone also clones the original Token again and puts it into a second wrapper that the CachingTokenStream owns. Hmm, complicated, but it should work. Need to think more about whether all mixes of old and new TokenStreams would work... and whether this approach affects performance in any way or changes runtime behavior in corner cases... Gosh, this is like running a huge backwards-compatibility JUnit test suite in my head every time we consider a different approach. :) I think you should try it out and see if you run into problems. This should not be much code to write. You might have to do tricks with Tee/Sink, if the sink is wrapped by a filter with the new API, but the tee wraps a stream with the old API, or vice versa. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java
Re: Lucene 2.9 Again
On Tue, Jun 16, 2009 at 7:16 PM, Yonik Seeley wrote: > On Tue, Jun 16, 2009 at 6:37 PM, Mark Miller wrote: >> I've looked at the release todo wiki and I am still having nightmares. > > Indeed - it's gotten 5 times longer since the last time I did Lucene or Solr. > There are parts that aren't strictly part of the release process IMO - > things like maven seem optional. For better or worse, it gets bigger whenever someone (recently, me!) makes a silly mistake and then goes and updates the release todo ;) I do think it could use some consolidating, though... Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
On Tue, Jun 16, 2009 at 6:06 PM, Michael Busch wrote: > Cool, seems like Mark is volunteering to be the 2.9 release manager ;) Yay! Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
I would love to see function queries consolidated between Solr and Lucene! I think it's a prime example of duplicated and then diverged sources between Lucene and Solr... And it's fabulous that you are "volunteering", Simon ;) We have precious few volunteers that stride both communities well enough, and have the itch, to do this. So I'd love to see progress made towards this but I also think it's a little too big to hold up 2.9 for. The back compat requirement is certainly important, but I would assume workable, ie it should not hold up this consolidation... Mike On Wed, Jun 17, 2009 at 4:27 AM, Simon Willnauer wrote: > On Tue, Jun 16, 2009 at 11:47 PM, Yonik > Seeley wrote: >> On Tue, Jun 16, 2009 at 5:38 PM, Simon >> Willnauer wrote: >>> I was thinking of adding a patch for >>> https://issues.apache.org/jira/browse/LUCENE-1085 >> >> That's *way* too big of an issue and it breaks back compat in Solr (to >> change from Solr's to Lucene's version - I know many people who have >> implemented and plugged in their own functions.) > Do you have a pointer to back compat policy in solr or is it the same > as in Lucene?! > > simon >> >> -Yonik >> http://www.lucidimagination.com >> > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: madvise(ptr, len, MADV_SEQUENTIAL)
I think readahead would be less interesting to Lucene; while we definitely want a certain amount of readahead (to "amortize" the seeking), too much readahead means evicting things from the IO cache. OSs already do a fair job (I think) of some amount of readahead, though if we do gain posix_fadvise in Java and we use it to advise to not IO-cache those reads, I wonder how that impacts the OS's readahead... Some serious empirical testing is needed. Let the machines tell us how they work ;) Mike On Tue, Jun 16, 2009 at 11:20 PM, Jason Rutherglen wrote: > Sorry, not portable, but POSIX_FADV_WILLNEED is, which can be used with > posix_fadvise. > > On Tue, Jun 16, 2009 at 8:12 PM, Jason Rutherglen > wrote: >> >> Perhaps we'd also like to request readahead be included in JDK7? >> >> http://linux.die.net/man/2/readahead >> >> On Tue, Jun 16, 2009 at 9:03 AM, Michael McCandless >> wrote: >>> >>> Hmm... posix_fadvise lets you do this with a file descriptor; this >>> would be better for Lucene (per descriptor not per mapped region of >>> RAM) since we could "advise" independent of which FSDir impl is in >>> use... >>> >>> Mike >>> >>> On Tue, Jun 16, 2009 at 10:32 AM, Uwe Schindler wrote: >>> > But to use it, we should change MMapDirectory to also use the mapping >>> > when >>> > writing to files. I thought about it, it is very simple to implement >>> > (just >>> > copy the IndexInput and change all gets() to sets()) >>> > >>> > - >>> > Uwe Schindler >>> > H.-H.-Meier-Allee 63, D-28213 Bremen >>> > http://www.thetaphi.de >>> > eMail: u...@thetaphi.de >>> > >>> >> -Original Message- >>> >> From: Michael McCandless [mailto:luc...@mikemccandless.com] >>> >> Sent: Tuesday, June 16, 2009 4:22 PM >>> >> To: java-dev@lucene.apache.org >>> >> Cc: Alan Bateman; nio-disc...@openjdk.java.net >>> >> Subject: Re: madvise(ptr, len, MADV_SEQUENTIAL) >>> >> >>> >> Lucene could really make use of this method. 
When a segment merge >>> >> takes place, we can read & write many GB of data, which without >>> >> madvise on many OSs would effectively flush the IO cache (thus hurting >>> >> our search performance). >>> >> >>> >> Mike >>> >> >>> >> On Mon, Jun 15, 2009 at 6:01 PM, Jason >>> >> Rutherglen wrote: >>> >> > Thanks Alan. >>> >> > >>> >> > I cross posted this to the Lucene dev list where we are discussing >>> >> > using >>> >> > madvise for minimizing unnecessary IO cache usage when merging >>> >> > segments >>> >> > (where we really want the newly merged segments in the IO cache >>> >> > rather >>> >> than >>> >> > the old segment files). >>> >> > >>> >> > How would the advise method work? Would there need to be a hint in >>> >> > the >>> >> > FileChannel.map method? >>> >> > >>> >> > -J >>> >> > >>> >> > On Mon, Jun 15, 2009 at 12:36 AM, Alan Bateman >>> >> > >>> >> wrote: >>> >> >> >>> >> >> Jason Rutherglen wrote: >>> >> >>> >>> >> >>> Is there going to be a way to do this in the new Java IO APIs? >>> >> >> >>> >> >> Good question, as it has come up a few times and is needed for some >>> >> >> important use-cases. A while back I looked into adding a >>> >> >> MappedByteBuffer#advise method to allow the application provide >>> >> >> hints >>> >> on the >>> >> >> expected usage but didn't complete it. We should probably look at >>> >> >> this >>> >> again >>> >> >> for jdk7. >>> >> >> >>> >> >> -Alan. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 Again
On Tue, Jun 16, 2009 at 11:47 PM, Yonik Seeley wrote: > On Tue, Jun 16, 2009 at 5:38 PM, Simon > Willnauer wrote: >> I was thinking of adding a patch for >> https://issues.apache.org/jira/browse/LUCENE-1085 > > That's *way* too big of an issue and it breaks back compat in Solr (to > change from Solr's to Lucene's version - I know many people who have > implemented and plugged in their own functions.) Do you have a pointer to back compat policy in solr or is it the same as in Lucene?! simon > > -Yonik > http://www.lucidimagination.com > - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720550#action_12720550 ] Uwe Schindler commented on LUCENE-1693: --- OK, I have a solution: I write a wrapper class (a reference) that implements all token attribute interfaces but passes the calls down to the wrapped Token/subclass-of-Token. Instead of cloning the token when wrapping the return value of next(), I could simply put it into the wrapper. The instance stays the same; only the delegate is different. Outside users or TokenStreams using the new API will only see one instance that implements all interfaces. (In principle the same as your backwards-compatibility thing in the DocInverter.) Would this be an idea? > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java
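The wrapper-as-reference idea proposed in the comment above can be sketched with two tiny stand-in classes. These are illustrative names, not Lucene's: the point is only that new-API consumers hold one stable instance while the delegate underneath swaps per token:

```java
// Sketch of the proposed wrapper: one stable object implements the attribute
// interface and forwards to whatever Token the old next() just returned.
// All class names here are assumptions for illustration.
public class TokenWrapperSketch {
    public interface TermAttribute { String term(); }

    /** Stand-in for the old-API Token (or a subclass of it). */
    public static class OldToken implements TermAttribute {
        private final String term;
        public OldToken(String term) { this.term = term; }
        public String term() { return term; }
    }

    /** The identity new-API code holds on to; only the delegate swaps per token. */
    public static class TokenWrapper implements TermAttribute {
        private OldToken delegate;
        public void setDelegate(OldToken t) { delegate = t; }
        public String term() { return delegate.term(); }
    }
}
```

Because the wrapper's identity never changes, downstream filters that cached the attribute reference keep working, which is exactly the property the comment is after.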
[jira] Updated: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1630: --- Attachment: LUCENE-1630.patch Fixed most of your comments Mike. I also noticed I did not document Collector.acceptsDocsOutOfOrder, so fixed that too. The remaining things we should agree on are: * deprecated Weight and add QueryWeight variants to Searchable. I prefer to do it post 2.9. * move scoresDocsOutOfOrder to Scorer instead of Weight. I fixed BooleanWeight to return true if there is a chance it will return BS (i.e. there are no required clauses and <32 prohibited clauses). I guess we'll need to discuss that one more. * Make Collector.acceptsDocsOutOfOrder and QueryWeight.scoresDocsOutOfOrder abstract - I think the default impl makes sense for most of the imps out there and the ones in core, but I don't have a strong feeling against making it abstract. All tests pass, and javadocs are good as well. > Mating Collector and Scorer on doc Id orderness > --- > > Key: LUCENE-1630 > URL: https://issues.apache.org/jira/browse/LUCENE-1630 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch > > > This is a spin off of LUCENE-1593. This issue proposes to expose appropriate > API on Scorer and Collector such that one can create an optimized Collector > based on a given Scorer's doc-id orderness and vice versa. Copied from > LUCENE-1593, here is the list of changes: > # Deprecate Weight and create QueryWeight (abstract class) with a new > scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) > method. QueryWeight implements Weight, while score(reader) calls > score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) > is defined abstract. > #* Also add QueryWeightWrapper to wrap a given Weight implementation. 
This > one will also be deprecated, as well as package-private. > #* Add to Query variants of createWeight and weight which return QueryWeight. > For now, I prefer to add a default impl which wraps the Weight variant > instead of overriding in all Query extensions, and in 3.0 when we remove the > Weight variants - override in all extending classes. > # Add to Scorer isOutOfOrder with a default to false, and override in BS to > true. > # Modify BooleanWeight to extend QueryWeight and implement the new scorer > method to return BS2 or BS based on the number of required scorers and > setAllowOutOfOrder. > # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns > true/false. > #* Use it in IndexSearcher.search methods, that accept a Collector, in order > to create the appropriate Scorer, using the new QueryWeight. > #* Provide a static create method to TFC and TSDC which accept this as an > argument and creates the proper instance. > #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order > Scorer and check on the resulting Scorer isOutOfOrder(), so that we can > create the optimized Collector instance. > # Modify IndexSearcher to use all of the above logic. > The only class I'm worried about, and would like to verify with you, is > Searchable. If we want to deprecate all the search methods on IndexSearcher, > Searcher and Searchable which accept Weight and add new ones which accept > QueryWeight, we must do the following: > * Deprecate Searchable in favor of Searcher. > * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) > break back-compat and add them as abstract (like we've done with the new > Collector method) or (2) add them with a default impl to call the Weight > versions, documenting these will become abstract in 3.0. > * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend > Searcher. 
That's the part I'm a little bit worried about - Searchable > implements java.rmi.Remote, which means there could be an implementation out > there which implements Searchable and extends something different than > UnicastRemoteObject, like Activeable. I think there is very small chance this > has actually happened, but would like to confirm with you guys first. > * Add a deprecated, package-private, SearchableWrapper which extends Searcher > and delegates all calls to the Searchable member. > * Deprecate all uses of Searchable and add Searcher instead, defaulting the > old ones to use SearchableWrapper. > * Make all the necessary changes to IndexSearcher, MultiSearcher etc. > regarding overriding these new methods. > One other optimization that was discussed in LUCENE-1593 is to expose a > topScorer() API (on Weight) which returns a Scorer that its score(Collector
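The orderness handshake described in the issue (Collector.acceptsDocsOutOfOrder deciding which scorer to build) can be sketched as follows. The names follow the issue text, but this is a simplified stand-in, not the patch's actual code:

```java
// Sketch of the searcher/collector handshake: ask the collector whether it
// tolerates out-of-order doc IDs, then pick the scorer mode accordingly.
// BooleanScorer (out-of-order, faster) vs. BooleanScorer2 (in-order) follows
// the issue's discussion; the classes here are illustrative only.
public class OrdernessSketch {
    public interface Collector {
        void collect(int doc);
        boolean acceptsDocsOutOfOrder();
    }

    /** E.g. a plain hit counter does not care about doc order. */
    public static class CountingCollector implements Collector {
        public int hits;
        public void collect(int doc) { hits++; }
        public boolean acceptsDocsOutOfOrder() { return true; }
    }

    /** The searcher side: prefer the faster out-of-order scorer when allowed. */
    public static String chooseScorer(Collector c) {
        return c.acceptsDocsOutOfOrder() ? "BooleanScorer" : "BooleanScorer2";
    }
}
```

A collector that tracks the top-N in insertion order would return false and force the in-order scorer; the default-vs-abstract debate in the comment is about whether implementors must make that choice explicitly.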
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720538#action_12720538 ] Michael Busch commented on LUCENE-1693: --- OK, what about this sentence in Token.java: {code:java} When caching a reusable token, clone it. When injecting a cached token into a stream that can be reset, clone it again. {code} This double-cloning is exactly what CachingTokenFilter and Tee/Sink do, so they preserve the actual Token class type. You can easily construct an example similar to the tool I attached that uses these streams. > AttributeSource/TokenStream API improvements > > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, lucene-1693.patch, > TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. 
It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. 
for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@luce
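The double-cloning advice quoted from Token.java above can be illustrated with a minimal, self-contained sketch. The `MiniToken` and `MiniCachingFilter` classes below are hypothetical stand-ins, not the actual Lucene `Token`/`CachingTokenFilter`; they only model the pattern: clone once when caching (because the producer will reuse the instance), and clone again on each replay (because consumers may mutate what they receive).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's Token; not the real class.
class MiniToken implements Cloneable {
    String term;
    MiniToken(String term) { this.term = term; }
    @Override public MiniToken clone() { return new MiniToken(term); }
}

// Sketch of the caching pattern: clone when caching, clone again when replaying.
class MiniCachingFilter {
    private final List<MiniToken> cache = new ArrayList<>();

    void consume(MiniToken reusable) {
        // First clone: the producer will keep reusing 'reusable'.
        cache.add(reusable.clone());
    }

    List<MiniToken> replay() {
        // Second clone: consumers may freely mutate their copies.
        List<MiniToken> out = new ArrayList<>();
        for (MiniToken t : cache) out.add(t.clone());
        return out;
    }
}

public class DoubleCloneDemo {
    public static void main(String[] args) {
        MiniCachingFilter f = new MiniCachingFilter();
        MiniToken reusable = new MiniToken("foo");
        f.consume(reusable);
        reusable.term = "bar";                      // producer reuses the instance
        MiniToken replayed = f.replay().get(0);
        replayed.term = "baz";                      // consumer mutates its copy
        System.out.println(f.replay().get(0).term); // cache still holds "foo"
    }
}
```

Without either clone, the producer's reuse or the consumer's mutation would silently corrupt the cache, which is exactly what the quoted documentation warns about.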
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720534#action_12720534 ] Uwe Schindler commented on LUCENE-1693: --- Hi Michael, in principle your test is invalid: it has other token filters in the chain over which the user has no control. With the two filters mentioned it may work, because they do not change the reusableToken instance. But the API clearly states that the reusableToken must not be relied on and that another instance may be returned, so this is really unsupported behaviour. If you remove the filters in between, it works correctly. And this could even fail with 2.4 if you put other token filters in your chain. In my opinion, the advantages of token reuse clearly outweigh the small problems with (unsupported) usage. The API does exactly what is mentioned in the API docs for 2.4.1. The main advantage is that you can mix old and new filter instances and you lose nothing...
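Uwe's point about the 2.4 `next(Token reusableToken)` contract can be sketched with a small self-contained example. The `SketchToken`/`SketchFilter` classes below are hypothetical, not real Lucene classes; they only demonstrate that a filter is free to ignore the passed-in reusable instance and return its own, so any code that keys on the identity of the reusable token is unsupported.

```java
// Hypothetical stand-in types illustrating the Lucene 2.4 next(Token) contract.
class SketchToken {
    String term;
    SketchToken(String term) { this.term = term; }
}

class SketchFilter {
    private int i = 0;
    private final String[] terms = {"quick", "brown"};

    // Contract: the caller passes a reusable token, but the filter MAY
    // return a different instance; callers must not assume identity.
    SketchToken next(SketchToken reusable) {
        if (i >= terms.length) return null;
        // This (legal) implementation ignores 'reusable' entirely:
        return new SketchToken(terms[i++]);
    }
}

public class ReusableContractDemo {
    public static void main(String[] args) {
        SketchFilter f = new SketchFilter();
        SketchToken reusable = new SketchToken("");
        SketchToken returned = f.next(reusable);
        // Code that depends on 'returned == reusable' is unsupported:
        System.out.println(returned == reusable); // false here, and that's allowed
        System.out.println(returned.term);        // "quick" — always read the returned token
    }
}
```

A consumer that caches `reusable` instead of the returned token works only by accident with filters that happen to reuse the instance, which is why adding other filters to the chain can break it even on 2.4.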
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720534#action_12720534 ] Uwe Schindler edited comment on LUCENE-1693 at 6/17/09 12:39 AM: - Hi Michael, in principle your test is invalid: it has other token filters in the chain, over which the user has no control. With the two filters mentioned it may work, because they do not change the reusableToken instance. But the API clearly states that the reusableToken must not be relied on and that another instance may be returned, so this is really unsupported behaviour. If you remove the filters in between, it works correctly. And this could even fail with 2.4 if you put other token filters in your chain. In my opinion, the advantages of token reuse clearly outweigh the small problems with (unsupported) usage. The API does exactly what is mentioned in the API docs for 2.4.1. The main advantage is that you can mix old and new filter instances and you lose nothing...
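For reference, the captureState/restoreState cloning mechanism summarized in the patch description earlier in this thread can be sketched as follows. The `Mini*` classes are hypothetical stand-ins, not the actual Lucene implementation: capture deep-clones each unique attribute instance into a linked list of states, and restore copies the saved values back into the live attributes via `copyTo()`.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical minimal model of AttributeImpl with clone()/copyTo().
abstract class MiniAttr implements Cloneable {
    abstract void copyTo(MiniAttr target);
    @Override public MiniAttr clone() {
        try { return (MiniAttr) super.clone(); }
        catch (CloneNotSupportedException e) { throw new AssertionError(e); }
    }
}

class TermAttr extends MiniAttr {
    String term = "";
    @Override void copyTo(MiniAttr target) { ((TermAttr) target).term = term; }
}

// Linked list of cloned attribute values, as in AttributeSource.State.
class MiniState {
    MiniAttr attribute;
    MiniState next;
}

class MiniSource {
    final Set<MiniAttr> attrs = new LinkedHashSet<>(); // unique instances only

    MiniState captureState() {            // deep-clone each unique instance
        MiniState head = null, tail = null;
        for (MiniAttr a : attrs) {
            MiniState s = new MiniState();
            s.attribute = a.clone();
            if (head == null) head = s; else tail.next = s;
            tail = s;
        }
        return head;
    }

    void restoreState(MiniState state) {  // copy saved values back into live attrs
        for (MiniAttr a : attrs) {        // simplification: pairs by insertion order
            state.attribute.copyTo(a);
            state = state.next;
        }
    }
}

public class CaptureStateDemo {
    public static void main(String[] args) {
        MiniSource src = new MiniSource();
        TermAttr term = new TermAttr();
        src.attrs.add(term);
        term.term = "foo";
        MiniState saved = src.captureState(); // snapshot
        term.term = "bar";                    // stream moves on
        src.restoreState(saved);              // replay the snapshot
        System.out.println(term.term);        // back to "foo"
    }
}
```

The sketch pairs states with attributes by insertion order for brevity; the key idea is that only the unique `AttributeImpl` instances are cloned, so a single object implementing several attribute interfaces is cloned once, which is where the reported speedup comes from.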
Re: New Token API was Re: Payloads and TrieRangeQuery
On 6/15/09 10:10 AM, Grant Ingersoll wrote: But, as Michael M reminded me, it is complex, so please accept my apologies. No worries, Grant! I was not really offended, but rather confused... Thanks for clarifying. Michael
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720530#action_12720530 ] Michael Busch commented on LUCENE-1693: --- But I'll definitely buy Uwe a beer if he comes up with a solution that is more elegant and doesn't have the mentioned disadvantages! :)
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720529#action_12720529 ] Michael Busch commented on LUCENE-1693: --- I don't think we really mention subclassing of Token in the documentation. We also certainly don't prevent it. The tool I wrote works fine with 2.4; if you add other filters to the chain, it might not work anymore. But since we don't promise that subclassing of Token works everywhere, that's probably fine. We're deprecating the old API anyway, so we shouldn't have to introduce new stuff to fully support subclassing Token. My point here is just that this is a very complex API (even though it looks pretty simple). When I wrote the new TokenStream API patch at the end of last year, I thought about all these possibilities for making backwards compatibility more elegant. But I wanted to be certain not to break any runtime behavior or affect performance negatively. Therefore I decided not to mess with the old API, but rather to put the burden of implementing both APIs on the committers during the transition phase. I know this is somewhat annoying; on the other hand, how often do we really add new TokenFilters to the core? Often implementing incrementToken() takes 10 minutes if you already have next() implemented: just copy & paste and change a few things.
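The copy-and-adapt conversion Michael describes can be modelled with a small self-contained sketch. The classes below are hypothetical, not real Lucene filters: the old style returns a (possibly new) value per call with null at end-of-stream, while the new style mutates a shared attribute in place and returns a boolean. The loop body is essentially the same; only where the result lands changes.

```java
// Hypothetical self-contained model of the old next() vs. new incrementToken()
// styles; not the actual Lucene classes.
class CharTermAttr {
    String term = "";
}

// Old style: each call returns a result object, null at end of stream.
class OldStyleLowercaser {
    private final String[] input;
    private int i = 0;
    OldStyleLowercaser(String[] input) { this.input = input; }
    String next() {
        if (i >= input.length) return null;
        return input[i++].toLowerCase();
    }
}

// New style: one shared attribute is mutated in place; the boolean return
// signals whether a token was produced. Same loop body as next(), but the
// result lands in the attribute instead of a return value.
class NewStyleLowercaser {
    final CharTermAttr termAtt = new CharTermAttr(); // shared, reused attribute
    private final String[] input;
    private int i = 0;
    NewStyleLowercaser(String[] input) { this.input = input; }
    boolean incrementToken() {
        if (i >= input.length) return false;
        termAtt.term = input[i++].toLowerCase();
        return true;
    }
}

public class ConversionDemo {
    public static void main(String[] args) {
        NewStyleLowercaser f = new NewStyleLowercaser(new String[] {"Quick", "BROWN"});
        while (f.incrementToken()) {
            System.out.println(f.termAtt.term); // quick, then brown
        }
    }
}
```

This side-by-side shape is why the conversion is usually quick: the per-token logic is copied unchanged, and only the input/output plumbing is rewritten against the attribute.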