[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844533#action_12844533 ] Robert Muir commented on LUCENE-2309: - {quote} So with the current APIs we cannot get around the requirement to reuse the same Attribute instances during the whole indexing without a major speed impact. {quote} I agree. I guess I'll try to simplifiy my concern: maybe we don't necessarily need something that looks like the old TokenStream API, but I feel it would be worth our time to think about supporting 'some alternative API' that makes it easier to work with lots of context across different Tokens. I personally do not mind how this is done with the capture/restore state API, but I feel that its pretty unnatural for many developers, and in the future folks might want to do more complex analysis (maybe even light pos-tagging, etc) that requires said context, and we should plan for this. I feel this wasn't such an issue with the old TokenStream API, but maybe there is another way to address this potential problem. > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844528#action_12844528 ] Uwe Schindler commented on LUCENE-2309: --- There is one problem that cannot be easy solved (for all proposals here), if we want to provide an old-style API that does not require reuse of tokens: The problem with AttributeProvider is that if we want to support something (like rmuir proposed before) that looks like the old "Token next()", we need an AttributeProvider that passes the AttributeSource to the indexer on each Token! And that would lead to lots of getAttribute() calls, that would slowdown indexing! So with the current APIs we cannot get around the requirement to reuse the same Attribute instances during the whole indexing without a major speed impact. This can only be solved with my nice BCEL proxy Attributes, so you can exchange the inner attribute impl. Or do it like TokenWrapper in 2.9 (yes, we can reactivate that API somehow as an easy use-addendum). > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844523#action_12844523 ] Simon Willnauer commented on LUCENE-2309: - bq. Then people could freely use Lucene to index off a foreign analysis chain... That is what I was talking about! {quote} I'd like to donate my two cents here - we've just recently changed the TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only now the API has changed slightly. The proposals here, w/ the AttConsumer/Acceptor, that IW will delegate itself to a Field, so the Field will call back to IW seems too much complicated to me. Users that write Analyzers/TokenStreams/AttributeSources, should not care how they are indexed/stored etc. Forcing them to implement this push logic to IW seems to me like a real unnecessary overhead and complexity. {quote} We can surely hide this implementation completely from field. I consider this being similar to Collector where you pass it explicitly to the search method if you want to have a different behavior. Maybe something like a AttributeProducer. I don't think adding this to field makes a lot of sense at all and it is not worth the complexity. bq. Will the Field also control how stored fields are added? Or only AttributeSourced ones? IMO this is only about inverted fields. bq. We (IW) control the indexing flow, and not the user. The user only gets the possibility to exchange the analysis chain but not the control flow. The user already can mess around with stuff in incrementToken(), the only thing we change / invert is that the indexer does not know about TokenStreams anymore. it does not change the controlflow though. > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844516#action_12844516 ] Mark Miller commented on LUCENE-2309: - bq. Also IRC is not logged/archived and searchable (I think?) which makes it impossible to trace back a discussion, and/or randomly stumble upon it in Google. Apaches rule is, if it didn't happen on this lists, it didn't happen. #IRC is a great way for people to communicate and hash stuff out, but its not necessary you follow it. If you have questions or want further elaboration, just ask. No one can expect you to follow IRC, nor is it a valid reference for where something was decided. IRC is great - I think its really benefited having devs discuss there - but the official position is, if it didn't happen on the list, it didnt actually happen. > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844515#action_12844515 ] Uwe Schindler commented on LUCENE-2309: --- bq. I'd like to donate my two cents here - we've just recently changed the TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only now the API has changed slightly. The proposals here, w/ the AttConsumer/Acceptor, that IW will delegate itself to a Field, so the Field will call back to IW seems too much complicated to me. Users that write Analyzers/TokenStreams/AttributeSources, should not care how they are indexed/stored etc. Forcing them to implement this push logic to IW seems to me like a real unnecessary overhead and complexity. The idea was not to change this behaviour, but also give the user the posibility to reverse that. For some tokenstreams it would simplify things much. The current IndexWriter code works exactly like that: # DocInverter gets TokenStream # DocInverter calls reset() -- to be left out and moved to field/analyzer # DocInverter does while-loop on incrementToken - it iterates. On each Token it calls add() on the field consumer # DocInverter calls end() and updates end offset # DocInverter calls close() -- to be left out and moved to field/analyzer The change is simply that step (3) is removed from DocInverter which only provides the add() method for accepting Tokens. The current while loop simply is done in the current TokenStream/Field code, so nobody needs to change his code. But somebody that actively wants to push tokens can now do this. If he wants to do this currently he has no chance without heavy buffering. So the push API will be very expert and the current TokenStreams is just a user of this API. > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844509#action_12844509 ] Shai Erera commented on LUCENE-2309: bq. We should really move back to JIRA / devlist for such discussions +1 !! I also find it very hard to track so many sources of discussions (JIRA, java-dev, java-user, general, and now IRC). Also IRC is not logged/archived and searchable (I think?) which makes it impossible to trace back a discussion, and/or randomly stumble upon it in Google. I'd like to donate my two cents here - we've just recently changed the TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only now the API has changed slightly. The proposals here, w/ the AttConsumer/Acceptor, that IW will delegate itself to a Field, so the Field will call back to IW seems too much complicated to me. Users that write Analyzers/TokenStreams/AttributeSources, should not care how they are indexed/stored etc. Forcing them to implement this push logic to IW seems to me like a real unnecessary overhead and complexity. And having the Field control the flow of indexing seems also dangerous ... might expose Lucene to lots of bugs by users. Today when IW controls it, it's one place to look for, but tomorrow when Field will control it, where do we look? In the app's custom Field code? In IW's atts consuming methods? Will the Field also control how stored fields are added? Or only AttributeSourced ones? Maybe I need to get used to this change, but currently it looks wrong to reverse the control flow. Maybe in principle the DocInverter now accepts tokens from IW, and so it looks as if we can pass it to the Field (as IW's AttAcceptor), but still the concept is different. We (IW) control the indexing flow, and not the user. I also may not understand what will that give to users. Shouldn't users get enough flexibility w/ the current API and the Flex (once out) stuff? Do they really need to be bothered w/ pushing tokens to IW? > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844500#action_12844500 ] Michael McCandless commented on LUCENE-2309: {quote} bq. Actually, TokenStream is already AttrSource + incrementing, so it seems like the right start... But that binds the Indexer to a tokenstream which is unnecessary IMO. What if I want to implement something aside the TokenStream delegator API? {quote} True, but we need at least some way to increment? AttrSource doesn't have that. But I don't think we need reset nor close from TokenStream. Maybe we could factor out an abstract class / interface that TokenStream impls, minus the reset & close methods? Then people could freely use Lucene to index off a foreign analysis chain... > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844498#action_12844498 ] Michael McCandless commented on LUCENE-2309: bq. The idea is to, as Simon proposed, let the docinverter implement something like AttributeAcceptor. This is interesting! It inverts the stack/control flow, but, would continue to use shared attrs. So then somehow the indexer would pass its AttrAcceptor to the field? And the field would have whatever control logic it wants to feed the tokens... > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844489#action_12844489 ] Uwe Schindler commented on LUCENE-2309: --- bq. I could imagine a really simple interface like During lunch an idea evolved: If you look at current DocInverter code, it does not use a consumer-like API. The code just has an add/accept-method that accepts tokens. The idea is to, as Simon proposed, let the docinverter implement something like AttributeAcceptor. But still we must have the attribute api and the acceptor (DocInverter) must always see the same attribute instances (else much time would be spent to each time call getAttribute(...) for each token, if the accept method would take an AttributeSource. The current TokenStream api could get a method taking AttributeAcceptor and simply do a while incrementToken() loop, calling accept() on DocInverter (the AttributeAcceptor). Another approach for users would be to not use the TokenStream API at all and simply call the accept() method for each token. > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844467#action_12844467 ] Robert Muir commented on LUCENE-2309: - Hello, i commented yesterday but did not receive much feedback, so I want to elaborate some more: I suppose what I was trying to mention in my earlier comment here: https://issues.apache.org/jira/browse/LUCENE-2309?focusedCommentId=12844189&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12844189 is that while I really like the new TokenStream API, i would prefer it if we thought about making this flexible enough to support "different paradigms", including perhaps something that looks a lot like the old TokenStream API. The reason is, I notice a lot of existing code still under this old API, and I think that in some cases, perhaps its easier to work with, even if you aren't a new user. I definitely think for newer users the old API might have some advantages. I think its useful to consider supporting such an API, perhaps as an extension in contrib/analyzers, even if its not as fast or flexible as the new API, perhaps the tradeoff of speed and flexibility would be worth the ease for newer users. > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844464#action_12844464 ] Simon Willnauer commented on LUCENE-2309: - bq. [Carrying over discussions on IRC with Chris Male & Uwe...] That make it very hard to participate. I can not afford to read through all IRC stuff and I don't get the chance to participate directly unless I watch IRC constantly. We should really move back to JIRA / devlist for such discussions. There is too much going on in IRC. {quote} Actually, TokenStream is already AttrSource + incrementing, so it seems like the right start... {quote} But that binds the Indexer to a tokenstream which is unnecessary IMO. What if I want to implement something aside the TokenStream delegator API? > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844450#action_12844450 ] Michael McCandless commented on LUCENE-2309: bq. The IndexWriter or rather DocInverterPerField are simply an attribute consumer. None of them needs to know about Analyzer or TokenStream at all. Neither needs the analyzer to iterate over tokens. [Carrying over discussions on IRC with Chris Male & Uwe...] Actually, TokenStream is already AttrSource + incrementing, so it seems like the right start... However, the .reset() method is redundant from indexer's standpoint -- ie when indexer calls Field.getTokenStream (say) whatever init'ing / reset'ing should already have be done by that method by the time it returns the TokenStream. Also, .close and .end are redundant -- seems like we should only have .end (few token streams do anything in .close...). But coalescing those two would be a good chunk of work at this point :) Or maybe we make a .finish that simply both by default ;) Finally, indexer doesn't really need a Document; it just needs something abstract that's provides an iterator over all fields that need indexing (and separately, storing). > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844420#action_12844420 ] Simon Willnauer commented on LUCENE-2309: - The IndexWriter or rather DocInverterPerField are simply an attribute consumer. None of them needs to know about Analyzer or TokenStream at all. Neither needs the analyzer to iterate over tokens. The IndexWriter should instead implement an interface or use a class that is called for each successful "incrementToken()" no matter how this increment is implemented. I could imagine a really simple interface like {code} interface AttributeConsumer { void setAttributeSource(AttributeSource src); void next(); void end(); } {code} IW would then pass itself or an istance it uses (DocInverterPerField) to an API expecting such a consumer like: {code} field.consume(this); {code} or something similar. That way we have not dependency on whatever Attribute producer is used. The default implementation is for sure based on an analyzer / tokenstream and alternatives can be exposed via expert API. Even Backwards compatibility could be solved that way easily. bq. Only tests would rely on the analyzers module. I think that's OK? core itself would have no dependence. +1 test dependencies should not block modularization, its just about configuring the classpath though! > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844214#action_12844214 ] Shai Erera commented on LUCENE-2309: Today when I "ant test-core" contrib is not built, and I like it. Also "ant test-backwards" will be affected I think ... I think if core does not depend on contrib, its tests shouldn't also. It's weird if it will. > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844207#action_12844207 ] Michael McCandless commented on LUCENE-2309: {quote} bq. Or remove them entirely (but, then, core tests will need to use contrib analyzers for their testing)... For that I proposed to have a default TestAttributeSourceImpl, which does whitespace tokenization or something. If other 'core' tests need something else, we can write specific AttributeSources for them. I hope we can avoid introducing any dependency of core on contrib. {quote} Only tests would rely on the analyzers module. I think that's OK? core itself would have no dependence. > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844189#action_12844189 ] Robert Muir commented on LUCENE-2309: - bq. For that I proposed to have a default TestAttributeSourceImpl We need a bit more than AttributeSource, at least if the text has more than one token, it must at least support incrementToken() We could try factoring out incrementToken() and end() from TokenStream to create a "more-generic" interface, but really, there isn't much more to Tokenstream (except close and reset) At the same time, while I really like the decorator API of TokenStream, it should be easier for someone to use a completely different API, perhaps one that feels less like you are writing a finite-state machine by hand (capture/restoreState, etc) > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844182#action_12844182 ] Shai Erera commented on LUCENE-2309: bq. Or remove them entirely (but, then, core tests will need to use contrib analyzers for their testing)... For that I proposed to have a default TestAttributeSourceImpl, which does whitespace tokenization or something. If other 'core' tests need something else, we can write specific AttributeSources for them. I hope we can avoid introducing any dependency of core on contrib. > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844180#action_12844180 ] Robert Muir commented on LUCENE-2309: - {quote} Or remove them entirely (but, then, core tests will need to use contrib analyzers for their testing)... {quote} I agree, lets not get caught up on how our tests run from build.xml! We should decouple analysis from IW as much as possible, at least to support more flexible analysis: e.g. someone doesnt want to use the TokenStream concept at all, for example. I don't really have any opinion practically where all the analyzers go, but I do agree it would be nice if they were in one place. For example, in contrib/analyzers now we have analyzers by language, and in most cases, users should really be looking at EnglishAnalyzer as their "default" instead of StandardAnalyzer for English language, as it does Porter stemming, too. > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844177#action_12844177 ] Michael McCandless commented on LUCENE-2309: bq. Would this mean that after that we can move all of core Analyzers to contrib/analyzers Yes, though, I think that's orthogonal (can and should be separately done, anyway). bq. making one step towards getting them completely out of Lucene and into their own Apache project? We may simply "standardize" on contrib/analyzers as the one place, instead of a new [sub-]project. To be discussed... but we really do need one place. bq. That way, we can keep in core only the AttributeSource and accompanying classes, and really allow people to pass AttributeSource which is not even an Analyzer (like you said). We can move the specific Analyzer tests to contrib/analyzers as well. The other tests in core, who don't care about analysis, can use a src/test specific AttributeSource, like TestAttributeSourceImpl ... Right. bq. I'm thinking - it's ok for contrib to depend on core but not the other way around. I agree. bq. It will however take out of core a useful feature for new users which allows fast bootstrap. Well.. I suspect with this change users would not typically use lucene-core alone. Ie, they'd get analyzers and queryparser (if we also move it out as its own module). bq. That won't be the case when analyzers move out of Lucene entirely, but while they are in Lucene, we'll force everyone to download contrib/analyzers as well. I think a single source for all analyzers will be a great step forwards for users. bq. So maybe we keep in core only Standard, or maybe even something simpler, again, for easy bootstrapping (like Whitespace + lowercase). Or remove them entirely (but, then, core tests will need to use contrib analyzers for their testing)... > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844155#action_12844155 ] Shai Erera commented on LUCENE-2309: Would this mean that after that we can move all of core Analyzers to contrib/analyzers, making one step towards getting them completely out of Lucene and into their own Apache project? That way, we can keep in core only the AttributeSource and accompanying classes, and really allow people to pass AttributeSource which is not even an Analyzer (like you said). We can move the specific Analyzer tests to contrib/analyzers as well. The other tests in core, who don't care about analysis, can use a src/test specific AttributeSource, like TestAttributeSourceImpl ... I'm thinking - it's ok for contrib to depend on core but not the other way around. It will however take out of core a useful feature for new users which allows fast bootstrap. That won't be the case when analyzers move out of Lucene entirely, but while they are in Lucene, we'll force everyone to download contrib/analyzers as well. So maybe we keep in core only Standard, or maybe even something simpler, again, for easy bootstrapping (like Whitespace + lowercase). This is just a thought. > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843772#action_12843772 ] Michael McCandless commented on LUCENE-2309: We can't use attr source directly -- we'd need to factor out the minimal API from TokenStream (.incrToken & .end?) and use that (thanks Robert!). > Fully decouple IndexWriter from analyzers > - > > Key: LUCENE-2309 > URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless > > IndexWriter only needs an AttributeSource to do indexing. > Yet, today, it interacts with Field instances, holds a private > analyzers, invokes analyzer.reusableTokenStream, has to deal with a > wide variety (it's not analyzed; it is analyzed but it's a Reader, > String; it's pre-analyzed). > I'd like to have IW only interact with attr sources that already > arrived with the fields. This would be a powerful decoupling -- it > means others are free to make their own attr sources. > They need not even use any of Lucene's analysis impls; eg they can > integrate to other things like [OpenPipeline|http://www.openpipeline.org]. > Or make something completely custom. > LUCENE-2302 is already a big step towards this: it makes IW agnostic > about which attr is "the term", and only requires that it provide a > BytesRef (for flex). > Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the > FieldType knows the analyzer to use, then we could simply create a > getAttrSource() method (say) on it and move all the logic IW has today > onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org