[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844533#action_12844533
 ] 

Robert Muir commented on LUCENE-2309:
-

{quote}
So with the current APIs we cannot get around the requirement to reuse the same 
Attribute instances during the whole indexing without a major speed impact.
{quote}

I agree. I guess I'll try to simplify my concern: maybe we don't necessarily
need something that looks like the old TokenStream API, but I feel it would
be worth our time to think about supporting 'some alternative API' that makes
it easier to work with lots of context across different Tokens.

I personally do not mind how this is done with the capture/restore state API,
but I feel that it's pretty unnatural for many developers, and in the future
folks might want to do more complex analysis (maybe even light POS tagging, etc.)
that requires said context, and we should plan for this.

I feel this wasn't such an issue with the old TokenStream API, but maybe there
is another way to address this potential problem.
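
To make the concern concrete, here is a minimal sketch of one-token lookahead built on capture/restore. It is a toy model, not Lucene code: TermSlot, LookaheadSketch and the "inc" rule are hypothetical, and a "state" here is just the term text, whereas Lucene's AttributeSource captures all attributes.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for a reused term attribute; not a Lucene class. */
final class TermSlot {
    String term;
}

/** Sketch: one token of lookahead built on capture/restore of the shared slot. */
final class LookaheadSketch {
    private final Iterator<String> input;
    private final TermSlot slot = new TermSlot();              // one instance for the whole stream
    private final Deque<String> pending = new ArrayDeque<>();  // captured states not yet emitted

    LookaheadSketch(List<String> tokens) {
        this.input = tokens.iterator();
    }

    // In this toy model a "state" is just the term text.
    private String captureState() { return slot.term; }
    private void restoreState(String state) { slot.term = state; }

    private boolean readInput() {
        if (!input.hasNext()) {
            return false;
        }
        slot.term = input.next();
        return true;
    }

    /** Emits every token; a token directly followed by "inc" is upper-cased. */
    boolean incrementToken() {
        if (!pending.isEmpty()) {
            restoreState(pending.poll());
        } else if (!readInput()) {
            return false;
        }
        String current = captureState();
        if (readInput()) {                 // peeking overwrites the shared slot
            String next = captureState();
            pending.add(next);             // keep the peeked token for the next call
            restoreState(current);
            if (next.equals("inc")) {
                slot.term = current.toUpperCase(Locale.ROOT);
            }
        }
        return true;
    }

    static String run(List<String> tokens) {
        LookaheadSketch stream = new LookaheadSketch(tokens);
        StringBuilder out = new StringBuilder();
        while (stream.incrementToken()) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(stream.slot.term);
        }
        return out.toString();
    }
}
```

Even for a single token of context, the filter has to juggle capture, queue and restore by hand, which is the kind of bookkeeping that gets harder as the required context grows.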

> Fully decouple IndexWriter from analyzers
> -----------------------------------------
>
>         Key: LUCENE-2309
>         URL: https://issues.apache.org/jira/browse/LUCENE-2309
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>    Reporter: Michael McCandless
>
> IndexWriter only needs an AttributeSource to do indexing.
> Yet, today, it interacts with Field instances, holds a private
> analyzer, invokes analyzer.reusableTokenStream, and has to deal with a
> wide variety of cases (it's not analyzed; it is analyzed but it's a Reader
> or a String; it's pre-analyzed).
> I'd like to have IW only interact with attr sources that already
> arrived with the fields.  This would be a powerful decoupling -- it
> means others are free to make their own attr sources.
> They need not even use any of Lucene's analysis impls; eg they can
> integrate to other things like [OpenPipeline|http://www.openpipeline.org].
> Or make something completely custom.
> LUCENE-2302 is already a big step towards this: it makes IW agnostic
> about which attr is "the term", and only requires that it provide a
> BytesRef (for flex).
> Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
> FieldType knows the analyzer to use, then we could simply create a
> getAttrSource() method (say) on it and move all the logic IW has today
> onto there.  (We'd still need existing IW code for back-compat).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844528#action_12844528
 ] 

Uwe Schindler commented on LUCENE-2309:
---

There is one problem that cannot be easily solved (for all proposals here) if we 
want to provide an old-style API that does not require reuse of tokens:
The problem with AttributeProvider is that if we want to support something 
(like rmuir proposed before) that looks like the old "Token next()", we need an 
AttributeProvider that passes the AttributeSource to the indexer on each Token! 
And that would lead to lots of getAttribute() calls, which would slow down 
indexing! So with the current APIs we cannot get around the requirement to 
reuse the same Attribute instances during the whole indexing without a major 
speed impact. This can only be solved with my nice BCEL proxy Attributes, so 
you can exchange the inner attribute impl. Or do it like TokenWrapper in 2.9 
(yes, we can reactivate that API somehow as an ease-of-use addendum).
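
The cost argument can be illustrated with a small sketch. It is a toy model under stated assumptions: MiniAttributeSource, MiniTerm and LookupCostSketch are hypothetical, and a simple counter stands in for the real cost of a getAttribute() lookup.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified attribute container that counts lookups. */
final class MiniAttributeSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();
    int lookups = 0;

    <T> void addAttribute(Class<T> type, T instance) {
        attributes.put(type, instance);
    }

    <T> T getAttribute(Class<T> type) {
        lookups++;
        return type.cast(attributes.get(type));
    }
}

/** Hypothetical term attribute. */
final class MiniTerm {
    String text;
}

final class LookupCostSketch {
    /** Reuse style: look the attribute up once, then mutate the same instance. */
    static int lookupsWithReuse(String... tokens) {
        MiniAttributeSource source = new MiniAttributeSource();
        source.addAttribute(MiniTerm.class, new MiniTerm());
        MiniTerm term = source.getAttribute(MiniTerm.class); // single lookup
        for (String token : tokens) {
            term.text = token; // producer fills the shared instance; consumer reads it
        }
        return source.lookups;
    }

    /** "Token next()" style: each token arrives in its own source. */
    static int lookupsPerToken(String... tokens) {
        int lookups = 0;
        for (String token : tokens) {
            MiniAttributeSource source = new MiniAttributeSource();
            MiniTerm fresh = new MiniTerm();
            fresh.text = token;
            source.addAttribute(MiniTerm.class, fresh);
            source.getAttribute(MiniTerm.class); // the consumer must look it up again
            lookups += source.lookups;
        }
        return lookups;
    }
}
```

With reuse the lookup count stays constant; with a fresh source per token it grows linearly with the number of tokens.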




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844523#action_12844523
 ] 

Simon Willnauer commented on LUCENE-2309:
-

bq. Then people could freely use Lucene to index off a foreign analysis chain...
That is what I was talking about!

{quote}
I'd like to donate my two cents here - we've just recently changed the 
TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only 
now the API has changed slightly. The proposals here, w/ the 
AttConsumer/Acceptor, in which IW delegates itself to a Field so that the Field 
calls back to IW, seem far too complicated to me. Users that write 
Analyzers/TokenStreams/AttributeSources should not care how they are 
indexed/stored etc. Forcing them to implement this push logic to IW seems to me 
like truly unnecessary overhead and complexity.
{quote}

We can surely hide this implementation completely from field. I consider this 
similar to Collector, which you pass explicitly to the search method if 
you want different behavior. Maybe something like an 
AttributeProducer. I don't think adding this to field makes a lot of sense at 
all, and it is not worth the complexity.

bq. Will the Field also control how stored fields are added? Or only 
AttributeSourced ones?
IMO this is only about inverted fields.

bq. We (IW) control the indexing flow, and not the user.
The user only gets the possibility to exchange the analysis chain but not the 
control flow. The user can already mess around with stuff in incrementToken(); 
the only thing we change / invert is that the indexer does not know about 
TokenStreams anymore. It does not change the control flow, though.
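
The Collector analogy could look roughly like the sketch below. Everything in it is an assumption for illustration: AttributeProducerSketch, WriterSketch and the addField overloads are hypothetical, and plain strings stand in for attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical name from the comment; the single method is an assumption. */
interface AttributeProducerSketch {
    void produce(Consumer<String> sink);
}

/** Toy writer: the caller may pass a producer explicitly, much like a Collector. */
final class WriterSketch {
    final List<String> postings = new ArrayList<>();

    /** Default path: whitespace tokenization chosen on the writer side. */
    void addField(String name, String text) {
        addField(name, sink -> {
            for (String token : text.trim().split("\\s+")) {
                sink.accept(token);
            }
        });
    }

    /** Expert path: the caller swaps the chain; the writer still decides when it runs. */
    void addField(String name, AttributeProducerSketch producer) {
        producer.produce(token -> postings.add(name + ":" + token));
    }

    static String demo() {
        WriterSketch writer = new WriterSketch();
        writer.addField("title", "quick fox");
        writer.addField("tags", sink -> sink.accept("foreign-chain"));
        return String.join(" ", writer.postings);
    }
}
```

Callers who never pass a producer keep the default behavior, and Field does not need to know about any of this.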






[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844516#action_12844516
 ] 

Mark Miller commented on LUCENE-2309:
-

bq.  Also IRC is not logged/archived and searchable (I think?) which makes it 
impossible to trace back a discussion, and/or randomly stumble upon it in 
Google.

Apache's rule is: if it didn't happen on the lists, it didn't happen. IRC is a 
great way for people to communicate and hash stuff out, but it's not necessary 
that you follow it. If you have questions or want further elaboration, just ask. 
No one can expect you to follow IRC, nor is it a valid reference for where 
something was decided. IRC is great - I think having devs discuss there has 
really been a benefit - but the official position is: if it didn't happen on 
the list, it didn't actually happen.




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844515#action_12844515
 ] 

Uwe Schindler commented on LUCENE-2309:
---

bq. I'd like to donate my two cents here - we've just recently changed the 
TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only 
now the API has changed slightly. The proposals here, w/ the 
AttConsumer/Acceptor, in which IW delegates itself to a Field so that the Field 
calls back to IW, seem far too complicated to me. Users that write 
Analyzers/TokenStreams/AttributeSources should not care how they are 
indexed/stored etc. Forcing them to implement this push logic to IW seems to me 
like truly unnecessary overhead and complexity.

The idea was not to change this behaviour, but to also give the user the 
possibility to reverse it. For some TokenStreams it would simplify things 
a lot. The current IndexWriter code works exactly like this:
# DocInverter gets TokenStream
# DocInverter calls reset() -- to be left out and moved to field/analyzer
# DocInverter does while-loop on incrementToken - it iterates. On each Token it 
calls add() on the field consumer
# DocInverter calls end() and updates end offset
# DocInverter calls close() -- to be left out and moved to field/analyzer

The change is simply that step (3) is removed from DocInverter, which then only 
provides the add() method for accepting Tokens. The current while loop is simply 
done in the TokenStream/Field code instead, so nobody needs to change their 
code. But somebody who actively wants to push tokens can now do so; currently 
that is not possible without heavy buffering.

So the push API will be very expert, and the current TokenStream API is just one 
user of it.
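
The steps above can be sketched as follows. This is a toy model, not the real DocInverter: AddOnlyInverter, RecordingStream and pushTo() are hypothetical names, and where exactly end() would end up in the proposal is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical consumer: what DocInverter would look like with step 3 removed. */
final class AddOnlyInverter {
    final List<String> calls;

    AddOnlyInverter(List<String> calls) {
        this.calls = calls;
    }

    void add(String term) { calls.add("add:" + term); }
    void finish() { calls.add("finish"); } // stands in for step 4's end-offset update
}

/** Toy pull-style stream that records its lifecycle calls; not Lucene's TokenStream. */
final class RecordingStream {
    final List<String> calls;
    private final String[] tokens;
    private int pos = -1;

    RecordingStream(List<String> calls, String... tokens) {
        this.calls = calls;
        this.tokens = tokens;
    }

    void reset() { pos = -1; calls.add("reset"); }
    boolean incrementToken() { return ++pos < tokens.length; }
    String term() { return tokens[pos]; }
    void end() { calls.add("end"); }
    void close() { calls.add("close"); }

    /** The while-loop that lives in the indexer today, moved to the stream side. */
    void pushTo(AddOnlyInverter inverter) {
        reset();
        while (incrementToken()) {
            inverter.add(term());
        }
        end();
        inverter.finish();
        close();
    }
}

final class PushFlowDemo {
    /** Existing streams keep working through the convenience loop. */
    static String viaStream() {
        List<String> calls = new ArrayList<>();
        new RecordingStream(calls, "quick", "fox").pushTo(new AddOnlyInverter(calls));
        return String.join(",", calls);
    }

    /** Expert use: push tokens directly, with no stream and no buffering. */
    static String viaDirectPush() {
        List<String> calls = new ArrayList<>();
        AddOnlyInverter inverter = new AddOnlyInverter(calls);
        inverter.add("quick");
        inverter.add("fox");
        inverter.finish();
        return String.join(",", calls);
    }
}
```

Both paths deliver the same add() calls to the consumer; only the direct path skips reset/end/close on a stream.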




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844509#action_12844509
 ] 

Shai Erera commented on LUCENE-2309:


bq. We should really move back to JIRA / devlist for such discussions

+1 !! I also find it very hard to track so many sources of discussions (JIRA, 
java-dev, java-user, general, and now IRC). Also IRC is not logged/archived and 
searchable (I think?) which makes it impossible to trace back a discussion, 
and/or randomly stumble upon it in Google.

I'd like to donate my two cents here - we've just recently changed the 
TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only 
now the API has changed slightly. The proposals here, w/ the 
AttConsumer/Acceptor, in which IW delegates itself to a Field so that the Field 
calls back to IW, seem far too complicated to me. Users that write 
Analyzers/TokenStreams/AttributeSources should not care how they are 
indexed/stored etc. Forcing them to implement this push logic to IW seems to me 
like truly unnecessary overhead and complexity.

And having the Field control the flow of indexing also seems dangerous ... 
it might expose Lucene to lots of bugs by users. Today, when IW controls it, 
there is one place to look, but tomorrow, when Field controls it, where do we 
look? In the app's custom Field code? In IW's atts consuming methods?

Will the Field also control how stored fields are added? Or only 
AttributeSourced ones?

Maybe I need to get used to this change, but currently it looks wrong to 
reverse the control flow. Maybe in principle the DocInverter now accepts tokens 
from IW, and so it looks as if we can pass it to the Field (as IW's 
AttAcceptor), but still the concept is different. We (IW) control the indexing 
flow, and not the user.

I also may not understand what that will give users. Don't users get 
enough flexibility w/ the current API and the Flex (once out) stuff? Do they 
really need to be bothered w/ pushing tokens to IW?




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844500#action_12844500
 ] 

Michael McCandless commented on LUCENE-2309:


{quote}
bq. Actually, TokenStream is already AttrSource + incrementing, so it seems 
like the right start...

But that binds the Indexer to a TokenStream, which is unnecessary IMO. What if I 
want to implement something aside from the TokenStream delegator API?
{quote}

True, but we need at least some way to increment?  AttrSource doesn't have that.

But I don't think we need reset or close from TokenStream.

Maybe we could factor out an abstract class / interface that TokenStream 
implements, minus the reset & close methods?

Then people could freely use Lucene to index off a foreign analysis chain...
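
A minimal sketch of such a factoring, assuming hypothetical names throughout (IncrementableSourceSketch, StreamSketch, ForeignChainSketch, IndexerSketch); term() stands in for reading shared attributes and none of this is existing Lucene API:

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical name: the part of TokenStream the indexer would depend on. */
interface IncrementableSourceSketch {
    boolean incrementToken();
    String term();   // stand-in for reading shared attributes
    void end();
}

/** Lucene-style streams would keep reset/close for their own callers. */
abstract class StreamSketch implements IncrementableSourceSketch {
    void reset() {}
    void close() {}
}

/** A foreign analysis chain only has to provide the three methods above. */
final class ForeignChainSketch implements IncrementableSourceSketch {
    private final Iterator<String> tokens;
    private String current;
    boolean ended = false;

    ForeignChainSketch(List<String> tokens) {
        this.tokens = tokens.iterator();
    }

    public boolean incrementToken() {
        if (!tokens.hasNext()) {
            return false;
        }
        current = tokens.next();
        return true;
    }

    public String term() { return current; }
    public void end() { ended = true; }
}

final class IndexerSketch {
    /** The indexer sees only the narrow interface: no reset, no close. */
    static int index(IncrementableSourceSketch source) {
        int count = 0;
        while (source.incrementToken()) {
            if (!source.term().isEmpty()) {
                count++;
            }
        }
        source.end();
        return count;
    }
}
```

In this shape the indexer never calls reset() or close(); whoever hands over the source is responsible for both.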




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844498#action_12844498
 ] 

Michael McCandless commented on LUCENE-2309:


bq. The idea is to, as Simon proposed, let the docinverter implement something 
like AttributeAcceptor.

This is interesting!  It inverts the stack/control flow, but, would continue to 
use shared attrs.

So then somehow the indexer would pass its AttrAcceptor to the field?  And the 
field would have whatever control logic it wants to feed the tokens...




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844489#action_12844489
 ] 

Uwe Schindler commented on LUCENE-2309:
---

bq. I could imagine a really simple interface like

During lunch an idea evolved:

If you look at current DocInverter code, it does not use a consumer-like API. 
The code just has an add/accept method that accepts tokens. The idea is to, as 
Simon proposed, let the docinverter implement something like AttributeAcceptor. 
But we must still have the attribute API, and the acceptor (DocInverter) must 
always see the same attribute instances (otherwise, if the accept method took an 
AttributeSource, much time would be spent calling getAttribute(...) for each 
token).

The current TokenStream api could get a method taking AttributeAcceptor and 
simply do a while incrementToken() loop, calling accept() on DocInverter (the 
AttributeAcceptor). Another approach for users would be to not use the 
TokenStream API at all and simply call the accept() method for each token.
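
One possible shape for this, as a sketch only: AttributeAcceptor is the name used in this thread, but bind()/accept() and all classes below are assumptions. The acceptor is bound once to the shared attribute instance, so accept() takes no arguments and needs no per-token lookup.

```java
/** Hypothetical shared attribute, reused for every token. */
final class SharedTerm {
    String text;
}

/** Name taken from the thread; bind()/accept() are assumed method shapes. */
interface AttributeAcceptorSketch {
    void bind(SharedTerm term); // called once, before the first token
    void accept();              // called per token; reads the bound instance
}

/** Stand-in for DocInverter. */
final class CountingInverter implements AttributeAcceptorSketch {
    private SharedTerm term;
    int binds = 0;
    final StringBuilder seen = new StringBuilder();

    public void bind(SharedTerm term) {
        this.term = term;
        binds++;
    }

    public void accept() {
        seen.append(term.text).append(' ');
    }

    String summary() {
        return seen.toString().trim() + " binds=" + binds;
    }
}

/** Toy pull-style stream offering the convenience loop. */
final class LoopingStream {
    private final String[] words;
    private final SharedTerm term = new SharedTerm();
    private int pos = -1;

    LoopingStream(String... words) {
        this.words = words;
    }

    boolean incrementToken() {
        if (++pos >= words.length) {
            return false;
        }
        term.text = words[pos];
        return true;
    }

    /** The method taking an acceptor: a plain while-incrementToken loop. */
    void feed(AttributeAcceptorSketch acceptor) {
        acceptor.bind(term);
        while (incrementToken()) {
            acceptor.accept();
        }
    }
}

final class AcceptorDemo {
    static String viaStream(String... words) {
        CountingInverter inverter = new CountingInverter();
        new LoopingStream(words).feed(inverter);
        return inverter.summary();
    }

    /** No TokenStream at all: fill the shared instance and call accept(). */
    static String direct(String... words) {
        CountingInverter inverter = new CountingInverter();
        SharedTerm term = new SharedTerm();
        inverter.bind(term);
        for (String word : words) {
            term.text = word;
            inverter.accept();
        }
        return inverter.summary();
    }
}
```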




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844467#action_12844467
 ] 

Robert Muir commented on LUCENE-2309:
-

Hello, I commented yesterday but did not receive much feedback, so
I want to elaborate some more:

I suppose what I was trying to mention in my earlier comment here:
https://issues.apache.org/jira/browse/LUCENE-2309?focusedCommentId=12844189&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12844189

is that while I really like the new TokenStream API, I would prefer
it if we thought about making this flexible enough to support
"different paradigms", including perhaps something that looks a lot
like the old TokenStream API.

The reason is, I notice a lot of existing code is still under this old API,
and I think that in some cases, perhaps it's easier to work with, even
if you aren't a new user. I definitely think for newer users the old API
might have some advantages.

I think it's useful to consider supporting such an API, perhaps as an extension
in contrib/analyzers. Even if it's not as fast or flexible as the new API,
perhaps the tradeoff of speed and flexibility would be worth the ease
for newer users.





[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844464#action_12844464
 ] 

Simon Willnauer commented on LUCENE-2309:
-

bq. [Carrying over discussions on IRC with Chris Male & Uwe...]

That makes it very hard to participate. I cannot afford to read through all the 
IRC stuff, and I don't get the chance to participate directly unless I watch IRC 
constantly. We should really move back to JIRA / devlist for such discussions. 
There is too much going on in IRC.

{quote}
Actually, TokenStream is already AttrSource + incrementing, so it
seems like the right start...
{quote}

But that binds the Indexer to a TokenStream, which is unnecessary IMO. What if I 
want to implement something aside from the TokenStream delegator API?






[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844450#action_12844450
 ] 

Michael McCandless commented on LUCENE-2309:


bq. The IndexWriter or rather DocInverterPerField are simply an attribute 
consumer. None of them needs to know about Analyzer or TokenStream at all. 
Neither needs the analyzer to iterate over tokens.

[Carrying over discussions on IRC with Chris Male & Uwe...]

Actually, TokenStream is already AttrSource + incrementing, so it
seems like the right start...

However, the .reset() method is redundant from the indexer's standpoint --
ie when the indexer calls Field.getTokenStream (say), whatever init'ing /
reset'ing is needed should already have been done by that method by the
time it returns the TokenStream.

Also, .close and .end are redundant -- seems like we should only have
.end (few token streams do anything in .close...).  But coalescing
those two would be a good chunk of work at this point :) Or maybe we
make a .finish that simply does both by default ;)
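
As a rough sketch of the .finish idea (TokenSource and RecordingSource are
invented stand-ins, not Lucene classes, and checked exceptions such as
IOException are left out to keep it short):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the TokenStream lifecycle methods; not Lucene's class. */
abstract class TokenSource {
  /** Advances to the next token; returns false when the stream is exhausted. */
  abstract boolean incrementToken();

  /** End-of-stream bookkeeping (for example, recording the final offset). */
  void end() {}

  /** Releases resources; most streams would not need to override this. */
  void close() {}

  /** Hypothetical single call for the indexer: runs end(), then close() even if end() throws. */
  final void finish() {
    try {
      end();
    } finally {
      close();
    }
  }
}

/** Demo subclass that records the order of the lifecycle calls. */
final class RecordingSource extends TokenSource {
  final List<String> calls = new ArrayList<>();

  @Override boolean incrementToken() { return false; }
  @Override void end() { calls.add("end"); }
  @Override void close() { calls.add("close"); }
}
```

With something like this the indexer would only ever call finish(), and
streams that care about the distinction could still override end() or close()
separately.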

Finally, the indexer doesn't really need a Document; it just needs
something abstract that provides an iterator over all fields that
need indexing (and separately, storing).
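
One possible shape for that abstraction, as a sketch only: every type name
below (IndexableField, StorableField, FieldProvider, SimpleField, ListDoc,
SketchIndexer) is made up for illustration and none of them is existing
Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical: a field whose tokens the indexer would invert. */
interface IndexableField { String name(); }

/** Hypothetical: a field whose value would be stored. */
interface StorableField { String name(); String value(); }

/** Hypothetical replacement for the concrete Document: two separate iterations. */
interface FieldProvider {
  Iterable<? extends IndexableField> indexableFields();
  Iterable<? extends StorableField> storableFields();
}

/** Simple field that can play both roles. */
final class SimpleField implements IndexableField, StorableField {
  private final String name;
  private final String value;
  SimpleField(String name, String value) { this.name = name; this.value = value; }
  @Override public String name() { return name; }
  @Override public String value() { return value; }
}

/** List-backed provider; an application could supply any other implementation. */
final class ListDoc implements FieldProvider {
  private final List<SimpleField> indexed;
  private final List<SimpleField> stored;
  ListDoc(List<SimpleField> indexed, List<SimpleField> stored) {
    this.indexed = indexed;
    this.stored = stored;
  }
  @Override public Iterable<? extends IndexableField> indexableFields() { return indexed; }
  @Override public Iterable<? extends StorableField> storableFields() { return stored; }
}

/** Stand-in for the indexer: it only sees the two iterators, never a Document. */
final class SketchIndexer {
  final List<String> log = new ArrayList<>();
  void addDocument(FieldProvider doc) {
    for (IndexableField f : doc.indexableFields()) log.add("index:" + f.name());
    for (StorableField f : doc.storableFields()) log.add("store:" + f.name() + "=" + f.value());
  }
}
```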





[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844420#action_12844420
 ] 

Simon Willnauer commented on LUCENE-2309:
-

IndexWriter, or rather DocInverterPerField, is simply an attribute consumer. 
Neither needs to know about Analyzer or TokenStream at all, and neither needs 
the analyzer to iterate over tokens. The IndexWriter should instead implement 
an interface, or use a class, that is called for each successful 
"incrementToken()" no matter how this increment is implemented.

I could imagine a really simple interface like
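{code}
interface AttributeConsumer {

  // Presumably called once, before the first token, so the consumer can
  // look up the attributes it needs.
  void setAttributeSource(AttributeSource src);

  // Called after each successful increment; the attributes hold the current token.
  void next();

  // Called once after the last token.
  void end();

}
{code}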

IW would then pass itself or an instance it uses (DocInverterPerField) to an API 
expecting such a consumer like:

{code}
field.consume(this);
{code}

or something similar. That way we have no dependency on whatever attribute 
producer is used. The default implementation would of course be based on an 
analyzer / TokenStream, and alternatives can be exposed via an expert API. Even 
backwards compatibility could be solved easily that way.
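
To show how the push model might fit together end to end, here is a
self-contained sketch. It is only an illustration: TermSource is a tiny
stand-in for AttributeSource (so the consumer interface takes TermSource
rather than the real class), and WhitespaceField and CollectingConsumer are
invented names for the producer and for the role DocInverterPerField would
play.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for AttributeSource with a single reused "term" attribute; hypothetical. */
final class TermSource {
  String term; // overwritten for every token, like a reused Attribute instance
}

/** The proposed consumer interface, adapted to the stand-in type. */
interface AttributeConsumer {
  void setAttributeSource(TermSource src);
  void next();
  void end();
}

/** Hypothetical producer: a field that pushes whitespace-separated tokens. */
final class WhitespaceField {
  private final String text;
  WhitespaceField(String text) { this.text = text; }

  void consume(AttributeConsumer consumer) {
    TermSource src = new TermSource();
    consumer.setAttributeSource(src);   // hand over the attributes once
    for (String t : text.trim().split("\\s+")) {
      if (t.isEmpty()) continue;        // blank input yields a single empty string
      src.term = t;
      consumer.next();                  // one call per produced token
    }
    consumer.end();
  }
}

/** Plays the indexer's role: it never sees an Analyzer or a TokenStream. */
final class CollectingConsumer implements AttributeConsumer {
  final List<String> terms = new ArrayList<>();
  boolean ended;
  private TermSource src;

  @Override public void setAttributeSource(TermSource src) { this.src = src; }
  @Override public void next() { terms.add(src.term); }
  @Override public void end() { ended = true; }
}
```

The consumer only ever touches the attribute holder it was given, so any
producer (analyzer-based or not) could drive it.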

bq. Only tests would rely on the analyzers module. I think that's OK? core 
itself would have no dependence.
+1. Test dependencies should not block modularization; it's just a matter of 
configuring the classpath!






[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844214#action_12844214
 ] 

Shai Erera commented on LUCENE-2309:


Today when I run "ant test-core", contrib is not built, and I like that. "ant 
test-backwards" would be affected too, I think. If core does not depend on 
contrib, its tests shouldn't either; it would be weird if they did.




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844207#action_12844207
 ] 

Michael McCandless commented on LUCENE-2309:


{quote}

bq. Or remove them entirely (but, then, core tests will need to use contrib 
analyzers for their testing)...

For that I proposed to have a default TestAttributeSourceImpl, which does 
whitespace tokenization or something. If other 'core' tests need something 
else, we can write specific AttributeSources for them. I hope we can avoid 
introducing any dependency of core on contrib.
{quote}

Only tests would rely on the analyzers module.  I think that's OK?  core itself 
would have no dependence.




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844189#action_12844189
 ] 

Robert Muir commented on LUCENE-2309:
-

bq. For that I proposed to have a default TestAttributeSourceImpl

We need a bit more than AttributeSource: if the text has more than 
one token, it must at least support incrementToken().

We could try factoring out incrementToken() and end() from
TokenStream to create a "more-generic" interface, but really,
there isn't much more to TokenStream (except close and reset).

At the same time, while I really like the decorator API of 
TokenStream, it should be easier for someone to use a completely
different API, perhaps one that feels less like you are writing
a finite-state machine by hand (capture/restoreState, etc).
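
For readers who have not written such a filter, the sketch below shows the
state-machine flavor on a deliberately tiny example: a decorator that emits
every token twice. All three classes are invented stand-ins; in particular
SharedTerm.captureState/restoreState only mimic the idea of the real
capture/restore calls and use a plain String as the "state".

```java
import java.util.Iterator;
import java.util.List;

/** Stand-in for an attribute source with one shared, mutable term; hypothetical. */
final class SharedTerm {
  String term;
  String captureState() { return term; }            // snapshot of the current token
  void restoreState(String state) { term = state; } // put a snapshot back
}

/** Emits the tokens of a list through the shared attribute. */
final class ListStream {
  final SharedTerm attrs = new SharedTerm();
  private final Iterator<String> it;
  ListStream(List<String> tokens) { this.it = tokens.iterator(); }

  boolean incrementToken() {
    if (!it.hasNext()) return false;
    attrs.term = it.next();
    return true;
  }
}

/** Decorator that emits each input token twice, tracking its own state by hand. */
final class RepeatFilter {
  private final ListStream in;
  private String saved;    // captured state of the token to repeat
  private boolean pending; // true when the repeat still has to be emitted

  RepeatFilter(ListStream in) { this.in = in; }
  SharedTerm attrs() { return in.attrs; }

  boolean incrementToken() {
    if (pending) {                     // state 2: replay the captured token
      in.attrs.restoreState(saved);
      pending = false;
      return true;
    }
    if (!in.incrementToken()) return false;
    saved = in.attrs.captureState();   // state 1: pass through and remember
    pending = true;
    return true;
  }
}
```

Even this trivial filter needs two fields just to remember where it is between
calls; anything that needs context across several tokens grows more states.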





[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844182#action_12844182
 ] 

Shai Erera commented on LUCENE-2309:


bq. Or remove them entirely (but, then, core tests will need to use contrib 
analyzers for their testing)...

For that I proposed to have a default TestAttributeSourceImpl, which does 
whitespace tokenization or something. If other 'core' tests need something 
else, we can write specific AttributeSources for them. I hope we can avoid 
introducing any dependency of core on contrib.
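
A minimal sketch of what such a test-only source could look like. Only the
class name comes from the proposal above; the fields, the method names and the
offset handling are assumptions, and the class deliberately depends on nothing
outside the JDK.

```java
/** Hypothetical test-only token source that splits on whitespace. */
final class TestAttributeSourceImpl {
  private final String text;
  private int pos = 0;

  // Stand-ins for attributes; the same fields are overwritten for every token.
  String term;
  int startOffset;
  int endOffset;

  TestAttributeSourceImpl(String text) { this.text = text; }

  /** Advances to the next whitespace-delimited token; false when none is left. */
  boolean incrementToken() {
    // Skip leading whitespace.
    while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
    if (pos == text.length()) return false;
    startOffset = pos;
    // Consume the token itself.
    while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
    endOffset = pos;
    term = text.substring(startOffset, endOffset);
    return true;
  }

  /** Assumed end-of-stream behavior: both offsets point at the end of the text. */
  void end() {
    startOffset = text.length();
    endOffset = text.length();
  }
}
```

Core tests that need different tokenization could use their own small sources
in the same style.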




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844180#action_12844180
 ] 

Robert Muir commented on LUCENE-2309:
-

{quote}
Or remove them entirely (but, then, core tests will need to use
contrib analyzers for their testing)...
{quote}

I agree, let's not get caught up on how our tests run from build.xml!
We should decouple analysis from IW as much as possible, at least to support 
more flexible analysis: e.g. someone who doesn't want to use the TokenStream 
concept at all.

I don't really have an opinion on where, in practice, all the analyzers go, 
but I do agree it would be nice if they were in one place. For example, in 
contrib/analyzers we now have analyzers by language, and in most cases users 
should really be looking at EnglishAnalyzer as their "default" for English 
instead of StandardAnalyzer, as it does Porter stemming, too.





[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844177#action_12844177
 ] 

Michael McCandless commented on LUCENE-2309:


bq. Would this mean that after that we can move all of core Analyzers to 
contrib/analyzers

Yes, though, I think that's orthogonal (can and should be separately
done, anyway).

bq. making one step towards getting them completely out of Lucene and into 
their own Apache project?

We may simply "standardize" on contrib/analyzers as the one place,
instead of a new [sub-]project.  To be discussed... but we really do
need one place.

bq. That way, we can keep in core only the AttributeSource and accompanying 
classes, and really allow people to pass AttributeSource which is not even an 
Analyzer (like you said).  We can move the specific Analyzer tests to 
contrib/analyzers as well. The other tests in core, who don't care about 
analysis, can use a src/test specific AttributeSource, like 
TestAttributeSourceImpl ...

Right.

bq. I'm thinking - it's ok for contrib to depend on core but not the other way 
around.

I agree.

bq. It will however take out of core a useful feature for new users which 
allows fast bootstrap.

Well.. I suspect with this change users would not typically use
lucene-core alone.  Ie, they'd get analyzers and queryparser (if we
also move it out as its own module).

bq. That won't be the case when analyzers move out of Lucene entirely, but 
while they are in Lucene, we'll force everyone to download contrib/analyzers as 
well.

I think a single source for all analyzers will be a great step
forwards for users.

bq. So maybe we keep in core only Standard, or maybe even something simpler, 
again, for easy bootstrapping (like Whitespace + lowercase).

Or remove them entirely (but, then, core tests will need to use
contrib analyzers for their testing)...





[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844155#action_12844155
 ] 

Shai Erera commented on LUCENE-2309:


Would this mean that after that we can move all of core Analyzers to 
contrib/analyzers, making one step towards getting them completely out of 
Lucene and into their own Apache project?

That way, we can keep in core only the AttributeSource and accompanying 
classes, and really allow people to pass AttributeSource which is not even an 
Analyzer (like you said). We can move the specific Analyzer tests to 
contrib/analyzers as well. The other tests in core, who don't care about 
analysis, can use a src/test specific AttributeSource, like 
TestAttributeSourceImpl ...

I'm thinking - it's ok for contrib to depend on core but not the other way 
around. It will however take out of core a useful feature for new users which 
allows fast bootstrap. That won't be the case when analyzers move out of Lucene 
entirely, but while they are in Lucene, we'll force everyone to download 
contrib/analyzers as well. So maybe we keep in core only Standard, or maybe 
even something simpler, again, for easy bootstrapping (like Whitespace + 
lowercase).

This is just a thought.




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843772#action_12843772
 ] 

Michael McCandless commented on LUCENE-2309:


We can't use attr source directly -- we'd need to factor out
the minimal API from TokenStream (.incrToken & .end?) and
use that (thanks Robert!).
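
Roughly, and only as a sketch (MinimalTokenSource, ListTokenSource and
MiniIndexer are invented names, and a single String field stands in for the
attributes), the factored-out contract and the indexer loop over it could look
like:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical minimal contract: attributes plus incrementToken() and end(). */
abstract class MinimalTokenSource {
  String term; // stands in for the attributes the indexer would read

  /** Advances to the next token; returns false when there are no more. */
  abstract boolean incrementToken();

  /** Called once by the indexer after the last token. */
  void end() {}
}

/** Simple list-backed implementation, with no analyzer involved. */
final class ListTokenSource extends MinimalTokenSource {
  private final Iterator<String> it;
  boolean ended;

  ListTokenSource(List<String> tokens) { this.it = tokens.iterator(); }

  @Override boolean incrementToken() {
    if (!it.hasNext()) return false;
    term = it.next();
    return true;
  }

  @Override void end() { ended = true; }
}

/** Stand-in for the indexer: it relies only on the minimal contract. */
final class MiniIndexer {
  static List<String> invert(MinimalTokenSource source) {
    List<String> terms = new ArrayList<>();
    while (source.incrementToken()) {
      terms.add(source.term);
    }
    source.end();
    return terms;
  }
}
```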
