[jira] [Commented] (LUCENE-2309) Fully decouple IndexWriter from analyzers

Robert Muir (JIRA) Sun, 17 Jul 2011 07:23:25 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066650#comment-13066650
 ]


Robert Muir commented on LUCENE-2309:
-------------------------------------

bq. haven't addressed the OffSetGap issue

I actually think these gaps are what we should address first. Here's a rough 
idea:
* Remove the offset/position increment gap from Analyzer.
* Instead, for multivalued fields, the field handles this internally: it 
returns a MultiValuedTokenStream (?) that does the 'concatenation' and the 
offset/position increasing between the field's values itself. IndexWriter just 
sees one TokenStream for the field and doesn't know about any of this, i.e. it 
just consumes positions and offsets.
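
To make the second bullet concrete, here is a rough standalone sketch of the concatenation step. It deliberately does not use Lucene's TokenStream/attribute API: Token, Value and MultiValuedTokens are hypothetical stand-ins invented for illustration, and the gap handling is an assumption about how such a stream could behave, not a description of existing code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; none of these are Lucene classes.
final class Token {
    final String term;
    final int positionIncrement; // distance from the previous token
    final int startOffset;       // relative to the start of its own value
    final int endOffset;

    Token(String term, int positionIncrement, int startOffset, int endOffset) {
        this.term = term;
        this.positionIncrement = positionIncrement;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

// One value of a multivalued field: its text length plus its analyzed tokens.
final class Value {
    final int length;
    final List<Token> tokens;

    Value(int length, Token... tokens) {
        this.length = length;
        this.tokens = Arrays.asList(tokens);
    }
}

final class MultiValuedTokens {
    // Presents all values as one token sequence, applying the gaps between
    // values so that the consumer never has to know the field is multivalued.
    static List<Token> concatenate(List<Value> values, int positionGap, int offsetGap) {
        List<Token> out = new ArrayList<>();
        int offsetBase = 0;          // where the current value starts overall
        int pendingPositionGap = 0;  // gap still owed to the next emitted token
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                pendingPositionGap += positionGap;
            }
            for (Token t : values.get(i).tokens) {
                out.add(new Token(t.term,
                                  t.positionIncrement + pendingPositionGap,
                                  offsetBase + t.startOffset,
                                  offsetBase + t.endOffset));
                pendingPositionGap = 0;
            }
            offsetBase += values.get(i).length + offsetGap;
        }
        return out;
    }
}
```

For example, with the values "a b" and "c", a position gap of 100 and an offset gap of 1, the token "c" comes out with position increment 101 and offsets 4-5, and the consumer only ever sees a single sequence.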

To do this, I think there could be problems if the analyzer does not reuse, as 
there should be one set of attributes exposed to the indexer across the whole 
multivalued field.

So, to solve this problem first: I think we should remove 
Analyzer.tokenStream so that all analyzers are reusable, and push 
ReusableAnalyzerBase's API down into Analyzer. We want to make this improvement 
anyway to solve that trap.
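
As a rough illustration of what "all analyzers are reusable" could look like, here is a standalone sketch of the pattern: subclasses only say how to build the components, and the base class caches one set per thread and resets it for each new input. Components and ReusableAnalyzerSketch are hypothetical names, String stands in for a Reader, and this is not the actual ReusableAnalyzerBase code.

```java
// Hypothetical stand-in for a tokenizer/filter chain; not a Lucene class.
final class Components {
    private String input;

    // Points the existing chain at new input instead of building a new chain.
    void reset(String newInput) {
        this.input = newInput;
    }

    String input() {
        return input;
    }
}

abstract class ReusableAnalyzerSketch {
    // One cached set of components per thread (an assumption of this sketch).
    private final ThreadLocal<Components> cached = new ThreadLocal<>();

    // The only method a subclass implements: how to build the chain.
    protected abstract Components createComponents(String fieldName);

    // Always reuses: builds the chain on first use, resets it afterwards.
    public final Components tokenStream(String fieldName, String text) {
        Components components = cached.get();
        if (components == null) {
            components = createComponents(fieldName);
            cached.set(components);
        }
        components.reset(text);
        return components;
    }
}
```

Because tokenStream is final here, there is no non-reusing code path left for a subclass to fall into, and the same components (and so the same attributes) are handed back for every value.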
 

> Fully decouple IndexWriter from analyzers
> -----------------------------------------
>
>                 Key: LUCENE-2309
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2309
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2309.patch
>
>
> IndexWriter only needs an AttributeSource to do indexing.
> Yet, today, it interacts with Field instances, holds a private
> analyzer, invokes analyzer.reusableTokenStream, and has to deal with a
> wide variety of cases (it's not analyzed; it is analyzed but it's a
> Reader or String; it's pre-analyzed).
> I'd like to have IW only interact with attr sources that already
> arrived with the fields.  This would be a powerful decoupling -- it
> means others are free to make their own attr sources.
> They need not even use any of Lucene's analysis impls; eg they can
> integrate to other things like [OpenPipeline|http://www.openpipeline.org].
> Or make something completely custom.
> LUCENE-2302 is already a big step towards this: it makes IW agnostic
> about which attr is "the term", and only requires that it provide a
> BytesRef (for flex).
> Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
> FieldType knows the analyzer to use, then we could simply create a
> getAttrSource() method (say) on it and move all the logic IW has today
> onto there.  (We'd still need existing IW code for back-compat).

