[ 
https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842753#action_12842753
 ] 

Michael McCandless commented on LUCENE-2302:
--------------------------------------------

Patch looks great Uwe!  Great simplification that indexer only deals
in byte[] now for a term.  It's agnostic as to whether those bytes are
utf8 or something else.  And it means analyzer chains can now do
direct binary terms (eg NumericTokenStream).

Some day... we should also try to have indexer not be responsible for
creating the TokenStream.  Ie it should simply receive, always, an
AttrSource for a field that needs to be indexed.  This puts nice
distance b/w indexer core and analysis -- indexer is then fully
agnostic to how that AttrSource came to be.

I see "noncommit" -- can you rename to "nocommit" -- let's try to be
consistent ;)

Maybe reword "The given AttributeSource has no term attribute" -->
"Could not find a term attribute (that implements
TermToBytesAttribute) in the AttributeSource"?

I think we should rename TermsHashPerField's utf8 var (and in the per
thread) -- it's now just bytes, not necessarily utf8.  Maybe termBytes?

When you temporarily override the length of a too-long term, maybe
restore it in a try/finally?
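
A minimal sketch of the try/finally idea, not the patch's actual code. TermBuffer, TooLongTermDemo, MAX_PREFIX and reportTooLongTerm are hypothetical stand-ins invented for illustration. The point is only that the original length is restored even if the code using the shortened term throws.

```java
import java.util.function.Consumer;

// Hypothetical stand-in for a term attribute with an adjustable length.
// This is NOT Lucene's real class; it exists only for this sketch.
final class TermBuffer {
    private final char[] chars;
    private int length;

    TermBuffer(String s) {
        chars = s.toCharArray();
        length = chars.length;
    }

    int length() {
        return length;
    }

    void setLength(int newLength) {
        if (newLength < 0 || newLength > chars.length) {
            throw new IllegalArgumentException("length out of range: " + newLength);
        }
        length = newLength;
    }

    @Override
    public String toString() {
        return new String(chars, 0, length);
    }
}

final class TooLongTermDemo {
    // Hypothetical prefix size, chosen only for the example.
    static final int MAX_PREFIX = 5;

    // Temporarily shortens the term to build a message, then restores it.
    static void reportTooLongTerm(TermBuffer term, Consumer<String> sink) {
        final int savedLength = term.length();
        try {
            term.setLength(Math.min(savedLength, MAX_PREFIX));
            sink.accept("term too long, prefix: '" + term + "'");
        } finally {
            // Runs on normal return and when sink.accept throws.
            term.setLength(savedLength);
        }
    }
}
```

Without the finally block, an exception from the sink would leave the term truncated for whoever looks at the attribute next.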


> Replacement for TermAttribute+Impl with extended capabilities (byte[] 
> support, CharSequence, Appendable)
> --------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2302
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2302
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: Flex Branch
>            Reporter: Uwe Schindler
>             Fix For: Flex Branch
>
>         Attachments: LUCENE-2302.patch, LUCENE-2302.patch, LUCENE-2302.patch, 
> LUCENE-2302.patch
>
>
> For flexible indexing, terms can be simple byte[] arrays, while the current 
> TermAttribute only supports char[]. This is fine for plain text, but e.g. 
> NumericTokenStream should directly work on the byte[] array.
> Also, TermAttribute lacks some interfaces that would make it simpler for 
> users to work with them: Appendable and CharSequence.
> I propose to create a new interface "CharTermAttribute" with a clean new API 
> that concentrates on CharSequence and Appendable.
> The implementation class will simply support the old and new interface 
> working on the same term buffer. DEFAULT_ATTRIBUTE_FACTORY will take care of 
> this. So if somebody adds a TermAttribute, he will get an implementation 
> class that can be also used as CharTermAttribute. As both attributes create 
> the same impl instance both calls to addAttribute are equal. So a TokenFilter 
> that adds CharTermAttribute to the source will work with the same instance as 
> the Tokenizer that requested the (deprecated) TermAttribute.
> To also support byte[]-only terms, as Collation or NumericField need, a 
> separate getter-only interface will be added that returns a reusable 
> BytesRef, e.g. BytesRefGetterAttribute. The default implementation class will 
> also support this interface. For backwards compatibility with old 
> self-made TermAttribute implementations, the indexer will check with 
> hasAttribute() whether the BytesRef getter interface is there and, if not, will 
> wrap an old-style TermAttribute (a deprecated wrapper class will be provided): 
> new BytesRefGetterAttributeWrapper(TermAttribute), which is then used by the 
> indexer.
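
The backwards-compatibility check described in the issue text above could look roughly like the sketch below. It is not Lucene code: the interfaces are simplified stand-ins that borrow the names proposed in the description, SimpleAttributeSource and IndexerSketch are invented for illustration, a plain byte[] stands in for BytesRef, and UTF-8 encoding in the wrapper is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins; NOT the real Lucene interfaces.
interface Attribute {}

// Old-style, char-based term attribute.
interface TermAttribute extends Attribute {
    String term();
}

// Getter-only view of the term bytes (byte[] stands in for BytesRef here).
interface BytesRefGetterAttribute extends Attribute {
    byte[] termBytes();
}

// Wrapper that adapts an old-style TermAttribute; assumes UTF-8 encoding.
final class BytesRefGetterAttributeWrapper implements BytesRefGetterAttribute {
    private final TermAttribute delegate;

    BytesRefGetterAttributeWrapper(TermAttribute delegate) {
        this.delegate = delegate;
    }

    @Override
    public byte[] termBytes() {
        return delegate.term().getBytes(StandardCharsets.UTF_8);
    }
}

// Hypothetical, minimal attribute registry keyed by interface.
final class SimpleAttributeSource {
    private final Map<Class<? extends Attribute>, Attribute> attributes = new HashMap<>();

    <T extends Attribute> void put(Class<T> type, T impl) {
        attributes.put(type, impl);
    }

    boolean hasAttribute(Class<? extends Attribute> type) {
        return attributes.containsKey(type);
    }

    <T extends Attribute> T getAttribute(Class<T> type) {
        return type.cast(attributes.get(type));
    }
}

final class IndexerSketch {
    // Prefer the bytes getter; otherwise wrap an old-style TermAttribute.
    static BytesRefGetterAttribute bytesGetterFor(SimpleAttributeSource source) {
        if (source.hasAttribute(BytesRefGetterAttribute.class)) {
            return source.getAttribute(BytesRefGetterAttribute.class);
        }
        if (source.hasAttribute(TermAttribute.class)) {
            return new BytesRefGetterAttributeWrapper(source.getAttribute(TermAttribute.class));
        }
        throw new IllegalArgumentException(
            "Could not find a term attribute in the AttributeSource");
    }
}
```

With this shape the indexer only ever talks to the bytes getter, and the wrapper is the single place that knows about the old char-based attribute.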

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Reply via email to