[
https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Uwe Schindler updated LUCENE-2302:
----------------------------------
Description:
For flexible indexing, terms can be simple byte[] arrays, while the current
TermAttribute only supports char[]. This is fine for plain text, but e.g.
NumericTokenStream should work directly on the byte[] array.
TermAttribute also lacks some interfaces that would make it simpler for users
to work with: Appendable and CharSequence.
I propose to create a new interface "CharTermAttribute" with a clean new API
that concentrates on CharSequence and Appendable.
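The proposed interface could look roughly like the following self-contained sketch. The interface name and its CharSequence/Appendable focus are from this proposal; the exact method set and the minimal backing implementation are illustrative assumptions only:

```java
// Sketch of the proposed attribute (assumption: the exact methods are
// illustrative). The term is readable as a CharSequence and writable via
// Appendable, with direct buffer access for efficient tokenizers.
interface CharTermAttribute extends CharSequence, Appendable {
  char[] buffer();                             // direct access to the term buffer
  CharTermAttribute setEmpty();                // reset length to 0 for reuse
  CharTermAttribute append(CharSequence csq);  // narrowed: no IOException
  CharTermAttribute append(char c);
}

// Minimal backing implementation, for illustration only.
class CharTermAttributeImpl implements CharTermAttribute {
  private char[] buf = new char[16];
  private int len = 0;

  // Double the buffer when the requested length exceeds the capacity.
  private void grow(int newLen) {
    if (newLen > buf.length) {
      char[] bigger = new char[Math.max(newLen, buf.length * 2)];
      System.arraycopy(buf, 0, bigger, 0, len);
      buf = bigger;
    }
  }

  public char[] buffer() { return buf; }
  public CharTermAttribute setEmpty() { len = 0; return this; }

  // CharSequence
  public int length() { return len; }
  public char charAt(int index) { return buf[index]; }
  public CharSequence subSequence(int start, int end) {
    return new String(buf, start, end - start);
  }
  public String toString() { return new String(buf, 0, len); }

  // Appendable
  public CharTermAttributeImpl append(CharSequence csq) {
    return append(csq, 0, csq.length());
  }
  public CharTermAttributeImpl append(CharSequence csq, int start, int end) {
    grow(len + (end - start));
    for (int i = start; i < end; i++) buf[len++] = csq.charAt(i);
    return this;
  }
  public CharTermAttributeImpl append(char c) {
    grow(len + 1);
    buf[len++] = c;
    return this;
  }
}
```

Because the attribute is a CharSequence it can be handed directly to regex matching or StringBuilder without copying, and because it is Appendable, filters can build the term incrementally.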
The implementation class will simply support both the old and the new
interface, working on the same term buffer. DEFAULT_ATTRIBUTE_FACTORY will
take care of this: if somebody adds a TermAttribute, they will get an
implementation class that can also be used as CharTermAttribute. As both
attributes create the same impl instance, both calls to addAttribute are
equal. So a TokenFilter that adds CharTermAttribute to the source will work
with the same instance as the Tokenizer that requested the (deprecated)
TermAttribute.
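The shared-instance behavior can be illustrated with a toy attribute source. All names below are simplified stand-ins for Lucene's AttributeSource/AttributeFactory machinery, not the actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for an AttributeSource whose factory maps both the
// deprecated and the new interface onto one shared implementation.
class ToyAttributeSource {
  interface TermAttribute {}       // stand-in for the deprecated char[] view
  interface CharTermAttribute {}   // stand-in for the proposed new view

  // One impl class implements both views, like the proposed default impl.
  static class TermAttributeImpl implements TermAttribute, CharTermAttribute {}

  private final Map<Class<?>, Object> attrs = new HashMap<>();

  // addAttribute registers the same impl under both interface keys, so a
  // later request for either interface returns the identical instance.
  @SuppressWarnings("unchecked")
  <A> A addAttribute(Class<A> clazz) {
    Object existing = attrs.get(clazz);
    if (existing != null) return (A) existing;
    TermAttributeImpl impl = new TermAttributeImpl();
    attrs.put(TermAttribute.class, impl);
    attrs.put(CharTermAttribute.class, impl);
    return (A) impl;
  }
}
```

A Tokenizer that requests TermAttribute and a downstream TokenFilter that adds CharTermAttribute therefore read and write the same underlying term state.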
To also support byte[]-only terms, as Collation or NumericField need, a
separate getter-only interface will be added that returns a reusable BytesRef,
e.g. BytesRefGetterAttribute. The default implementation class will also
support this interface. For backwards compatibility with old self-made
TermAttribute implementations, the indexer will check with hasAttribute()
whether the BytesRef getter interface is present and, if not, will wrap an
old-style TermAttribute (a deprecated wrapper class will be provided): new
BytesRefGetterAttributeWrapper(TermAttribute), which the indexer then uses.
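The backwards-compatibility path might look like this sketch. BytesRef, BytesRefGetterAttribute, and the wrapper name come from the proposal; their shapes here (and the use of a plain CharSequence in place of the old TermAttribute) are assumptions for illustration:

```java
import java.nio.charset.StandardCharsets;

// Minimal stand-in for the reusable byte container named in the proposal.
class BytesRef {
  final byte[] bytes;
  BytesRef(byte[] bytes) { this.bytes = bytes; }
}

// Getter-only view of a term as bytes, as the proposal describes.
interface BytesRefGetterAttribute {
  BytesRef getBytesRef();
}

// Deprecated-style adapter: when hasAttribute() shows no byte view is
// present, the indexer wraps the old char-based term and encodes it to
// UTF-8 on demand. A CharSequence stands in for the old TermAttribute.
class BytesRefGetterAttributeWrapper implements BytesRefGetterAttribute {
  private final CharSequence term;
  BytesRefGetterAttributeWrapper(CharSequence term) { this.term = term; }

  public BytesRef getBytesRef() {
    return new BytesRef(term.toString().getBytes(StandardCharsets.UTF_8));
  }
}
```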
was:
For flexible indexing, terms can be simple byte[] arrays, while the current
TermAttribute only supports char[]. This is fine for plain text, but e.g.
NumericTokenStream should work directly on the byte[] array.
TermAttribute also lacks some interfaces that would make it simpler for users
to work with: Appendable and CharSequence.
I propose to create a new interface "ExtendedTermAttribute extends
TermAttribute". The corresponding -Impl class is always an implementation
that implements ExtendedTermAttribute. So if somebody adds a TermAttribute to
an AttributeSource, they will get an implementation class that can also be
used as ExtendedTermAttribute. As both attributes create the same impl
instance, both calls to addAttribute are equal. So a TokenFilter that adds
ExtendedTermAttribute to the source will work with the same instance as the
Tokenizer that requested the (deprecated) TermAttribute.
To support both byte[] and char[], the internals will be implemented like
Token in 2.9: support for both String and char[]. So both buffers are
available, but you can only use one of them at a time. As soon as you call
getByteBuffer() while the char[] buffer is in use, it will be transformed. So
the indexer will always call getBytes() and get the UTF-8 bytes.
NumericTokenStream will modify the byte[] directly, and if no filter that
uses char[] is plugged on top, the buffer is never transformed.
This issue will also convert the rest of NRQ to byte[] and deprecate all old
methods in NumericUtils. NRQ will directly request a BytesRef from splitRange
and so on.
> Replacement for TermAttribute+Impl with extended capabilities (byte[]
> support, CharSequence, Appendable)
> --------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-2302
> URL: https://issues.apache.org/jira/browse/LUCENE-2302
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: Flex Branch
> Reporter: Uwe Schindler
> Fix For: Flex Branch
>
>
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]