[
https://issues.apache.org/jira/browse/LUCENE-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188103#comment-13188103
]
Uwe Schindler commented on LUCENE-3666:
---------------------------------------
Here my commets as posted on IRC:
22:38 ThetaPh1 + A CharStream adds character offset correction
functionality over
22:38 ThetaPh1 + {@link java.io.Reader}. All Tokenizers accept a
CharStream instead of
22:38 ThetaPh1 + Reader as input, which enables arbitrary character
based filtering
22:38 ThetaPh1 + before tokenization.
22:39 ThetaPh1 ah charfilters are also there
22:39 ThetaPh1 because that description is a little bit limited,
charstreams on itsself are never used
22:40 sarowe right
22:40 ThetaPh1 but there is missing some general information what
CharFilters do, at least I dont see it in the patch
22:40 ThetaPh1 the reader simply say: wtf is this charstream good for?
22:40 sarowe good point
22:40 sarowe I'll revisit
22:41 ThetaPh1 in the following para i would replace CharStream by
CharFilter
22:41 sarowe (I know more about CharFilter guts after working on
HTMLStripCharFilter replacement)
22:41 ThetaPh1 the input is in all cases a Reader
22:41 ThetaPh1 hehe yes
22:41 ThetaPh1 in my opinion the charfilters are horrible by the design
22:41 ThetaPh1 we changed it shortly before 2.9 to fix some very bad
behaviour
22:41 sarowe right, I recall that - performance fixes
22:41 ThetaPh1 but its still hard to understand whats going on
22:42 sarowe yes, and no docs
22:42 ThetaPh1 the problem is that they wrap Readers
22:42 ThetaPh1 and instanceof checks in Tokenizer and so on
22:42 sarowe I've added a little more docs in the JFlexHTMLStripCharFilter
issue
22:42 ThetaPh1 to prevent those instanceof checks everywhere in code,
Tokenizer has a correctOffset method, right?
22:43 sarowe ok, I know about the method, didn't know that was why it was
there
22:43 ThetaPh1 + <b>Lucene 2.9 introduced a new TokenStream API.
Please see the section "New TokenStream API" below for more details.</b>
22:43 ThetaPh1 we should chnage the second sentence, there is no old
api anymore
22:43 sarowe right
22:44 sarowe in trunk, anyway
22:45 ThetaPh1 in 3.x, the same
22:45 ThetaPh1 and remove "new"
22:45 ThetaPh1 the example with LengthFilter is good
22:45 sarowe cool
22:45 ThetaPh1 as it shows as example how its implemented (for
filtering tokens based on accept())
22:46 ThetaPh1 but also how a conventional filter would look like
22:46 sarowe right
22:47 ThetaPh1 equals and hascode no longer need to be implemented
22:47 ThetaPh1 its no longer required
22:47 sarowe ok
22:48 ThetaPh1 + {@literal @Override}
22:48 ThetaPh1 public void copyTo(AttributeImpl target) {
22:48 ThetaPh1 ((PartOfSpeechAttributeImpl) target).pos = pos;
22:48 ThetaPh1 }
22:48 ThetaPh1 this one shpoudl not cast to *Impl
22:48 ThetaPh1 it should simply cast to the interface
22:48 sarowe ok
22:48 ThetaPh1 its done like this in all attributes in lucene, maybe
we missed that one in docs
22:49 sarowe I'll check
22:49 ThetaPh1 the idea is that e.g. a CharTermAttribute can be copied
to a good old Token (die,die,die)
22:49 ThetaPh1 so the copy operation should not rely on the type
22:49 ThetaPh1 i mean impl
22:49 sarowe right, the interface instead
22:50 ThetaPh1 ((PartOfSpeechAttributeImpl) target).setPos(pos);
22:50 ThetaPh1 something like that
22:50 ThetaPh1 a without impl
22:50 sarowe :) right
22:50 ThetaPh1 ((PartOfSpeechAttribute) target).setPos(pos);
22:50 sarowe ok
22:50 ThetaPh1 attributes also no longer need to impl toString(), but
thats not in the example
22:51 ThetaPh1 they can implement reflectWith for nice debugging
output in solr
22:51 ThetaPh1 but thats too much information
22:51 sarowe :)
22:51 ThetaPh1 just remove the hashcode/equals and toString if they
are in exaple
22:51 ThetaPh1 a minimum example would be ideal
22:51 sarowe ok
22:52 ThetaPh1 +<code>AttributeImpl</code> class and therefore
implements its abstract methods <code>clear(), copyTo(), equals(),
hashCode()</code>.
22:52 ThetaPh1 not sure how this is solved in 3.x
22:52 ThetaPh1 in trunk they are gone
22:52 ThetaPh1 (have to look up)
22:52 sarowe ok
22:52 ThetaPh1 i only know that in 3.x most attributes that existed
before simply implement equals/hashcode
22:52 ThetaPh1 but just for backwards reasons
22:53 sarowe ok
22:53 ThetaPh1 one thing
22:54 ThetaPh1 you should note for CharTermAttribute that it implemens
CharSequence and Appendable
22:54 ThetaPh1 i had a code review before
22:54 ThetaPh1 and have seen stupidness like calling toString() useless
22:54 sarowe right
22:54 ThetaPh1 i have seen people doing termAtt.toString().length() <
10 in a lengthfilter-like fileter
22:54 sarowe that's the main reason for CharTermAttr to replace TermAttr, I
believe
22:55 ThetaPh1 yes
22:55 ThetaPh1 otherwise I see nothing wrong
22:55 sarowe cool, thanks for the review
> Update org.apache.lucene.analysis package summary
> -------------------------------------------------
>
> Key: LUCENE-3666
> URL: https://issues.apache.org/jira/browse/LUCENE-3666
> Project: Lucene - Java
> Issue Type: Improvement
> Components: general/javadocs
> Affects Versions: 3.5
> Reporter: Steven Rowe
> Assignee: Steven Rowe
> Priority: Minor
> Attachments: LUCENE-3666-branch_3x.patch,
> LUCENE-3666-branch_3x.patch, LUCENE-3666-branch_3x.patch,
> LUCENE-3666-trunk.patch
>
>
> {{package.html}} in {{lucene/src/java/org/apache/lucene/analysis/}} is out of
> date.
> It looks like the contents of the branch_3x version haven't changed
> substantially since the Lucene 2.9 release, e.g. it refers to
> {{TermAttribute}} instead of {{CharTermAttribute}}.
> The trunk version is more modern - it refers to {{CharTermAttribute}} - but
> it also has some issues. E.g., I can see that the {{LengthFilter}}
> discussion doesn't refer to {{FilteringTokenFilter}}.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]