[jira] [Commented] (LUCENE-3666) Update org.apache.lucene.analysis package summary

Uwe Schindler (Commented) (JIRA) Tue, 17 Jan 2012 14:58:04 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188103#comment-13188103
 ]


Uwe Schindler commented on LUCENE-3666:
---------------------------------------

Here are my comments as posted on IRC:

22:38   ThetaPh1        + A CharStream adds character offset correction 
functionality over
22:38   ThetaPh1        + {@link java.io.Reader}. All Tokenizers accept a 
CharStream instead of
22:38   ThetaPh1        + Reader as input, which enables arbitrary character 
based filtering
22:38   ThetaPh1        + before tokenization.
22:39   ThetaPh1        ah charfilters are also there
22:39   ThetaPh1        because that description is a little bit limited, 
charstreams on their own are never used
22:40   sarowe  right
22:40   ThetaPh1        but some general information about what CharFilters 
do is missing, at least I don't see it in the patch
22:40   ThetaPh1        the reader will simply ask: wtf is this charstream good for?
22:40   sarowe  good point
22:40   sarowe  I'll revisit
22:41   ThetaPh1        in the following para i would replace CharStream by 
CharFilter
22:41   sarowe  (I know more about CharFilter guts after working on 
HTMLStripCharFilter replacement)
22:41   ThetaPh1        the input is in all cases a Reader
22:41   ThetaPh1        hehe yes
22:41   ThetaPh1        in my opinion the charfilters are horrible by design
22:41   ThetaPh1        we changed it shortly before 2.9 to fix some very bad 
behaviour
22:41   sarowe  right, I recall that - performance fixes
22:41   ThetaPh1        but it's still hard to understand what's going on
22:42   sarowe  yes, and no docs
22:42   ThetaPh1        the problem is that they wrap Readers
22:42   ThetaPh1        and instanceof checks in Tokenizer and so on
22:42   sarowe  I've added a little more docs in the JFlexHTMLStripCharFilter 
issue
22:42   ThetaPh1        to prevent those instanceof checks everywhere in code, 
Tokenizer has a correctOffset method, right?
22:43   sarowe  ok, I know about the method, didn't know that was why it was 
there
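
The point about keeping the instanceof check in one place can be sketched as follows. This is only an illustration under assumptions: OffsetCorrectingReader, PrefixSkippingReader, TokenizerSketch and DemoTokenizer are hypothetical stand-ins invented here so the snippet compiles without Lucene; they are not the CharStream or Tokenizer classes from the patch.

```java
import java.io.FilterReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for CharStream: a Reader that can map an offset in
// the filtered text back to an offset in the original text.
abstract class OffsetCorrectingReader extends FilterReader {
    protected OffsetCorrectingReader(Reader in) { super(in); }
    abstract int correctOffset(int currentOff);
}

// Hypothetical char filter that hides a fixed-length prefix from the tokenizer.
final class PrefixSkippingReader extends OffsetCorrectingReader {
    private final int skipped;

    PrefixSkippingReader(String text, int prefixLength) {
        super(new StringReader(text.substring(prefixLength)));
        this.skipped = prefixLength;
    }

    // Every offset seen by the tokenizer is shifted by the hidden prefix.
    @Override int correctOffset(int currentOff) { return currentOff + skipped; }
}

// Hypothetical stand-in for Tokenizer: the instanceof check lives in one
// helper, so subclasses never need to inspect the input type themselves.
abstract class TokenizerSketch {
    protected final Reader input;

    protected TokenizerSketch(Reader input) { this.input = input; }

    protected final int correctOffset(int currentOff) {
        return (input instanceof OffsetCorrectingReader)
                ? ((OffsetCorrectingReader) input).correctOffset(currentOff)
                : currentOff;
    }
}

final class DemoTokenizer extends TokenizerSketch {
    DemoTokenizer(Reader input) { super(input); }

    // A real tokenizer would call correctOffset() when setting token offsets.
    int originalOffset(int offsetInFilteredText) {
        return correctOffset(offsetInFilteredText);
    }
}

class CorrectOffsetDemo {
    public static void main(String[] args) {
        DemoTokenizer t = new DemoTokenizer(new PrefixSkippingReader("<b>text", 3));
        System.out.println(t.originalOffset(0));
    }
}
```

With a plain Reader as input the helper returns the offset unchanged, so tokenizer subclasses behave the same whether or not a char filter sits in front of them.
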
22:43   ThetaPh1        + <b>Lucene 2.9 introduced a new TokenStream API. 
Please see the section "New TokenStream API" below for more details.</b>
22:43   ThetaPh1        we should change the second sentence, there is no old 
api anymore
22:43   sarowe  right
22:44   sarowe  in trunk, anyway
22:45   ThetaPh1        in 3.x, the same
22:45   ThetaPh1        and remove "new"
22:45   ThetaPh1        the example with LengthFilter is good
22:45   sarowe  cool
22:45   ThetaPh1        as it shows by example how it's implemented (for 
filtering tokens based on accept())
22:46   ThetaPh1        but also what a conventional filter would look like
22:46   sarowe  right
22:47   ThetaPh1        equals and hashCode no longer need to be implemented
22:47   ThetaPh1        it's no longer required
22:47   sarowe  ok
22:48   ThetaPh1        + {@literal @Override}
22:48   ThetaPh1        public void copyTo(AttributeImpl target) {
22:48   ThetaPh1        ((PartOfSpeechAttributeImpl) target).pos = pos;
22:48   ThetaPh1        }
22:48   ThetaPh1        this one should not cast to *Impl
22:48   ThetaPh1        it should simply cast to the interface
22:48   sarowe  ok
22:48   ThetaPh1        it's done like this in all attributes in lucene, maybe 
we missed that one in docs
22:49   sarowe  I'll check
22:49   ThetaPh1        the idea is that e.g. a CharTermAttribute can be copied 
to a good old Token (die,die,die)
22:49   ThetaPh1        so the copy operation should not rely on the type
22:49   ThetaPh1        i mean impl
22:49   sarowe  right, the interface instead
22:50   ThetaPh1        ((PartOfSpeechAttributeImpl) target).setPos(pos);
22:50   ThetaPh1        something like that
22:50   ThetaPh1        a without impl
22:50   sarowe  :) right
22:50   ThetaPh1        ((PartOfSpeechAttribute) target).setPos(pos);
22:50   sarowe  ok
22:50   ThetaPh1        attributes also no longer need to impl toString(), but 
that's not in the example
22:51   ThetaPh1        they can implement reflectWith for nice debugging 
output in solr
22:51   ThetaPh1        but that's too much information
22:51   sarowe  :)
22:51   ThetaPh1        just remove the hashCode/equals and toString if they 
are in the example
22:51   ThetaPh1        a minimum example would be ideal
22:51   sarowe  ok
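
A minimal attribute along the lines discussed above might look like the sketch below: only clear() and copyTo(), with copyTo() casting the target to the interface and not to the Impl class. To keep it compilable without Lucene, SketchAttribute and SketchAttributeImpl are hypothetical stand-ins for Attribute and AttributeImpl, setPos/getPos with a String value are simplified assumptions, and CombinedTokenSketch is an invented class that plays the role of a "good old Token" implementing the same interface.

```java
// Hypothetical stand-ins for Lucene's Attribute and AttributeImpl.
interface SketchAttribute {}

abstract class SketchAttributeImpl implements SketchAttribute {
    public abstract void clear();
    public abstract void copyTo(SketchAttributeImpl target);
}

// The attribute interface: this is what copyTo() should cast to.
interface PartOfSpeechAttribute extends SketchAttribute {
    void setPos(String pos);
    String getPos();
}

// Minimal implementation: no equals(), hashCode() or toString().
final class PartOfSpeechAttributeImpl extends SketchAttributeImpl
        implements PartOfSpeechAttribute {
    private String pos = "Unknown";

    @Override public void setPos(String pos) { this.pos = pos; }
    @Override public String getPos() { return pos; }

    @Override public void clear() { pos = "Unknown"; }

    @Override public void copyTo(SketchAttributeImpl target) {
        // Cast to the interface, so any implementation can be the target.
        ((PartOfSpeechAttribute) target).setPos(pos);
    }
}

// Invented second implementation of the same interface, standing in for a
// class such as Token that implements several attribute interfaces at once.
final class CombinedTokenSketch extends SketchAttributeImpl
        implements PartOfSpeechAttribute {
    private String pos = "Unknown";

    @Override public void setPos(String pos) { this.pos = pos; }
    @Override public String getPos() { return pos; }
    @Override public void clear() { pos = "Unknown"; }
    @Override public void copyTo(SketchAttributeImpl target) {
        ((PartOfSpeechAttribute) target).setPos(pos);
    }
}

class CopyToDemo {
    public static void main(String[] args) {
        PartOfSpeechAttributeImpl source = new PartOfSpeechAttributeImpl();
        source.setPos("Noun");
        CombinedTokenSketch target = new CombinedTokenSketch();
        source.copyTo(target); // would fail with a cast to PartOfSpeechAttributeImpl
        System.out.println(target.getPos());
    }
}
```

Had copyTo() cast to PartOfSpeechAttributeImpl, copying into CombinedTokenSketch would throw a ClassCastException, which is the situation the review comment is meant to avoid.
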
22:52   ThetaPh1        +<code>AttributeImpl</code> class and therefore 
implements its abstract methods <code>clear(), copyTo(), equals(), 
hashCode()</code>.
22:52   ThetaPh1        not sure how this is solved in 3.x
22:52   ThetaPh1        in trunk they are gone
22:52   ThetaPh1        (have to look up)
22:52   sarowe  ok
22:52   ThetaPh1        i only know that in 3.x most attributes that existed 
before simply implement equals/hashcode
22:52   ThetaPh1        but just for backwards reasons
22:53   sarowe  ok
22:53   ThetaPh1        one thing
22:54   ThetaPh1        you should note for CharTermAttribute that it implements 
CharSequence and Appendable
22:54   ThetaPh1        i had a code review before
22:54   ThetaPh1        and have seen silliness like calling toString() needlessly
22:54   sarowe  right
22:54   ThetaPh1        i have seen people doing termAtt.toString().length() < 
10 in a lengthfilter-like filter
22:54   sarowe  that's the main reason for CharTermAttr to replace TermAttr, I 
believe
22:55   ThetaPh1        yes
22:55   ThetaPh1        otherwise I see nothing wrong
22:55   sarowe  cool, thanks for the review
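
The last point in the log, that a term attribute implementing CharSequence and Appendable makes toString() unnecessary for length checks, can be illustrated roughly as below. TermBufferSketch is a hypothetical stand-in written for this note (backed by a StringBuilder for brevity) and is not CharTermAttribute itself; LengthCheck and its method names are likewise invented.

```java
// Hypothetical stand-in for a term attribute that is both a CharSequence
// and an Appendable, so callers can read and build the term directly.
final class TermBufferSketch implements CharSequence, Appendable {
    private final StringBuilder buf = new StringBuilder();

    @Override public int length() { return buf.length(); }
    @Override public char charAt(int index) { return buf.charAt(index); }
    @Override public CharSequence subSequence(int start, int end) {
        return buf.subSequence(start, end);
    }
    @Override public String toString() { return buf.toString(); }

    @Override public TermBufferSketch append(CharSequence csq) {
        buf.append(csq);
        return this;
    }
    @Override public TermBufferSketch append(CharSequence csq, int start, int end) {
        buf.append(csq, start, end);
        return this;
    }
    @Override public TermBufferSketch append(char c) {
        buf.append(c);
        return this;
    }

    // Resets the term to the empty string.
    TermBufferSketch setEmpty() {
        buf.setLength(0);
        return this;
    }
}

final class LengthCheck {
    // Wasteful: allocates a new String for every token just to read its length.
    static boolean acceptViaToString(TermBufferSketch term, int limit) {
        return term.toString().length() < limit;
    }

    // Preferred: length() comes straight from the CharSequence interface.
    static boolean accept(CharSequence term, int limit) {
        return term.length() < limit;
    }
}

class TermBufferDemo {
    public static void main(String[] args) {
        TermBufferSketch term = new TermBufferSketch();
        term.append("lucene").append('!');
        System.out.println(LengthCheck.accept(term, 10));
    }
}
```

Both checks give the same answer; the second simply avoids creating a throwaway String per token.
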

                
> Update org.apache.lucene.analysis package summary
> -------------------------------------------------
>
>                 Key: LUCENE-3666
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3666
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: general/javadocs
>    Affects Versions: 3.5
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-3666-branch_3x.patch, 
> LUCENE-3666-branch_3x.patch, LUCENE-3666-branch_3x.patch, 
> LUCENE-3666-trunk.patch
>
>
> {{package.html}} in {{lucene/src/java/org/apache/lucene/analysis/}} is out of 
> date.
> It looks like the contents of the branch_3x version haven't changed 
> substantially since the Lucene 2.9 release, e.g. it refers to 
> {{TermAttribute}} instead of {{CharTermAttribute}}.
> The trunk version is more modern - it refers to {{CharTermAttribute}} - but 
> it also has some issues.  E.g., I can see that the {{LengthFilter}} 
> discussion doesn't refer to {{FilteringTokenFilter}}.
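
For readers unfamiliar with the accept()-based style mentioned in the description, here is a rough sketch of the idea: a base class owns the loop that skips unwanted tokens, and a length filter only supplies accept(). FilteringSketch and LengthFilterSketch are hypothetical, Lucene-free stand-ins (tokens are plain Strings from an Iterator), so this shows the pattern only and not the actual FilteringTokenFilter or LengthFilter code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the filtering base class: it advances the input
// and skips every token that the subclass does not accept.
abstract class FilteringSketch {
    private final Iterator<String> input;
    protected String term; // current token text, stands in for a term attribute

    protected FilteringSketch(Iterator<String> input) { this.input = input; }

    // Subclasses decide whether the current token is kept.
    protected abstract boolean accept();

    // Moves to the next accepted token; returns false when input is exhausted.
    public final boolean incrementToken() {
        while (input.hasNext()) {
            term = input.next();
            if (accept()) {
                return true;
            }
        }
        return false;
    }
}

// Hypothetical length filter: keeps tokens whose length is within [min, max].
final class LengthFilterSketch extends FilteringSketch {
    private final int min;
    private final int max;

    LengthFilterSketch(Iterator<String> input, int min, int max) {
        super(input);
        this.min = min;
        this.max = max;
    }

    @Override protected boolean accept() {
        int len = term.length();
        return len >= min && len <= max;
    }

    // Convenience helper for the demo: drains the filter into a list.
    static List<String> filter(List<String> tokens, int min, int max) {
        LengthFilterSketch f = new LengthFilterSketch(tokens.iterator(), min, max);
        List<String> kept = new ArrayList<>();
        while (f.incrementToken()) {
            kept.add(f.term);
        }
        return kept;
    }
}

class LengthFilterDemo {
    public static void main(String[] args) {
        List<String> tokens = List.of("a", "quick", "brown", "extraordinarily", "fox");
        System.out.println(LengthFilterSketch.filter(tokens, 3, 5));
    }
}
```

The subclass contains no looping logic at all, which is what makes this style a compact example for the package summary.
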

