[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Steven Rowe (Commented) (JIRA) Sun, 01 Apr 2012 09:58:52 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243784#comment-13243784 ]


Steven Rowe commented on LUCENE-3940:
-------------------------------------

bq. I think it's well accepted that words carry the information content of a 
doc, punctuation has no information content really here, it doesn't tell me 
what the doc is about, and I don't think this is controversial, I just think 
your view on this is extreme...

I disagree with you, Robert.  (If punctuation has no information content, why 
does it exist?)  IMHO Mike's examples are not at all extreme, e.g. some 
punctuation tokens could be used to trigger position increment gaps.
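
To make the "hole" idea concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: PosToken, HoleSketch and their methods are hypothetical names invented for illustration. It models only the general convention that each token carries a position increment relative to the previous token, so a dropped token can either vanish silently or leave a gap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Lucene-free model of a token; not the real Lucene classes.
final class PosToken {
    final String term;
    final int posInc; // distance from the previously emitted token's position

    PosToken(String term, int posInc) {
        this.term = term;
        this.posInc = posInc;
    }
}

final class HoleSketch {
    // True if the term is non-empty and contains no letter or digit.
    static boolean isPunctuation(String term) {
        for (int i = 0; i < term.length(); ) {
            int cp = term.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return !term.isEmpty();
    }

    // Drops punctuation terms. With leaveHoles, each dropped term adds one to the
    // next surviving token's increment; without it, every survivor gets increment 1.
    static List<PosToken> dropPunctuation(List<String> terms, boolean leaveHoles) {
        List<PosToken> out = new ArrayList<>();
        int pending = 0;
        for (String term : terms) {
            if (isPunctuation(term)) {
                if (leaveHoles) {
                    pending++;
                }
                continue;
            }
            out.add(new PosToken(term, 1 + pending));
            pending = 0;
        }
        return out;
    }

    // Absolute positions: start before the first token and add each increment.
    static List<Integer> positions(List<PosToken> tokens) {
        List<Integer> result = new ArrayList<>();
        int pos = -1;
        for (PosToken t : tokens) {
            pos += t.posInc;
            result.add(pos);
        }
        return result;
    }
}
```

With the terms "tokyo", ",", "osaka", the no-hole variant places the two words at adjacent positions, while the hole-leaving variant keeps a one-position gap between them, which is what an exact phrase match across the punctuation would have to step over.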

bq. StandardTokenizer doesn't leave holes when it drops punctuation, I think 
holes should only be real 'words' for the most part

"Standard"Tokenizer is drawn from Unicode UAX#29, which only describes word 
*boundaries*.  Lucene has grafted onto these boundary rules an assumption that 
only alphanumeric "words" should be tokens - this assumption does not exist in 
the standard itself.

My opinion is that we should have both types of things: a tokenizer that 
discards non-alphanumeric characters between word boundaries; and a different 
type of analysis component that discards nothing.  I think of the 
discard-nothing process as *segmentation* rather than tokenization, and I've 
[argued for it previously|https://issues.apache.org/jira/browse/LUCENE-2498?focusedCommentId=12878963&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12878963].
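
The distinction between segmentation and tokenization can be sketched in a few lines of plain Java. This is an illustration only: SegmentationSketch and its methods are hypothetical names, and the JDK's java.text.BreakIterator is used as a stand-in boundary finder whose rules are similar in spirit to, but not guaranteed identical to, UAX#29. Segmentation keeps every span between adjacent boundaries; tokenization is then just a filter layered on top.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; not part of Lucene.
final class SegmentationSketch {
    // Discard-nothing segmentation: every span between adjacent boundaries is kept,
    // including punctuation and whitespace.
    static List<String> segment(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    // True if the span contains at least one letter or digit.
    static boolean hasLetterOrDigit(String s) {
        return s.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    // Tokenization as a filter over segmentation: keep only "word-like" spans.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String seg : segment(text)) {
            if (hasLetterOrDigit(seg)) {
                out.add(seg);
            }
        }
        return out;
    }
}
```

Because segment() discards nothing, concatenating its output reproduces the input exactly, and any downstream component remains free to decide what to do with the punctuation spans.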
                
> When Japanese (Kuromoji) tokenizer removes a punctuation token it should 
> leave a hole
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3940
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3940
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch
>
>
> I modified BaseTokenStreamTestCase to assert that the start/end
> offsets match for graph (posLen > 1) tokens, and this caught a bug in
> Kuromoji when the decompounding of a compound token has a punctuation
> token that's dropped.
> In this case we should leave hole(s) so that the graph is intact, i.e.,
> the graph should look the same as if the punctuation tokens were not
> initially removed, but then a StopFilter had removed them.
> This also affects tokens that have no compound over them, i.e. we fail
> to leave a hole today when we remove the punctuation tokens.
> I'm not sure this is serious enough to warrant fixing in 3.6 at the
> last minute...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

