[
https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243784#comment-13243784
]
Steven Rowe commented on LUCENE-3940:
-------------------------------------
bq. I think its well accepted that words carry the information content of a
doc, punctuation has no information content really here, it doesn't tell me
what the doc is about, and I don't think this is controversial, I just think
your view on this is extreme...
I disagree with you, Robert. (If punctuation has no information content, why
does it exist?) IMHO Mike's examples are not at all extreme, e.g. some
punctuation tokens could be used to trigger position increment gaps.
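To illustrate what a punctuation-triggered gap could look like, here is a minimal sketch in plain Java. It does not use Lucene's API: the names HoleSketch, Tok, and dropPunctuation are hypothetical, and "punctuation" is assumed to mean any segment with no letter or digit in it. A dropped segment either vanishes silently or adds one to the position increment of the next kept token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Lucene code.
final class HoleSketch {

    /** A term plus its position increment relative to the previously emitted token. */
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    /** Assumption: a segment is punctuation when it contains no letter or digit. */
    static boolean isPunctuation(String segment) {
        return !segment.isEmpty()
                && segment.codePoints().noneMatch(Character::isLetterOrDigit);
    }

    /**
     * Drops punctuation segments. With leaveHoles set, each dropped segment
     * increases the position increment of the next kept token by one.
     */
    static List<Tok> dropPunctuation(List<String> segments, boolean leaveHoles) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0;
        for (String segment : segments) {
            if (isPunctuation(segment)) {
                if (leaveHoles) {
                    skipped++;
                }
                continue;
            }
            out.add(new Tok(segment, 1 + skipped));
            skipped = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> segments = List.of("foo", ",", "bar");
        System.out.println(dropPunctuation(segments, false)); // [foo(+1), bar(+1)]
        System.out.println(dropPunctuation(segments, true));  // [foo(+1), bar(+2)]
    }
}
```

With the hole in place, a position-sensitive consumer such as a phrase query can tell that "foo" and "bar" were not adjacent in the original text.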
bq. StandardTokenizer doesnt leave holes when it drops punctuation, I think
holes should only be real 'words' for the most part
"Standard"Tokenizer is drawn from Unicode UAX#29, which only describes word
*boundaries*. Lucene has grafted onto these boundary rules an assumption that
only alphanumeric "words" should be tokens - this assumption does not exist in
the standard itself.
My opinion is that we should have both types of things: a tokenizer that
discards non-alphanumeric characters between word boundaries, and a different
type of analysis component that discards nothing. I think of the
discard-nothing process as *segmentation* rather than tokenization, and I've
[argued for it
previously|https://issues.apache.org/jira/browse/LUCENE-2498?focusedCommentId=12878963&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12878963].
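The distinction between the two kinds of component can be sketched roughly as follows. This is an illustration under stated assumptions, not Lucene's implementation: it uses the JDK's java.text.BreakIterator as a stand-in for word-boundary detection (its rules are not guaranteed to match UAX#29 or StandardTokenizer exactly), and the names SegmentationSketch, segment, and tokenize are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; BreakIterator stands in for a UAX#29-style boundary finder.
final class SegmentationSketch {

    /** Segmentation: return every span between word boundaries, discarding nothing. */
    static List<String> segment(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    /** Tokenization: keep only the segments that contain at least one letter or digit. */
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String seg : segment(text)) {
            if (seg.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(seg);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello, world!";
        System.out.println(segment(text));  // every span, punctuation and spaces included
        System.out.println(tokenize(text)); // only the alphanumeric "words"
    }
}
```

Because segment() discards nothing, concatenating its output reproduces the input exactly, so a downstream filter can still decide what to do with each punctuation span (drop it, leave a hole, or index it).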
> When Japanese (Kuromoji) tokenizer removes a punctuation token it should
> leave a hole
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-3940
> URL: https://issues.apache.org/jira/browse/LUCENE-3940
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 4.0
>
> Attachments: LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch
>
>
> I modified BaseTokenStreamTestCase to assert that the start/end
> offsets match for graph (posLen > 1) tokens, and this caught a bug in
> Kuromoji when the decompounding of a compound token has a punctuation
> token that's dropped.
> In this case we should leave hole(s) so that the graph is intact, ie,
> the graph should look the same as if the punctuation tokens were not
> initially removed, but then a StopFilter had removed them.
> This also affects tokens that have no compound over them, ie we fail
> to leave a hole today when we remove the punctuation tokens.
> I'm not sure this is serious enough to warrant fixing in 3.6 at the
> last minute...