[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Robert Muir (Commented) (JIRA) Mon, 02 Apr 2012 04:15:49 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244124#comment-13244124 ]


Robert Muir commented on LUCENE-3940:
-------------------------------------

{quote}
This is certainly useful in the case of information extraction. For example, if 
we'd like to extract noun-phrases based on part-of-speech tags, we don't want 
to conjoin tokens in case there's a punctuation character between two nouns 
(unless that punctuation character is a middle dot).
{quote}

The option still exists in Kuromoji (discardPunctuation=false) if you want to
use it for this. I added this option because the tokenizer originally kept the
punctuation (for downstream filters to remove).

lucene-gosen worked the same way, and after some time I saw *numerous*
examples across the internet where people simply configured the tokenizer
without any filters, which meant huge amounts of punctuation being indexed by
default. People don't pay attention to documentation or details, so all these
people were getting bad performance.

Based on this experience, I didn't want keeping punctuation to be the default,
nor even achievable via things like Solr factories here. But I added the
(expert) option to Kuromoji because it's really a more general-purpose tool
for Japanese analysis, because it's already being used for other things, and
because allowing a boolean was not expensive or complex to support.

But I don't consider this a bona fide option of the Lucene APIs; I would be
strongly against adding this to the Solr factories, or as an option to
KuromojiAnalyzer. And I don't think we should add such a thing to other
tokenizers either.

Our general mission is search. If we want to decide that we are expanding our
analysis API to be generally useful outside of information retrieval, that's a
much bigger decision with more complex tradeoffs that everyone should vote on
(e.g. moving analyzers outside of lucene.apache.org and everything).


                
> When Japanese (Kuromoji) tokenizer removes a punctuation token it should 
> leave a hole
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3940
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3940
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch, 
> LUCENE-3940.patch
>
>
> I modified BaseTokenStreamTestCase to assert that the start/end
> offsets match for graph (posLen > 1) tokens, and this caught a bug in
> Kuromoji when the decompounding of a compound token has a punctuation
> token that's dropped.
> In this case we should leave hole(s) so that the graph is intact, ie,
> the graph should look the same as if the punctuation tokens were not
> initially removed, but then a StopFilter had removed them.
> This also affects tokens that have no compound over them, ie we fail
> to leave a hole today when we remove the punctuation tokens.
> I'm not sure this is serious enough to warrant fixing in 3.6 at the
> last minute...
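
To make the "hole" idea in the description concrete, here is a minimal sketch
in plain Java. It does not use the Lucene API: the class PunctuationHoleSketch,
its Token and Emitted types, and the dropPunctuation method are all
hypothetical names made up for illustration. The only thing borrowed from
Lucene is the concept of a position increment, where an increment greater
than 1 records that one or more tokens were removed just before the emitted
token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: plain Java, not Lucene classes. All names are hypothetical.
public class PunctuationHoleSketch {

    /** An input token plus a flag saying whether it is punctuation. */
    public static final class Token {
        final String text;
        final boolean punctuation;

        public Token(String text, boolean punctuation) {
            this.text = text;
            this.punctuation = punctuation;
        }
    }

    /** An emitted token with its position increment relative to the previous emitted token. */
    public static final class Emitted {
        public final String text;
        public final int positionIncrement;

        Emitted(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    /**
     * Drops punctuation tokens. When leaveHoles is true, each dropped token
     * adds 1 to the position increment of the next emitted token, so the gap
     * stays visible downstream. When false, every emitted token gets an
     * increment of 1 and the removal is invisible.
     */
    public static List<Emitted> dropPunctuation(List<Token> tokens, boolean leaveHoles) {
        List<Emitted> out = new ArrayList<>();
        int skipped = 0;
        for (Token t : tokens) {
            if (t.punctuation) {
                if (leaveHoles) {
                    skipped++;
                }
                continue;
            }
            out.add(new Emitted(t.text, 1 + skipped));
            skipped = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> input = Arrays.asList(
                new Token("Tokyo", false),
                new Token(",", true),
                new Token("Osaka", false));

        System.out.println(dropPunctuation(input, false)); // [Tokyo(+1), Osaka(+1)]
        System.out.println(dropPunctuation(input, true));  // [Tokyo(+1), Osaka(+2)]
    }
}
```

With holes, a consumer can tell that "Tokyo" and "Osaka" were not adjacent in
the original text, which is the same picture it would get if the punctuation
had been emitted and then removed by a stop filter that preserves position
increments. Without holes the two tokens look adjacent.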

