[ 
https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200680#comment-13200680
 ] 

Christian Moen commented on LUCENE-3745:
----------------------------------------

Please find a patch attached.

I've made {{stoptags.txt}} lighter by not stopping all prefixes and also 
allowing auxiliary verbs and interjections to pass.  I didn't come across any 
occurrences of unclassified symbols (記号) in Wikipedia, but they are now stopped, as 
that seems to align better with our overall stop approach for symbols.

Many of the most frequent terms that now pass have been re-introduced in 
{{stopwords.txt}} so they are stopped using a {{StopFilter}} instead of 
{{KuromojiPartOfSpeechStopFilter}}.  I believe this configuration is more 
balanced.
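
As a rough illustration of the two-stage stopping described above, here is a 
minimal sketch in plain Python. It does not use the Lucene API; the function 
names, the sample tokens, and the tag and word sets are assumptions made up 
for the example and are not the contents of the patch.

```python
# Illustrative only: plain Python, not the Lucene API.
# The tags and words below are hypothetical samples, not the patch's lists.

def stop_by_pos(tokens, stoptags):
    """Drop tokens whose part-of-speech tag is listed in stoptags."""
    return [(term, pos) for term, pos in tokens if pos not in stoptags]


def stop_by_word(tokens, stopwords):
    """Drop tokens whose surface form is listed in stopwords."""
    return [(term, pos) for term, pos in tokens if term not in stopwords]


# (term, part-of-speech tag) pairs as a tokenizer might emit them.
tokens = [
    ("これ", "名詞-代名詞"),
    ("は", "助詞-係助詞"),
    ("寿司", "名詞-一般"),
    ("です", "助動詞"),
    ("。", "記号-句点"),
]

# A light tag list: auxiliary verbs (助動詞) are deliberately not included.
stoptags = {"助詞-係助詞", "記号-句点"}

# Frequent terms that pass the tag stage are caught by the word list instead.
stopwords = {"これ", "です"}

remaining = stop_by_word(stop_by_pos(tokens, stoptags), stopwords)
print([term for term, _ in remaining])
```

With these sample sets, the auxiliary verb passes the tag stage and is only 
removed by the word stage, which is the split the paragraph above describes.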

Overall, I've used the term frequencies attached to this issue as a governing guideline 
for what to introduce into {{stopwords.txt}}.  It mostly contains hiragana 
words and expressions and I've deliberately left out common kanji as I'd like 
to keep the stopping fairly light.
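
One way to shortlist candidates along these lines is sketched below in plain 
Python. This is an assumption about how such a shortlist could be produced, 
not the procedure actually used for the patch: the helper names, the sample 
terms, and the counts are hypothetical, and the result is only a candidate 
list that would still need manual review.

```python
# Illustrative only: the terms and counts below are made up for the example.

def is_hiragana(term):
    """True if every character is in the hiragana range U+3041..U+309F."""
    return bool(term) and all("\u3041" <= ch <= "\u309f" for ch in term)


def stopword_candidates(term_freqs, top_n):
    """Take the top_n most frequent terms and keep the hiragana-only ones."""
    ranked = sorted(term_freqs, key=lambda tf: tf[1], reverse=True)[:top_n]
    return [term for term, _ in ranked if is_hiragana(term)]


# (term, frequency) pairs, e.g. as they might be read from a frequency list.
term_freqs = [("の", 900), ("年", 500), ("する", 400), ("日本", 300), ("こと", 250)]

candidates = stopword_candidates(term_freqs, top_n=4)
print(candidates)
```

Frequent kanji terms such as 年 and 日本 are left out of the candidates here, 
which keeps the resulting list light.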

I'll create a separate JIRA for introducing stopwords and stoptags to Solr.
                
> Need stopwords and stoptags lists for default Japanese configuration
> --------------------------------------------------------------------
>
>                 Key: LUCENE-3745
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3745
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Christian Moen
>         Attachments: LUCENE-3745.patch, filter_stoptags.py, top-100000.txt, 
> top-1000000-pos.txt, top-pos.txt
>
>
> Stopwords and stoptags lists for Japanese need to be developed, tested and 
> integrated into Lucene.
