[ https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200680#comment-13200680 ]
Christian Moen commented on LUCENE-3745:
----------------------------------------

Please find a patch attached.

I've made {{stoptags.txt}} lighter by not stopping all prefixes and by allowing auxiliary verbs and interjections to pass. I didn't come across any occurrences of unclassified symbols (記号) in Wikipedia, but they are now stopped, as that seems to align better with our overall stop approach for symbols.

Many of the most frequent terms that now pass have been re-introduced in {{stopwords.txt}}, so they are stopped using a {{StopFilter}} instead of {{KuromojiPartOfSpeechStopFilter}}. I believe this configuration is more balanced.

Overall, I've used the attached term frequencies as a governing guideline for what to introduce into {{stopwords.txt}}. It mostly contains hiragana words and expressions, and I've deliberately left out common kanji as I'd like to keep the stopping fairly light.

I'll create a separate JIRA for introducing stopwords and stoptags to Solr.

> Need stopwords and stoptags lists for default Japanese configuration
> --------------------------------------------------------------------
>
>                 Key: LUCENE-3745
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3745
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Christian Moen
>         Attachments: LUCENE-3745.patch, filter_stoptags.py, top-100000.txt, top-1000000-pos.txt, top-pos.txt
>
>
> Stopwords and stoptags lists for Japanese need to be developed, tested and integrated into Lucene.
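
For illustration only, the two-stage stopping described in the comment (a part-of-speech stage driven by {{stoptags.txt}}, followed by a word stage driven by {{stopwords.txt}}) can be sketched in plain Python. This is not Lucene code and does not reproduce the attached patch or {{filter_stoptags.py}}. The tag and word entries, the sample tokens, and the helper names ({{filter_by_pos}}, {{filter_by_word}}, {{candidate_stopwords}}) are all hypothetical, and the hiragana-only selection is just one assumed way of applying the "mostly hiragana, no common kanji" guideline to a frequency list.

```python
from typing import Iterable, List, Set, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag)

# Hypothetical sample entries; NOT the contents of the patch's files.
STOP_TAGS: Set[str] = {"助詞-係助詞", "記号"}
STOP_WORDS: Set[str] = {"これ", "です"}

# Hypothetical tokenizer output for "これは寿司です".
SAMPLE_TOKENS: List[Token] = [
    ("これ", "名詞-代名詞-一般"),
    ("は", "助詞-係助詞"),
    ("寿司", "名詞-一般"),
    ("です", "助動詞"),
]


def filter_by_pos(tokens: Iterable[Token], stop_tags: Set[str]) -> List[Token]:
    """Stage 1: drop tokens whose part-of-speech tag is in the stop tag set."""
    return [tok for tok in tokens if tok[1] not in stop_tags]


def filter_by_word(tokens: Iterable[Token], stop_words: Set[str]) -> List[Token]:
    """Stage 2: drop tokens whose surface form is in the stop word set."""
    return [tok for tok in tokens if tok[0] not in stop_words]


def is_hiragana(term: str) -> bool:
    """True if the term is non-empty and every character is in U+3041..U+309F."""
    return bool(term) and all("\u3041" <= ch <= "\u309f" for ch in term)


def candidate_stopwords(term_freqs: Iterable[Tuple[str, int]], limit: int) -> List[str]:
    """Pick the most frequent hiragana-only terms as stopword candidates.

    Assumed heuristic: kanji terms are skipped to keep the stopping light.
    """
    ranked = sorted(term_freqs, key=lambda tf: tf[1], reverse=True)
    return [term for term, _ in ranked if is_hiragana(term)][:limit]


if __name__ == "__main__":
    after_pos = filter_by_pos(SAMPLE_TOKENS, STOP_TAGS)
    after_words = filter_by_word(after_pos, STOP_WORDS)
    print([surface for surface, _ in after_words])
```

In this sketch the auxiliary verb passes the tag stage because its tag is not in the stop tag set, and is only removed by the word list. That mirrors the balance described above: a lighter tag list, with frequent terms handled by the stopword list instead.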