[
https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200680#comment-13200680
]
Christian Moen commented on LUCENE-3745:
----------------------------------------
Please find a patch attached.
I've made {{stoptags.txt}} lighter by no longer stopping all prefixes and by
allowing auxiliary verbs and interjections to pass. I didn't come across any
occurrences of unclassified symbols (記号) in Wikipedia, but they are now
stopped, as that seems to align better with our overall approach to stopping
symbols.
Many of the most frequent terms that now pass have been re-introduced in
{{stopwords.txt}} so they are stopped using a {{StopFilter}} instead of
{{KuromojiPartOfSpeechStopFilter}}. I believe this configuration is more
balanced.
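The division of labour between the two filters can be sketched in plain
Python as below. This is only an illustration and not Lucene code: the
function name {{apply_stops}}, the (surface, tag) token pairs, the exact-match
tag comparison and the sample tag and word sets are all assumptions made for
the example, not the contents of the patch.

```python
# Hypothetical sketch of two-stage stopping; not the Lucene API.
# Sample sets are assumptions for illustration, not the patch contents.
SAMPLE_STOP_TAGS = {"記号", "助詞-係助詞"}   # symbols, binding particles
SAMPLE_STOP_WORDS = {"これ", "です"}         # frequent terms whose tags now pass


def apply_stops(tokens, stop_tags=SAMPLE_STOP_TAGS, stop_words=SAMPLE_STOP_WORDS):
    """Return the surface forms that survive both stopping stages.

    tokens is an iterable of (surface, pos_tag) pairs.
    """
    # Stage 1: drop tokens whose part-of-speech tag is listed (stoptags role).
    after_tags = [(surface, tag) for surface, tag in tokens if tag not in stop_tags]
    # Stage 2: drop tokens whose surface form is listed (stopwords role).
    return [surface for surface, _ in after_tags if surface not in stop_words]


tokens = [
    ("これ", "名詞-代名詞-一般"),
    ("は", "助詞-係助詞"),
    ("です", "助動詞"),
    ("検索", "名詞-サ変接続"),
    ("★", "記号"),
]
print(apply_stops(tokens))
```

In this sketch the auxiliary verb passes the tag stage, since auxiliary verbs
are no longer in the tag list, and is then caught by the word list.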
Overall, I've used the attached term frequencies as a governing guideline
for what to introduce into {{stopwords.txt}}. It mostly contains hiragana
words and expressions and I've deliberately left out common kanji as I'd like
to keep the stopping fairly light.
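One way to shortlist candidates along these lines is sketched below. It is a
rough illustration under stated assumptions, not the procedure used for the
patch: the helper names {{is_hiragana}} and {{stopword_candidates}} are
hypothetical, the input is assumed to be (term, frequency) pairs already in
memory, and the counts in the sample are invented.

```python
# Hypothetical sketch: pick frequent hiragana-only terms as stopword candidates.
def is_hiragana(term):
    """True if term is non-empty and every character is in U+3041..U+309F."""
    return bool(term) and all("\u3041" <= ch <= "\u309f" for ch in term)


def stopword_candidates(term_freqs, limit):
    """Return up to `limit` hiragana-only terms, most frequent first."""
    ranked = sorted(term_freqs, key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked if is_hiragana(term)][:limit]


# Invented sample counts, for illustration only.
sample = [("の", 900), ("日本", 800), ("こと", 700), ("年", 650), ("する", 600)]
print(stopword_candidates(sample, 2))
```

Terms containing kanji or katakana are skipped here, which keeps the
resulting list light; the final choice would still need manual review.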
I'll create a separate JIRA for introducing stopwords and stoptags to Solr.
> Need stopwords and stoptags lists for default Japanese configuration
> --------------------------------------------------------------------
>
> Key: LUCENE-3745
> URL: https://issues.apache.org/jira/browse/LUCENE-3745
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Christian Moen
> Attachments: LUCENE-3745.patch, filter_stoptags.py, top-100000.txt,
> top-1000000-pos.txt, top-pos.txt
>
>
> Stopwords and stoptags lists for Japanese need to be developed, tested and
> integrated into Lucene.