[ https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901104#action_12901104 ]
Uwe Schindler commented on SOLR-1860:
-------------------------------------

If it's documented to be UTF-8, it's clear what you have to provide (in Solr). If you use Lucene directly, the stopword file parser does not care about encodings at all; it simply takes a java.io.Reader.

> improve stopwords list handling
> -------------------------------
>
>                 Key: SOLR-1860
>                 URL: https://issues.apache.org/jira/browse/SOLR-1860
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>   Affects Versions: 3.1
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>         Attachments: SOLR-1860.patch
>
>
> Currently Solr makes it easy to use English stopwords for StopFilter or
> CommonGramsFilter.
> Recently in Lucene, we added stopword lists (mostly, but not all, from
> Snowball) to all the language analyzers.
> So it would be nice if a user could easily specify that they want to use a
> French stopword list, and use it for StopFilter or CommonGrams.
> The ones from Snowball, however, are formatted differently from the
> others (although in Lucene we have parsers to deal with this).
> Additionally, we abstract this from Lucene users by adding a static
> getDefaultStopSet to all analyzers.
> There are two approaches. I think I prefer the first, but I'm
> not sure it matters as long as we have good examples (maybe a foreign
> language example schema?)
> 1. The user would specify something like:
> <filter class="solr.StopFilterFactory"
> fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../>
> This would just grab the CharArraySet from the FrenchAnalyzer's
> getDefaultStopSet method; who cares where it comes from or how it's loaded.
> 2. We add support for Snowball-formatted stopword lists, and the user could
> specify something like:
> <filter class="solr.StopFilterFactory"
> words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball"
> ... />
> The disadvantage to this is that they have to know where the list is, what
> format it's in, etc. For example: Snowball doesn't provide Romanian or Turkish
> stopword lists to go along with their stemmers, so we had to add our own.
> Let me know what you guys think, and I will create a patch.
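
To make the Reader point concrete, here is a minimal sketch that uses only the JDK. It is not Lucene's parser: the class name StopwordSketch and both method names are hypothetical, and the format rules are an assumption on my part (a '|' starts a comment that runs to the end of the line, and a line may hold several whitespace-separated words). The parser only ever sees characters; the caller picks the charset when it wraps the byte stream, which is where a documented UTF-8 convention would apply.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene or Solr.
final class StopwordSketch {

    /**
     * Parses a Snowball-style word list from characters.
     * Assumed format: '|' begins a comment, words are separated by whitespace.
     * No encoding decision is made here; the Reader already yields chars.
     */
    static Set<String> parseSnowball(Reader reader) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                int comment = line.indexOf('|');
                if (comment >= 0) {
                    line = line.substring(0, comment); // drop trailing comment
                }
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
        }
        return words;
    }

    /** Caller-side convention: decode the bytes as UTF-8, then hand over a Reader. */
    static Set<String> parseSnowballUtf8(InputStream in) throws IOException {
        return parseSnowball(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Tiny made-up sample, not the real french_stop.txt contents.
        String sample = " | sample list\nau aux | contractions\n\u00e9t\u00e9\nle la\n";
        System.out.println(parseSnowball(new StringReader(sample)));
    }
}
```

With this split, a Solr-style loader would always go through something like parseSnowballUtf8, while a direct caller is free to pass a Reader built with any charset.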