[jira] Commented: (SOLR-1860) improve stopwords list handling

Yonik Seeley (JIRA) Sat, 03 Apr 2010 07:03:51 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853135#action_12853135
 ]


Yonik Seeley commented on SOLR-1860:
------------------------------------

How many languages are we talking? 

 I like the idea of an export - it's transparent and neatly handles back compat 
concerns.
To avoid clutter, putting them all in a separate directory seems like a good 
idea:
/conf/stopwords/stopwords_en.txt
/conf/stopwords/stopwords_fr.txt

Or will there be other per-language files?  If so, maybe
/conf/lang/stopwords_en.txt
/conf/lang/protected_en.txt
/conf/lang/synonyms_en.txt

As far as file format: I think we should also support the snowball stopword 
format.
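
For reference, a minimal sketch of what reading that format involves, assuming 
the usual snowball conventions: '|' starts a comment that runs to the end of 
the line, and a line may hold several whitespace-separated words. The class 
and method names here are hypothetical, not existing Solr or Lucene API 
(Lucene has its own parsers for this, as the description notes).

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Solr or Lucene. */
final class SnowballStopwords {
    private SnowballStopwords() {}

    /**
     * Parses snowball-style stopword text.
     * Assumes '|' begins a comment and words are separated by whitespace.
     */
    static Set<String> parse(String text) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : text.split("\r?\n")) {
            // Drop everything from the comment character onwards.
            int comment = line.indexOf('|');
            if (comment >= 0) {
                line = line.substring(0, comment);
            }
            // A line may carry zero, one, or several words.
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }
}
```

A plain one-word-per-line list with no '|' characters would pass through the 
same code unchanged, which is one argument for treating the format as an 
option on the existing loader rather than a separate code path.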

Not sure at this point if it makes more sense trying to put a text_fr, etc, in 
the normal schema.xml or in a separate schema_intl.xml.   Partly depends on the 
number of text_<lang> types and resource usage I guess... need to consider 
things like core load time, etc.
We may want to think about lazy-loaded analyzers (but that could be another 
ball of wax since misconfigurations don't immediately fail).
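
To make the trade-off concrete, here is a rough sketch of lazy loading in 
general, not of anything that exists in Solr today. LazyResource is a 
hypothetical name; the loader stands in for whatever builds an analyzer or 
reads a stopword file. Note that a broken loader only throws on the first 
get(), which is exactly the delayed-failure concern.

```java
import java.util.function.Supplier;

/** Hypothetical holder for illustration; not part of Solr or Lucene. */
final class LazyResource<T> {
    private final Supplier<T> loader;
    private volatile T value;

    LazyResource(Supplier<T> loader) {
        // Nothing is loaded here, so construction is cheap and cannot
        // surface a misconfiguration.
        this.loader = loader;
    }

    /** Loads the resource on first use; later calls return the cached value. */
    T get() {
        T result = value;
        if (result == null) {
            synchronized (this) {
                result = value;
                if (result == null) {
                    // Any configuration error is thrown here, at first use,
                    // rather than at core load time.
                    result = loader.get();
                    value = result;
                }
            }
        }
        return result;
    }
}
```

One way to soften the downside would be an optional validation pass at startup 
that calls get() on everything, at the cost of giving back the load-time 
savings.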

> improve stopwords list handling
> -------------------------------
>
>                 Key: SOLR-1860
>                 URL: https://issues.apache.org/jira/browse/SOLR-1860
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.1
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>
> Currently Solr makes it easy to use English stopwords for StopFilter or 
> CommonGramsFilter.
> Recently in Lucene, we added stopwords lists (mostly, but not all, from 
> snowball) to all the language analyzers.
> So it would be nice if a user could easily specify that they want to use a 
> French stopword list, and use it for StopFilter or CommonGrams.
> The ones from snowball, however, are formatted differently from the 
> others (although in Lucene we have parsers to deal with this).
> Additionally, we abstract this from Lucene users by adding a static 
> getDefaultStopSet to all analyzers.
> There are two approaches, the first one I think I prefer the most, but I'm 
> not sure it matters as long as we have good examples (maybe a foreign 
> language example schema?)
> 1. The user would specify something like:
>  <filter class="solr.StopFilterFactory" 
> fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../>
>  This would just grab the CharArraySet from the FrenchAnalyzer's 
> getDefaultStopSet method; who cares where it comes from or how it's loaded.
> 2. We add support for snowball-formatted stopwords lists, and the user could 
> specify something like:
> <filter class="solr.StopFilterFactory" 
> words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball" 
> ... />
> The disadvantage to this is they have to know where the list is, what format 
> it's in, etc. For example: snowball doesn't provide Romanian or Turkish
> stopword lists to go along with their stemmers, so we had to add our own.
> Let me know what you guys think, and I will create a patch.
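
For option 1 in the quoted description, the lookup could be sketched roughly 
as below. This only illustrates the reflection step under stated assumptions: 
DefaultStopSetLookup and StandInFrenchAnalyzer are hypothetical names, the 
stand-in class replaces a real Lucene analyzer so the example has no 
dependencies, and the fromAnalyzer attribute itself is still just a proposal.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical lookup for illustration; not existing Solr code. */
final class DefaultStopSetLookup {
    private DefaultStopSetLookup() {}

    /** Calls the static, no-argument getDefaultStopSet() on the named class. */
    static Set<?> fromAnalyzer(String className) {
        try {
            Method method = Class.forName(className).getMethod("getDefaultStopSet");
            if (!Modifier.isStatic(method.getModifiers())) {
                throw new IllegalArgumentException(className + ".getDefaultStopSet() is not static");
            }
            return (Set<?>) method.invoke(null);
        } catch (ReflectiveOperationException e) {
            // Covers a missing class, a missing method, and failures inside the method.
            throw new IllegalArgumentException("No usable default stop set on " + className, e);
        }
    }
}

/** Stand-in for a real analyzer class, so the sketch is self-contained. */
final class StandInFrenchAnalyzer {
    public static Set<String> getDefaultStopSet() {
        return new HashSet<>(Arrays.asList("au", "aux", "avec"));
    }
}
```

Failing with a clear message when the class or method is missing matters here, 
since the class name comes straight from user configuration.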

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
