[ 
https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853679#action_12853679
 ] 

Hoss Man commented on SOLR-1860:
--------------------------------

bq. I like the idea of an export - it's transparent and neatly handles back 
compat concerns.

That's the same conclusion Robert and I came to on IRC ... being able to load 
directly sounds less redundant, but as soon as a user wants to customize (and 
let's face it: stop words can easily be domain specific) we need a way of 
exporting that's convenient even for novice users who don't know anything about 
jars and wars.

bq. Not sure at this point if it makes more sense trying to put a text_fr, etc, 
in the normal schema.xml or in a separate schema_intl.xml.

The idea Robert pitched on IRC was to create a new example solr-instance 
directory with a barebones solrconfig.xml file, and a schema.xml file that 
*only* demonstrated fields using various tricks for various languages.  All the 
language specific stopword files would then live in this new instancedir.  The 
idea being that people interested in non-English fields could find a 
"recommended" fieldtype declaration in this schema.xml file, and cut/paste it 
to their schema.xml (probably copied from the main example).

The key here being that we don't want an entire clone of the example (all the 
numeric fields, multiple request handler declarations, etc.).  This will 
just show the syntax for declaring all the various languages that we can provide 
suggestions for.

bq. As far as file format: I think we should also support the snowball stopword 
format.

Agreed, but it's a trivially minor chicken/egg choice.  Either we can set up a 
simple export and conversion to the format Solr currently supports, and 
if/when someone updates StopFilterFactory to support the new format, then we can 
stop converting when we export; or we can modify StopFilter to support both 
formats first, and then set up the simple export w/o worrying about conversion.
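
For the first option, the conversion step could look roughly like the sketch 
below.  It assumes the snowball format as Lucene's snowball parser reads it: 
"|" starts a comment that runs to the end of the line, and a line may hold 
several whitespace-separated words.  The class and method names 
(SnowballStopwordConverter, convert) are made up for illustration and are not 
part of Solr or Lucene; the output is one word per line, which is what the 
words file for StopFilterFactory already accepts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not an existing Solr/Lucene class.
public class SnowballStopwordConverter {

    /**
     * Converts snowball-formatted lines into a flat list with one
     * stopword per entry, dropping comments and blank lines.
     */
    public static List<String> convert(List<String> snowballLines) {
        List<String> words = new ArrayList<String>();
        for (String line : snowballLines) {
            // Everything after '|' is a comment in the snowball format.
            int comment = line.indexOf('|');
            if (comment >= 0) {
                line = line.substring(0, comment);
            }
            line = line.trim();
            if (line.isEmpty()) {
                continue; // comment-only or blank line
            }
            // A single line may carry more than one word.
            for (String word : line.split("\\s+")) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Illustrative sample input, not a real snowball list.
        List<String> sample = Arrays.asList(
            " | sample header comment",
            "au             |  a + le",
            "de du",
            "");
        for (String word : convert(sample)) {
            System.out.println(word);
        }
    }
}
```

An export script would read a snowball list, run it through something like 
convert(), and write the result into the instance's conf directory.  If 
StopFilterFactory later learns the snowball format, this step just goes away.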
 

Frankly: If Robert's planning on doing the work either way, I'm happy to let 
him decide which approach makes the most sense.


> improve stopwords list handling
> -------------------------------
>
>                 Key: SOLR-1860
>                 URL: https://issues.apache.org/jira/browse/SOLR-1860
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.1
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>
> Currently Solr makes it easy to use English stopwords for StopFilter or 
> CommonGramsFilter.
> Recently in Lucene, we added stopwords lists (mostly, but not all from 
> snowball) to all the language analyzers.
> So it would be nice if a user can easily specify that they want to use a 
> French stopword list, and use it for StopFilter or CommonGrams.
> The ones from snowball, however, are formatted in a different manner than the 
> others (although in Lucene we have parsers to deal with this).
> Additionally, we abstract this from Lucene users by adding a static 
> getDefaultStopSet to all analyzers.
> There are two approaches, the first one I think I prefer the most, but I'm 
> not sure it matters as long as we have good examples (maybe a foreign 
> language example schema?)
> 1. The user would specify something like:
>  <filter class="solr.StopFilterFactory" 
> fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../>
>  This would just grab the CharArraySet from the FrenchAnalyzer's 
> getDefaultStopSet method, who cares where it comes from or how it's loaded.
> 2. We add support for snowball-formatted stopwords lists, and the user could 
> specify something like:
> <filter class="solr.StopFilterFactory" 
> words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball" 
> ... />
> The disadvantage to this is they have to know where the list is, what format 
> it's in, etc. For example: snowball doesn't provide Romanian or Turkish
> stopword lists to go along with their stemmers, so we had to add our own.
> Let me know what you guys think, and I will create a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
