: There's also one other problem with those sets: Unfortunately they are
: modifiable, because they are not real "Set<String>" but CharArraySets. There
: is no 100% unmodifiable view of them. This was the main reason why we did not
: make them public for newer variants of analyzers. I think we should add
: unmodifiable "views" of CharArraySet, but this is also not 100% possible, as
: the underlying char[] cannot be protected.

Interesting. 

That makes it seem like Lucene should probably deprecate (and remove in 
10) the 40+ existing "public static CharArraySet getDefaultSOMETHING()" 
methods that are documented like this...

  /**
   * Returns an unmodifiable instance of the default stop words set.
   *
   * @return default stop words set.
   */
  public static CharArraySet getDefaultStopSet() {

...but return direct access to a private static (not so) "unmodifiable" 
CharArraySet, and replace them with methods like...

  /**
   * Creates an instance of the default stop words set.
   *
   * @return default stop words set.
   */
  public static CharArraySet buildDefaultStopSet() {

...and invert the current (simplified) DEFAULT/getter logic from...

  private static final CharArraySet DEFAULT;
  static {
    DEFAULT = WordlistLoader.getSnowballWordSet(...);
  }
  public static CharArraySet getDefaultStopSet() {
    return DEFAULT;
  }

...to something like...

  private static final CharArraySet DEFAULT = buildDefaultStopSet();
  public static CharArraySet buildDefaultStopSet() {
    return WordlistLoader.getSnowballWordSet(...);
  }

...the net result for the stock Analyzers would be the same, but it 
would eliminate the risk that anyone calling buildDefaultStopSet() from 
their custom Analyzer could break the stock Analyzer.  Right?
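
To make the difference concrete, here is a minimal sketch of the "build" 
variant.  It is not Lucene code: java.util.HashSet stands in for 
CharArraySet so it runs without Lucene on the classpath, and 
StockAnalyzerSketch, WORDS and isStopWord are hypothetical names.  Only 
buildDefaultStopSet() comes from the proposal above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a stock analyzer. HashSet replaces CharArraySet
// (an assumption made so the sketch has no Lucene dependency).
final class StockAnalyzerSketch {
  // Hypothetical word list standing in for WordlistLoader.getSnowballWordSet(...).
  private static final List<String> WORDS = List.of("a", "an", "the");

  // Private copy used only by the stock analyzer; it is never handed out.
  private static final Set<String> DEFAULT = buildDefaultStopSet();

  // Every call returns a fresh, independent set that the caller owns.
  public static Set<String> buildDefaultStopSet() {
    return new HashSet<>(WORDS);
  }

  // What the stock analyzer itself would consult.
  static boolean isStopWord(String term) {
    return DEFAULT.contains(term);
  }
}

final class BuildVsGetDemo {
  public static void main(String[] args) {
    // A custom analyzer takes its own copy and customises it.
    Set<String> mine = StockAnalyzerSketch.buildDefaultStopSet();
    mine.remove("the");
    mine.add("foo");

    // The stock analyzer's private set is unaffected by those changes.
    System.out.println(StockAnalyzerSketch.isStopWord("the")); // true
    System.out.println(StockAnalyzerSketch.isStopWord("foo")); // false
  }
}
```

The cost is one extra set per caller, which seems cheap compared to a 
custom Analyzer being able to corrupt shared static state.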


: > I am fine with this. But on the other hand: Why do you want to replicate the
: > files into Solr's config folder? A Solr configuration should better be able
: > to load the stopwords file from resources, too. I was always wondering why

1) We include copies of these files in the _default configset for the same 
reason we copy the _default configset when you create a collection: to 
help ensure backcompatibility of behavior even if the _default configset 
(or resource files) change in a future version of Solr (or Lucene).  As 
long as you continue to use your "old" copy of your configset, you should 
(hopefully) get consistent analysis behavior from your fieldtypes.  
The "new" _default configset may have a "better" set of stopwords, or a 
better stemming dictionary (or a completely new list of token filters), 
in a "new" field type definition with the same name as the field type you 
are using, but it's your choice if/when to modify your schema to get that 
new behavior.


2) It's been a while since I personally tested it, but AFAIK the 
SolrResourceLoader will still happily look for a resource file in the 
classpath *after* it checks the configset -- if the configset did not 
contain a file with that path.  So it should certainly be 
possible for someone to put something like this in their schema...

<filter name="stop" ignoreCase="true" 
words="org/apache/lucene/analysis/snowball/irish_stop.txt"/>

...but ignoring the backcompat reasons mentioned above, telling a Solr 
user they can point a <filter/> directly at a file path from a Lucene jar 
doesn't help most of the situations I was asking about (plus a 
few more), because in these cases the stock Lucene 
Analyzers don't build the CharArraySet/CharArrayMap from a resource 
file in the classpath.  They are built from hardcoded Java arrays in the 
source code.  

If examples like the ones below were moved to resource files (and the FQN 
of the "default" resource file was available via a "public static final 
String") that would also certainly be an improvement in terms of reusability in 
custom subclasses (and Solr's ability to copy them on upgrade)...


// IrishAnalyzer
private static final CharArraySet DEFAULT_ARTICLES =
  CharArraySet.unmodifiableSet(new CharArraySet(Arrays.asList("d", "m", "b"), true));
private static final CharArraySet HYPHENATIONS =
  CharArraySet.unmodifiableSet(new CharArraySet(Arrays.asList("h", "n", "t"), true));

// DutchAnalyzer
DEFAULT_STEM_DICT = new CharArrayMap<>(4, false);
DEFAULT_STEM_DICT.put("fiets", "fiets"); // otherwise fiet
DEFAULT_STEM_DICT.put("bromfiets", "bromfiets"); // otherwise bromfiet
DEFAULT_STEM_DICT.put("ei", "eier");
DEFAULT_STEM_DICT.put("kind", "kinder");

// EnglishAnalyzer
final List<String> stopWords =
  Arrays.asList(
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is",
    "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there",
    "these", "they", "this", "to", "was", "will", "with");
final CharArraySet stopSet = new CharArraySet(stopWords, false);

// FrenchAnalyzer, ItalianAnalyzer, CatalanAnalyzer, etc...
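
As a rough sketch of what the resource-file approach could look like for 
one of these hardcoded lists: the class name, the resource path, and the 
file format (one word per line, '#' comments) are all assumptions for 
illustration, not anything that exists in Lucene today.  Plain 
java.util.Set is used instead of CharArraySet so it runs standalone.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

final class ResourceWordListSketch {
  // Hypothetical resource path, exposed so subclasses (or Solr) can find the file.
  public static final String DEFAULT_ARTICLES_FILE = "org/example/analysis/articles.txt";

  // Assumed format: one word per line; blank lines and '#' comment lines are skipped.
  static Set<String> readWords(Reader in) {
    Set<String> words = new LinkedHashSet<>();
    try {
      BufferedReader reader = new BufferedReader(in);
      String line;
      while ((line = reader.readLine()) != null) {
        String word = line.trim();
        if (!word.isEmpty() && !word.startsWith("#")) {
          words.add(word);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return words;
  }

  // Loads a word list from the classpath, failing loudly if the file is missing.
  static Set<String> loadFromClasspath(String resource) {
    InputStream stream =
        ResourceWordListSketch.class.getClassLoader().getResourceAsStream(resource);
    if (stream == null) {
      throw new IllegalArgumentException("resource not found: " + resource);
    }
    try (Reader reader = new InputStreamReader(stream, StandardCharsets.UTF_8)) {
      return readWords(reader);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // In-memory content mirroring the hardcoded IrishAnalyzer articles above.
    Set<String> articles = readWords(new StringReader("# default articles\nd\nm\nb\n"));
    System.out.println(articles);
  }
}
```

With the path published as a constant, a custom subclass could call 
something like loadFromClasspath(DEFAULT_ARTICLES_FILE) to get its own 
copy, and Solr could copy the same file into a configset.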



-Hoss
http://www.lucidworks.com/
