Should we keep the HOLDER.DEFAULT pattern, so that the default stop set is not created when it is not needed (i.e. when a custom stop set is supplied)?
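To make the question concrete, here is a minimal sketch of how a lazy holder could coexist with the proposed buildDefaultStopSet(). It is not Lucene code: SketchAnalyzer, DefaultSetHolder and the builds counter are hypothetical names, and a plain HashSet stands in for CharArraySet and for the WordlistLoader call.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical analyzer; a plain HashSet stands in for Lucene's CharArraySet. */
final class SketchAnalyzer {
  /** Counts how often the default set is built; for illustration only. */
  static int builds = 0;

  /** Always returns a fresh, caller-owned set (stand-in for the WordlistLoader call). */
  static Set<String> buildDefaultStopSet() {
    builds++;
    return new HashSet<>(Arrays.asList("a", "an", "the"));
  }

  /** Initialization-on-demand holder: DEFAULT is built only when the holder is first used. */
  private static final class DefaultSetHolder {
    static final Set<String> DEFAULT = buildDefaultStopSet();
  }

  private final Set<String> stopwords;

  /** Stock behavior: touches the holder, which triggers the one-time build. */
  SketchAnalyzer() {
    this(DefaultSetHolder.DEFAULT);
  }

  /** Custom stop set: the holder class is never initialized on this path. */
  SketchAnalyzer(Set<String> stopwords) {
    this.stopwords = stopwords;
  }

  Set<String> stopwords() {
    return stopwords;
  }
}
```

With this shape, constructing an analyzer with a custom set never builds the default, and a caller who modifies the result of buildDefaultStopSet() cannot affect the set used by the stock constructor.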
On Tue, Jul 2, 2024 at 01:45, Chris Hostetter <hossman_luc...@fucit.org> wrote:

>
> : There's also one other problem with those sets: Unfortunately they are
> : modifiable, because they are not real "Set<String>" but CharArraySets. There
> : is no 100% unmodifiable view of them. This was the main reason why we did not
> : make them public for newer variants of analyzers. I think we should add
> : unmodifiable "views" of CharArraySet, but this is also not 100% possible, as
> : the underlying char[] cannot be protected.
>
> Interesting.
>
> That makes it seem like Lucene should probably deprecate (and remove in
> 10) the 40+ existing "public static CharArraySet getDefaultSOMETHING()"
> methods that are documented like this...
>
>   /**
>    * Returns an unmodifiable instance of the default stop words set.
>    *
>    * @return default stop words set.
>    */
>   public static CharArraySet getDefaultStopSet() {
>
> ...but return direct access to a private static (not so) "unmodifiable"
> CharArraySet, and replace them with methods like...
>
>   /**
>    * Creates an instance of the default stop words set.
>    *
>    * @return default stop words set.
>    */
>   public static CharArraySet buildDefaultStopSet() {
>
> ...and invert the current (simplified) DEFAULT/getter logic from...
>
>   private static final CharArraySet DEFAULT;
>   static {
>     DEFAULT = WordlistLoader.getSnowballWordSet(...);
>   }
>   public static CharArraySet getDefaultStopSet() {
>     return HOLDER.DEFAULT;
>   }
>
> ...to something like...
>
>   private static final CharArraySet DEFAULT = buildDefaultStopSet();
>   public static CharArraySet buildDefaultStopSet() {
>     return WordlistLoader.getSnowballWordSet(...);
>   }
>
> ... the net result for the stock Analyzers would be the same, but that
> would eliminate the risk that anyone calling buildDefaultStopSet() from
> their custom Analyzer would break the stock Analyzer. Right?
>
>
> : > I am fine with this.
> : > But on the other hand: Why do you want to replicate the
> : > files into Solr's config folder? A Solr configuration should better be able
> : > to load the stopwords file from resources, too. I was always wondering why
>
> 1) We include copies of these files in the _default configset for the same
> reason we copy the _default configset when you create a collection: To
> help ensure backcompatibility of behavior even if the _default configset
> (or resource files) change in a future version of Solr (or Lucene). As
> long as you continue to use your "old" copy of your configset, you should
> (hopefully) get consistent analysis behavior from your fieldtypes.
> The "new" _default configset may have a "better" set of stopwords, or
> stemming dictionary (or completely new list of token filters) for the
> "new" field type definition with the same names as the field type you are
> using, but it's your choice if/when to modify your schema to get that new
> behavior.
>
>
> 2) It's been a while since I personally tested it, but AFAIK the
> SolrResourceLoader will still happily look for a resource file in the
> classpath *after* it checks the configset -- if the configset did not
> contain a file with that path. So it should certainly be
> possible for someone to put something like this in their schema...
>
>   <filter name="stop" ignoreCase="true"
>           words="org/apache/lucene/analysis/snowball/irish_stop.txt"/>
>
> ...but ignoring the backcompat reasons mentioned above, telling a Solr
> user they can point a <filter/> directly at a file path from a Lucene jar
> doesn't help most of the situations I was asking about (plus a
> few more), because in these cases the stock Lucene
> Analyzers don't build the CharArraySet/CharArrayMap from a resource
> file in the classpath. They are built from hardcoded Java arrays in the
> source code.
>
> If examples like the ones below were moved to resource files (and the FQN
> of the "default" resource file was available via a "public static final
> String") that would also certainly be an improvement in terms of
> reusability in custom subclasses (and Solr's ability to copy them on upgrade)...
>
>
>   // IrishAnalyzer
>   private static final CharArraySet DEFAULT_ARTICLES =
>       CharArraySet.unmodifiableSet(new CharArraySet(Arrays.asList("d", "m", "b"), true));
>   private static final CharArraySet HYPHENATIONS =
>       CharArraySet.unmodifiableSet(new CharArraySet(Arrays.asList("h", "n", "t"), true));
>
>   // DutchAnalyzer
>   DEFAULT_STEM_DICT = new CharArrayMap<>(4, false);
>   DEFAULT_STEM_DICT.put("fiets", "fiets"); // otherwise fiet
>   DEFAULT_STEM_DICT.put("bromfiets", "bromfiets"); // otherwise bromfiet
>   DEFAULT_STEM_DICT.put("ei", "eier");
>   DEFAULT_STEM_DICT.put("kind", "kinder");
>
>   // EnglishAnalyzer
>   final List<String> stopWords =
>       Arrays.asList(
>           "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is",
>           "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there",
>           "these", "they", "this", "to", "was", "will", "with");
>   final CharArraySet stopSet = new CharArraySet(stopWords, false);
>
>   // FrenchAnalyzer, ItalianAnalyzer, CatalanAnalyzer, etc...
>
>
>
> -Hoss
> http://www.lucidworks.com/
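
For the resource-file idea quoted above, a rough sketch of what "defaults in a resource file plus a public constant for its path" might look like. Everything here is an assumption for illustration: the class name, the resource path, and the one-word-per-line format with '#' comments are all made up, and a LinkedHashSet stands in for CharArraySet.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper; not part of Lucene or Solr. */
final class ResourceStopwordsSketch {
  /** Hypothetical classpath location that subclasses (or Solr) could reuse or copy. */
  public static final String DEFAULT_STOPWORD_FILE = "example/english_stop.txt";

  /** Parses one word per line, skipping blank lines and lines starting with '#'. */
  static Set<String> loadWordSet(Reader reader) {
    Set<String> words = new LinkedHashSet<>();
    try (BufferedReader in = new BufferedReader(reader)) {
      String line;
      while ((line = in.readLine()) != null) {
        String word = line.trim();
        if (!word.isEmpty() && !word.startsWith("#")) {
          words.add(word);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return words;
  }

  /** Builds a fresh default set from the classpath resource on every call. */
  static Set<String> buildDefaultStopSet() {
    InputStream in =
        ResourceStopwordsSketch.class.getClassLoader().getResourceAsStream(DEFAULT_STOPWORD_FILE);
    if (in == null) {
      throw new UncheckedIOException(new FileNotFoundException(DEFAULT_STOPWORD_FILE));
    }
    return loadWordSet(new InputStreamReader(in, StandardCharsets.UTF_8));
  }
}
```

A custom subclass could then call loadWordSet() on its own file, or refer to DEFAULT_STOPWORD_FILE to locate the stock list, without touching any shared static set.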