Re: Should all 'static final' CharArray(Set|Map)s in stock Analyzers be "public" ?

Chris Hostetter Tue, 02 Jul 2024 10:12:22 -0700


: Should we keep the HOLDER.DEFAULT pattern to not create the default stop
: set if not needed (when there is a custom building)?


I did not mean to imply that i think we should eliminate the HOLDER
pattern/optimization -- i just didn't include it in my "(simplified)"
example, to try and focus on my core question: whether we should invert
the current "public get() returns DEFAULT" relationship to a "DEFAULT
initialized using public build()" relationship.
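
To make that concrete, here is a minimal sketch of how the lazy holder
could be kept while the relationship is inverted. It is not Lucene code:
SketchAnalyzer, its word list, and the buildCount counter are hypothetical,
and java.util.Set stands in for CharArraySet so the example compiles without
Lucene on the classpath. Note that Collections.unmodifiableSet gives a
stronger guarantee here than CharArraySet can, per the quoted discussion
below.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Stand-in for a stock analyzer; Set<String> substitutes for CharArraySet. */
final class SketchAnalyzer {
  /** Counts how often the default word list is built (illustration only). */
  static int buildCount = 0;

  /** Hypothetical public factory: every call returns a fresh, caller-owned set. */
  public static Set<String> buildDefaultStopSet() {
    buildCount++;
    return new HashSet<>(List.of("a", "an", "the"));
  }

  /** Lazy holder: the JVM initializes this class on first access to DEFAULT. */
  private static final class Holder {
    static final Set<String> DEFAULT =
        Collections.unmodifiableSet(buildDefaultStopSet());
  }

  private final Set<String> stopWords;

  /** Default constructor: the only code path that touches Holder.DEFAULT. */
  SketchAnalyzer() {
    this(Holder.DEFAULT);
  }

  /** Custom stop words: the shared default set is never built on this path. */
  SketchAnalyzer(Set<String> stopWords) {
    this.stopWords = stopWords;
  }

  boolean isStopWord(String term) {
    return stopWords.contains(term);
  }

  public static void main(String[] args) {
    // A caller may freely mutate its own copy...
    Set<String> mine = buildDefaultStopSet();
    mine.clear();
    // ...without affecting the set used by the default-constructed analyzer.
    System.out.println(new SketchAnalyzer().isStopWord("the"));
  }
}
```

In this arrangement a custom analyzer that calls buildDefaultStopSet() and
modifies the result only changes its own copy, and an analyzer constructed
with custom stop words never triggers initialization of the shared default.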


: > : There's also one other problem with those sets: Unfortunately they are
: > : modifiable, because they are not real "Set<String>" but CharArraySets.
: > : There is no 100% unmodifiable view of them. This was the main reason why
: > : we did not make them public for newer variants of analyzers. I think we
: > : should add unmodifiable "views" of CharArraySet, but this is also not 100%
: > : possible, as the underlying char[] cannot be protected.
: >
: > Interesting.
: >
: > That makes it seem like Lucene should probably deprecate (and remove in
: > 10) the 40+ existing "public static CharArraySet getDefaultSOMETHING()"
: > methods that are documented like this...
: >
: >   /**
: >    * Returns an unmodifiable instance of the default stop words set.
: >    *
: >    * @return default stop words set.
: >    */
: >   public static CharArraySet getDefaultStopSet() {
: >
: > ...but return direct access to a private static (not so) "unmodifiable"
: > CharArraySet, and replace them with methods like...
: >
: >   /**
: >    * Creates an instance of the default stop words set.
: >    *
: >    * @return default stop words set.
: >    */
: >   public static CharArraySet buildDefaultStopSet() {
: >
: > ...and invert the current (simplified) DEFAULT/getter logic from...
: >
: >   private static final CharArraySet DEFAULT;
: >   static {
: >     DEFAULT = WordlistLoader.getSnowballWordSet(...);
: >   }
: >   public static CharArraySet getDefaultStopSet() {
: >     return DEFAULT;
: >   }
: >
: > ...to something like...
: >
: >   private static final CharArraySet DEFAULT = buildDefaultStopSet();
: >   public static CharArraySet buildDefaultStopSet() {
: >     return WordlistLoader.getSnowballWordSet(...);
: >   }
: >
: > ... the net result for the Stock Analyzers would be the same, but that
: > would eliminate the risk that anyone calling buildDefaultStopSet() from
: > their custom Analyzer would break the stock Analyzer.  right?
: >
: >
: > : > I am fine with this. But on the other hand: Why do you want to
: > : > replicate the files into Solr's config folder? A Solr configuration
: > : > should better be able to load the stopwords file from resources, too.
: > : > I was always wondering why
: >
: > 1) We include copies of these files in the _default configset for the same
: > reason we copy the _default configset when you create a collection: To
: > help ensure backcompatibility of behavior even if the _default configset
: > (or resource files) change in a future version of Solr (or Lucene).  As
: > long as you continue to use your "old" copy of your configset, you should
: > (hopefully) get the consistent analysis behavior from your fieldtypes.
: > The "new" _default configset may have a "better" set of stopwords, or
: > stemming dictionary (or completely new list of token filters) for the
: > "new" field type definition with the same names as the field type you are
: > using, but it's your choice if/when to modify your schema to get that new
: > behavior.
: >
: >
: > 2) It's been a while since i personally tested it, but AFAIK the
: > SolrResourceLoader will still happily look for a resource file in the
: > classpath *after* it checks the configset -- if the configset did not
: > contain a file with that path.  So it should certainly be
: > possible for someone to put something like this in their schema...
: >
: > <filter name="stop" ignoreCase="true"
: > words="org/apache/lucene/analysis/snowball/irish_stop.txt"/>
: >
: > ...but ignoring the backcompat reasons mentioned above, telling a Solr
: > user they can point a <filter/> directly at a file path from a Lucene jar
: > doesn't help most of the situations I was asking about (plus a
: > few more), because in these cases the stock Lucene
: > Analyzers don't build the CharArraySet/CharArrayMap from a resource
: > file in the classpath.  They are built from hardcoded Java arrays in the
: > source code.
: >
: > If examples like the ones below were moved to resource files (and the FQN
: > of the "default" resource file was available via a "public static final
: > String") that would also certainly be an improvement in terms of
: > reusability in
: > custom subclasses (and Solr's ability to copy them on upgrade)...
: >
: >
: > // IrishAnalyzer
: > private static final CharArraySet DEFAULT_ARTICLES =
: >   CharArraySet.unmodifiableSet(new CharArraySet(Arrays.asList("d", "m", "b"), true));
: > private static final CharArraySet HYPHENATIONS =
: >   CharArraySet.unmodifiableSet(new CharArraySet(Arrays.asList("h", "n", "t"), true));
: >
: > // DutchAnalyzer
: > DEFAULT_STEM_DICT = new CharArrayMap<>(4, false);
: > DEFAULT_STEM_DICT.put("fiets", "fiets"); // otherwise fiet
: > DEFAULT_STEM_DICT.put("bromfiets", "bromfiets"); // otherwise bromfiet
: > DEFAULT_STEM_DICT.put("ei", "eier");
: > DEFAULT_STEM_DICT.put("kind", "kinder");
: >
: > // EnglishAnalyzer
: > final List<String> stopWords =
: >   Arrays.asList(
: >     "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is",
: >     "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there",
: >     "these", "they", "this", "to", "was", "will", "with");
: > final CharArraySet stopSet = new CharArraySet(stopWords, false);
: >
: > // FrenchAnalyzer, ItalianAnalyzer, CatalanAnalyzer, etc...
: >
: >
: >
: > -Hoss
: > http://www.lucidworks.com/
: >
: > ---------------------------------------------------------------------
: > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
: > For additional commands, e-mail: dev-h...@lucene.apache.org
: >
: >
: 
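
The resource-file idea in the quoted message (word lists moved out of
hardcoded Java arrays, with the name of the "default" resource exposed as a
public constant) could look roughly like the sketch below. Everything in it
is an assumption for illustration: the class, the constant
DEFAULT_ARTICLES_FILE, and the file name are hypothetical; java.util.Set
stands in for CharArraySet; and the parser only approximates the Snowball
word-list layout (whitespace-separated words, '|' starts a comment) rather
than calling WordlistLoader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

final class ResourceWordListSketch {
  /** Hypothetical constant: a resource name that subclasses or Solr could reuse. */
  public static final String DEFAULT_ARTICLES_FILE = "irish_articles.txt";

  /** Parses a word list: whitespace-separated words, '|' starts a comment. */
  static Set<String> parseWords(Reader in) {
    Set<String> words = new HashSet<>();
    try {
      BufferedReader reader = new BufferedReader(in);
      String line;
      while ((line = reader.readLine()) != null) {
        int comment = line.indexOf('|');
        if (comment >= 0) {
          line = line.substring(0, comment); // drop the trailing comment
        }
        for (String word : line.trim().split("\\s+")) {
          if (!word.isEmpty()) {
            words.add(word);
          }
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return words;
  }

  /** Loads a word list from the classpath, resolved relative to the given class. */
  static Set<String> loadFromClasspath(Class<?> anchor, String resource) {
    try (InputStream stream = anchor.getResourceAsStream(resource)) {
      if (stream == null) {
        throw new IllegalArgumentException("resource not found: " + resource);
      }
      return parseWords(new InputStreamReader(stream, StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With a public constant like this, a custom subclass could call
loadFromClasspath(SomeAnalyzer.class, DEFAULT_ARTICLES_FILE) to get its own
copy of the defaults, and a tool that copies word lists into a configset
would have a stable name to look up.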

-Hoss
http://www.lucidworks.com/