Re: Should all 'static final' CharArray(Set|Map)s in stock Analyzers be "public" ?

Bruno Roustant Tue, 02 Jul 2024 03:08:36 -0700

Should we keep the HOLDER.DEFAULT pattern, so that the default stop set is
not created when it is not needed (i.e. when a custom set is supplied)?
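
To make the question concrete, here is a minimal sketch of how the holder
could coexist with the build method proposed below. It is only an
illustration: SketchAnalyzer, Holder and BUILD_COUNT are hypothetical names,
and a plain Set<String> stands in for CharArraySet so the example runs
without Lucene on the classpath.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a stock analyzer; Set<String> replaces CharArraySet.
final class SketchAnalyzer {
  // Counts how often the word list is built, only to make laziness observable.
  static final AtomicInteger BUILD_COUNT = new AtomicInteger();

  /** Creates a fresh, caller-owned copy of the default stop words. */
  static Set<String> buildDefaultStopSet() {
    BUILD_COUNT.incrementAndGet();
    // Stand-in for WordlistLoader.getSnowballWordSet(...)
    return new HashSet<>(Set.of("a", "an", "the"));
  }

  // Initialization-on-demand holder: the JVM initializes this nested class
  // only when Holder.DEFAULT is first read.
  private static final class Holder {
    static final Set<String> DEFAULT = buildDefaultStopSet();
  }

  private final Set<String> stopWords;

  /** Default construction: the first call triggers the holder. */
  SketchAnalyzer() {
    this(Holder.DEFAULT);
  }

  /** Custom stop set: the holder is never touched. */
  SketchAnalyzer(Set<String> stopWords) {
    this.stopWords = stopWords;
  }

  boolean isStopWord(String term) {
    return stopWords.contains(term);
  }

  public static void main(String[] args) {
    new SketchAnalyzer(new HashSet<>(Set.of("foo")));
    System.out.println("builds after custom construction: " + BUILD_COUNT.get());
    Set<String> mine = buildDefaultStopSet();
    mine.clear(); // mutating a built copy cannot affect the stock default
    System.out.println("stock still has 'the': " + new SketchAnalyzer().isStopWord("the"));
  }
}
```

In this sketch the private default is built only on the first default
construction, and anything returned by buildDefaultStopSet() is a separate
instance, so a caller mutating it cannot break the stock analyzer.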


On Tue, Jul 2, 2024 at 01:45, Chris Hostetter <[email protected]> wrote:

>
> : There's also one other problem with those sets: Unfortunately they are
> : modifiable, because they are not real "Set<String>" but CharArraySets.
> : There is no 100% unmodifiable view of them. This was the main reason why
> : we did not make them public for newer variants of analyzers. I think we
> : should add unmodifiable "views" of CharArraySet, but this is also not 100%
> : possible, as the underlying char[] cannot be protected.
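
The char[] concern quoted above can be illustrated with plain java.util
collections. This is a sketch by analogy only: it does not use CharArraySet,
and ShallowUnmodifiableSketch is a hypothetical name. It shows that an
unmodifiable view rejects structural changes but cannot protect the arrays
it hands out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ShallowUnmodifiableSketch {
  /** Returns an unmodifiable view over entries stored as char[] (a stand-in for array-backed keys). */
  static List<char[]> unmodifiableWords() {
    List<char[]> words = new ArrayList<>();
    words.add("the".toCharArray());
    words.add("and".toCharArray());
    return Collections.unmodifiableList(words);
  }

  public static void main(String[] args) {
    List<char[]> view = unmodifiableWords();
    try {
      view.add("or".toCharArray()); // structural change: rejected by the view
    } catch (UnsupportedOperationException expected) {
      System.out.println("add() rejected");
    }
    view.get(0)[0] = 'x'; // element mutation: the view cannot prevent this
    System.out.println(new String(view.get(0)));
  }
}
```

Any wrapper that exposes its backing arrays has this gap; closing it would
require copying the array on every read.
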
>
> Interesting.
>
> That makes it seem like Lucene should probably deprecate (and remove in
> 10) the 40+ existing "public static CharArraySet getDefaultSOMETHING()"
> methods that are documented like this...
>
>   /**
>    * Returns an unmodifiable instance of the default stop words set.
>    *
>    * @return default stop words set.
>    */
>   public static CharArraySet getDefaultStopSet() {
>
> ...but return direct access to a private static (not so) "unmodifiable"
> CharArraySet, and replace them with methods like...
>
>   /**
>    * Creates an instance of the default stop words set.
>    *
>    * @return default stop words set.
>    */
>   public static CharArraySet buildDefaultStopSet() {
>
> ...and invert the current (simplified) DEFAULT/getter logic from...
>
>   private static class HOLDER {
>     static final CharArraySet DEFAULT;
>     static {
>       DEFAULT = WordlistLoader.getSnowballWordSet(...);
>     }
>   }
>   public static CharArraySet getDefaultStopSet() {
>     return HOLDER.DEFAULT;
>   }
>
> ...to something like...
>
>   private static final CharArraySet DEFAULT = buildDefaultStopSet();
>   public static CharArraySet buildDefaultStopSet() {
>     return WordlistLoader.getSnowballWordSet(...);
>   }
>
> ... the net result for the Stock Analyzers would be the same, but that
> would eliminate the risk that anyone calling buildDefaultStopSet() from
> their custom Analyzer would break the stock Analyzer.  right?
>
>
> : > I am fine with this. But on the other hand: Why do you want to
> : > replicate the files into Solr's config folder? A Solr configuration
> : > should better be able to load the stopwords file from resources, too.
> : > I was always wondering why
>
> 1) We include copies of these files in the _default configset for the same
> reason we copy the _default configset when you create a collection: To
> help ensure backcompatibility of behavior even if the _default configset
> (or resource files) change in a future version of Solr (or Lucene).  As
> long as you continue to use your "old" copy of your configset, you should
> (hopefully) get the consistent analysis behavior from your fieldtypes.
> The "new" _default configset may have a "better" set of stopwords, or
> stemming dictionary (or completely new list of token filters) for the
> "new" field type definition with the same names as the field type you are
> using, but it's your choice if/when to modify your schema to get that new
> behavior.
>
>
> 2) It's been a while since I personally tested it, but AFAIK the
> SolrResourceLoader will still happily look for a resource file in the
> classpath *after* it checks the configset -- if the configset did not
> contain a file with that path.  So it should certainly be
> possible for someone to put something like this in their schema...
>
> <filter name="stop" ignoreCase="true"
> words="org/apache/lucene/analysis/snowball/irish_stop.txt"/>
>
> ...but ignoring the backcompat reasons mentioned above, telling a Solr
> user they can point a <filter/> directly at a file path from a Lucene jar
> doesn't help most of the situations I was asking about (plus a
> few more), because in these cases the stock Lucene
> Analyzers don't build the CharArraySet/CharArrayMap from a resource
> file in the classpath.  They are built from hardcoded Java arrays in the
> source code.
>
> If examples like the ones below were moved to resource files (and the FQN
> of the "default" resource file was available via a "public static final
> String") that would also certainly be an improvement in terms of
> reusability in custom subclasses (and Solr's ability to copy them on
> upgrade)...
>
>
> // IrishAnalyzer
> private static final CharArraySet DEFAULT_ARTICLES =
>   CharArraySet.unmodifiableSet(new CharArraySet(Arrays.asList("d", "m", "b"), true));
> private static final CharArraySet HYPHENATIONS =
>   CharArraySet.unmodifiableSet(new CharArraySet(Arrays.asList("h", "n", "t"), true));
>
> // DutchAnalyzer
> DEFAULT_STEM_DICT = new CharArrayMap<>(4, false);
> DEFAULT_STEM_DICT.put("fiets", "fiets"); // otherwise fiet
> DEFAULT_STEM_DICT.put("bromfiets", "bromfiets"); // otherwise bromfiet
> DEFAULT_STEM_DICT.put("ei", "eier");
> DEFAULT_STEM_DICT.put("kind", "kinder");
>
> // EnglishAnalyzer
> final List<String> stopWords =
>   Arrays.asList(
>     "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is",
>     "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there",
>     "these", "they", "this", "to", "was", "will", "with");
> final CharArraySet stopSet = new CharArraySet(stopWords, false);
>
> // FrenchAnalyzer, ItalianAnalyzer, CatalanAnalyzer, etc...
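
A rough sketch of what the resource-file variant suggested above might look
like, with everything hypothetical: the class name, the
DEFAULT_ARTICLES_FILE constant and its path are invented for illustration,
the one-word-per-line format with "#" comments is an assumption, and a plain
Set<String> stands in for CharArraySet.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;

final class StopwordResourceSketch {
  // Hypothetical public constant naming the default resource; the path is invented.
  public static final String DEFAULT_ARTICLES_FILE = "org/example/analysis/ga/articles.txt";

  /** Reads one word per line, skipping blank lines and lines starting with '#'. */
  static Set<String> loadWords(Reader in) {
    Set<String> words = new LinkedHashSet<>();
    try (BufferedReader reader = new BufferedReader(in)) {
      String line;
      while ((line = reader.readLine()) != null) {
        String word = line.trim();
        if (!word.isEmpty() && !word.startsWith("#")) {
          words.add(word);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return words;
  }

  public static void main(String[] args) {
    // In-memory stand-in for the resource file contents.
    System.out.println(loadWords(new StringReader("# Irish articles\nd\nm\nb\n")));
  }
}
```

With a real resource on the classpath, the Reader would come from
ClassLoader.getResourceAsStream(DEFAULT_ARTICLES_FILE) wrapped in a UTF-8
InputStreamReader, and a custom subclass (or a Solr schema) could refer to
the same path through the public constant.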
>
>
>
> -Hoss
> http://www.lucidworks.com/
