Re: [PR] OPENNLP-1864: Per-language NormalizationProfile registry (2c/7) (opennlp)

via GitHub Fri, 03 Jul 2026 15:51:12 -0700


krickert commented on code in PR #1112:
URL: https://github.com/apache/opennlp/pull/1112#discussion_r3522090076



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NormalizationProfile.java:
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.normalizer;
+
+import opennlp.tools.stemmer.Stemmer;
+import opennlp.tools.stemmer.snowball.SnowballStemmer;
+
+/**
+ * Per-language normalization settings, mirroring how OpenNLP already selects a Snowball stemmer by
+ * language. A profile pairs a language with its Snowball {@link SnowballStemmer.ALGORITHM} and the
+ * diacritic fold appropriate for that language (if any).
+ *
+ * <p>The {@code accentFold} normalizer is the language's diacritic transform for a matching form, or
+ * {@code null} when folding is not appropriate. It is the generic
+ * {@link AccentFoldCharSequenceNormalizer} for English and the major Romance languages (where
+ * accented letters are matching variants of their base letter), the German-specific
+ * {@link GermanUmlautCharSequenceNormalizer} (a-umlaut to {@code ae}, eszett to {@code ss}, ...) for
+ * German, and {@code null} where diacritics mark distinct letters (the Nordic languages and the
+ * non-Latin scripts), because folding there is language-wrong. This is a search-recall choice, not a
+ * statement of linguistic correctness; callers can build a {@link TermAnalyzer} directly to
+ * override it.</p>
+ *
+ * @param language         The language, as an ISO 639-3 code (for example {@code "eng"}).
+ * @param stemmerAlgorithm The Snowball algorithm for the language.
+ * @param accentFold       The diacritic fold for the language, or {@code null} for none.
+ */
+public record NormalizationProfile(String language, SnowballStemmer.ALGORITHM stemmerAlgorithm,

Review Comment:
   Done. Added a compact constructor that rejects a null or blank language and 
a null stemmerAlgorithm; accentFold stays nullable by design. Test: 
testProfileRejectsInvalidComponents. (19ae614d)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
