krickert commented on code in PR #1112: URL: https://github.com/apache/opennlp/pull/1112#discussion_r3522090076
########## opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NormalizationProfile.java: ########## @@ -0,0 +1,66 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.util.normalizer; + +import opennlp.tools.stemmer.Stemmer; +import opennlp.tools.stemmer.snowball.SnowballStemmer; + +/** + * Per-language normalization settings, mirroring how OpenNLP already selects a Snowball stemmer by + * language. A profile pairs a language with its Snowball {@link SnowballStemmer.ALGORITHM} and the + * diacritic fold appropriate for that language (if any). + * + * <p>The {@code accentFold} normalizer is the language's diacritic transform for a matching form, or + * {@code null} when folding is not appropriate. It is the generic + * {@link AccentFoldCharSequenceNormalizer} for English and the major Romance languages (where + * accented letters are matching variants of their base letter), the German-specific + * {@link GermanUmlautCharSequenceNormalizer} (a-umlaut to {@code ae}, eszett to {@code ss}, ...) for + * German, and {@code null} where diacritics mark distinct letters (the Nordic languages and the + * non-Latin scripts), because folding there is language-wrong. This is a search-recall choice, not a + * statement of linguistic correctness; callers can build a {@link TermAnalyzer} directly to + * override it.</p> + * + * @param language The language, as an ISO 639-3 code (for example {@code "eng"}). + * @param stemmerAlgorithm The Snowball algorithm for the language. + * @param accentFold The diacritic fold for the language, or {@code null} for none. + */ +public record NormalizationProfile(String language, SnowballStemmer.ALGORITHM stemmerAlgorithm, Review Comment: Done. Added a compact constructor that rejects a null or blank language and a null stemmerAlgorithm; accentFold stays nullable by design. Test: testProfileRejectsInvalidComponents. (19ae614d) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
