Re: [PR] OPENNLP-660: Include list of stop words for various languages (opennlp)

via GitHub Thu, 21 May 2026 05:40:50 -0700


krickert commented on code in PR #1056:
URL: https://github.com/apache/opennlp/pull/1056#discussion_r3281188045



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/stopword/StopwordFilteringTokenizer.java:
##########
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.stopword;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import opennlp.tools.commons.ThreadSafe;
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.util.Span;
+
+/**
+ * A {@link Tokenizer} decorator which delegates tokenization to a wrapped
+ * {@link Tokenizer} and then removes any tokens identified as stopwords by
+ * the supplied {@link StopwordFilter}.
+ * <p>
+ * Both {@link #tokenize(String)} and {@link #tokenizePos(String)} apply the
+ * filter: in the latter case the {@link Span Spans} whose covered text is a
+ * stopword are dropped while the offsets of the remaining spans are kept
+ * intact (they continue to refer to positions in the original input string).
+ * <p>
+ * Instances are immutable and therefore safe for concurrent use provided that
+ * both the wrapped {@link Tokenizer} and the {@link StopwordFilter} are
+ * thread-safe. {@link DictionaryStopwordFilter} is unconditionally
+ * thread-safe; combined with a thread-safe delegate tokenizer
+ * (e.g. {@code SimpleTokenizer.INSTANCE}) the resulting decorator is
+ * thread-safe with no further synchronization required.
+ */
+@ThreadSafe
+public final class StopwordFilteringTokenizer implements Tokenizer {
+
+  private final Tokenizer delegate;
+  private final StopwordFilter filter;
+
+  /**
+   * Initializes a {@link StopwordFilteringTokenizer}.
+   *
+   * @param delegate The underlying {@link Tokenizer} that produces the raw
+   *                 tokens. Must not be {@code null}.
+   * @param filter   The {@link StopwordFilter} which decides whether a token
+   *                 is a stopword. Must not be {@code null}.
+   * @throws IllegalArgumentException if {@code delegate} or {@code filter} is
+   *                                  {@code null}.
+   */
+  public StopwordFilteringTokenizer(final Tokenizer delegate, final StopwordFilter filter) {
+    if (delegate == null) {
+      throw new IllegalArgumentException("delegate must not be null");
+    }
+    if (filter == null) {
+      throw new IllegalArgumentException("filter must not be null");
+    }
+    this.delegate = delegate;
+    this.filter = filter;
+  }
+
+  /**
+   * Tokenizes the supplied string with the wrapped {@link Tokenizer} and then
+   * removes any tokens which the {@link StopwordFilter} considers a stopword.
+   *
+   * @param s The string to be tokenized.
+   * @return  The remaining tokens in their original order.
+   */
+  @Override
+  public String[] tokenize(final String s) {
+    return filter.filter(delegate.tokenize(s));
+  }
+
+  /**
+   * Computes token spans with the wrapped {@link Tokenizer} and then drops
+   * any span whose covered text is a stopword according to the
+   * {@link StopwordFilter}. The relative order and the offsets of the
+   * surviving spans are preserved.
+   *
+   * @param s The string to be tokenized.
+   * @return  The remaining {@link Span Spans} in their original order.
+   */
+  @Override
+  public Span[] tokenizePos(final String s) {

Review Comment:
   tokenize(String) delegates to filter(), so multi-word stopword entries are
   handled by the longest-match scan. tokenizePos(String) only calls isStopword
   on each span's covered text, which is effectively single-token matching.

   If a custom list includes multi-word entries, those phrases will not be
   removed by tokenizePos, even though tokenize and StopwordFilterStream would
   drop them. Either apply the same window logic here (adjusting offsets if
   needed), or document clearly that span-based tokenization only supports
   single-token (1-gram) stopwords.

   Bundled lists are mostly one token per line, so the impact on the default
   resources is probably low, but tokenizePos currently behaves differently
   from the other two code paths.
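
   To illustrate the first option, here is a rough, self-contained sketch of a
   longest-match window scan over spans. It is not code from this PR: the class
   name, the nested Span stand-in (used instead of opennlp.tools.util.Span), the
   maxWindow parameter, and the matching convention (covered texts joined by a
   single space and lower-cased) are all assumptions made for illustration. The
   real filter() may normalize and match differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of phrase-aware span filtering; not part of the PR. */
final class SpanPhraseFilterSketch {

  /** Minimal stand-in for opennlp.tools.util.Span: [start, end) offsets into the input. */
  static final class Span {
    final int start;
    final int end;

    Span(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /**
   * Drops every run of consecutive spans whose covered texts, joined by single
   * spaces and lower-cased, form an entry in {@code stopwords}. At each position
   * the longest window (up to {@code maxWindow} spans) is tried first.
   * Surviving spans are returned unchanged, so their offsets still refer to
   * positions in {@code text}.
   */
  static List<Span> dropStopwordSpans(String text, List<Span> spans,
                                      Set<String> stopwords, int maxWindow) {
    List<Span> kept = new ArrayList<>();
    int i = 0;
    while (i < spans.size()) {
      int matched = 0;
      // Try the widest window first so that phrases win over single tokens.
      for (int w = Math.min(maxWindow, spans.size() - i); w >= 1 && matched == 0; w--) {
        StringBuilder phrase = new StringBuilder();
        for (int k = i; k < i + w; k++) {
          if (k > i) {
            phrase.append(' ');
          }
          Span sp = spans.get(k);
          phrase.append(text, sp.start, sp.end);
        }
        if (stopwords.contains(phrase.toString().toLowerCase(Locale.ROOT))) {
          matched = w;
        }
      }
      if (matched > 0) {
        i += matched;            // skip the whole matched phrase
      } else {
        kept.add(spans.get(i));  // not a stopword: keep the span as-is
        i++;
      }
    }
    return kept;
  }
}
```

   With the entries "as well as" and "the" and a window of 3, the input
   "as well as the cat" would keep only the span covering "cat", with its
   original offsets. Because dropped spans are simply omitted, no offset
   adjustment is needed in this variant.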



