krickert commented on code in PR #1056:
URL: https://github.com/apache/opennlp/pull/1056#discussion_r3281282120


##########
opennlp-core/opennlp-cli/src/main/java/opennlp/tools/cmdline/stopword/StopwordFilterTool.java:
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.cmdline.stopword;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.PrintWriter;
+import java.nio.charset.StandardCharsets;
+
+import opennlp.tools.cmdline.BasicCmdLineTool;
+import opennlp.tools.cmdline.CLI;
+import opennlp.tools.cmdline.TerminateToolException;
+import opennlp.tools.stopword.StopwordFilter;
+import opennlp.tools.stopword.StopwordLists;
+
+/**
+ * A command line tool that filters stop words from whitespace-separated
+ * tokens read on standard input and prints the kept tokens to standard
+ * output, one input line per output line.
+ *
+ * <p>Usage: {@code opennlp StopwordFilter <lang>}, where {@code <lang>}
+ * is an ISO 639 language code matching one of the bundled lists.
+ */
+public final class StopwordFilterTool extends BasicCmdLineTool {
+
+  @Override
+  public String getShortDescription() {
+    return "filters stop words from tokens read on stdin";
+  }
+
+  @Override
+  public String getHelp() {
+    return "Usage: " + CLI.CMD + " " + getName() + " <lang>\n"
+        + "  <lang> ISO 639 code; supported: " + StopwordLists.supportedLanguages();
+  }
+
+  @Override
+  public boolean hasParams() {
+    return true;
+  }
+
+  @Override
+  public void run(final String[] args) {
+    if (args.length != 1) {
+      System.out.println(getHelp());
+      return;
+    }
+
+    final StopwordFilter filter = StopwordLists.forLanguage(args[0]);

Review Comment:
   The docs describe loading custom lists via `StopwordLists.load`, but the CLI
accepts only a bundled `<lang>` code. I think the CLI should offer the same
capability: let users pass a file path or pipe a list on stdin, similar to
other OpenNLP tools. Thoughts?



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/stopword/DictionaryStopwordFilter.java:
##########
@@ -0,0 +1,422 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.stopword;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+import java.io.UncheckedIOException;
+import java.nio.charset.Charset;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.List;
+import java.util.Set;
+
+import opennlp.tools.commons.ThreadSafe;
+import opennlp.tools.dictionary.Dictionary;
+import opennlp.tools.util.StringList;
+
+/**
+ * An immutable, thread-safe {@link StopwordFilter} backed by an OpenNLP
+ * {@link Dictionary}.
+ * <p>
+ * The backing store supports both 1-gram and n-gram entries. Multi-word
+ * entries are queried via {@link #isStopword(String...)}; the
+ * {@link #filter(String[])} method performs a greedy left-to-right window
+ * scan, preferring the longest registered match at each position.
+ * <p>
+ * Instances are constructed once and never modified afterwards. Use the
+ * {@link Builder} ({@link #builder()}) to assemble a filter from one or
+ * more sources (programmatic entries, an input stream, an existing
+ * {@link Dictionary}), or the public constructors for the common cases.
+ * <p>
+ * <strong>Thread-safety:</strong> instances are immutable after
+ * construction and may be shared freely across threads without external
+ * synchronization. All fields are {@code final}; the only mutation of the
+ * backing {@link Dictionary} happens inside the constructor / builder before
+ * the instance is published.
+ */
+@ThreadSafe
+public final class DictionaryStopwordFilter implements StopwordFilter {
+
+  private static final String COMMENT_PREFIX = "#";
+
+  private final Dictionary backing;
+
+  /**
+   * Loads a stopword list from the given input stream and freezes it into
+   * an immutable filter.
+   * <p>
+   * Format: UTF-8 (or the supplied {@link Charset}), one entry per line.
+   * Whitespace-separated tokens on the same line form one multi-word entry.
+   * Blank lines and lines starting with {@code #} are skipped.
+   *
+   * @param in The input stream to read from. Must not be {@code null}.
+   * @param cs The {@link Charset} to decode with. Must not be {@code null}.
+   * @param caseSensitive Whether matching is case-sensitive.
+   * @throws IllegalArgumentException if {@code in} or {@code cs} is
+   *     {@code null}.
+   * @throws IOException Thrown if an IO error occurs while reading.
+   */
+  public DictionaryStopwordFilter(final InputStream in, final Charset cs,
+                                  final boolean caseSensitive) throws IOException {
+    if (in == null) {
+      throw new IllegalArgumentException("in must not be null");
+    }
+    if (cs == null) {
+      throw new IllegalArgumentException("cs must not be null");
+    }
+    this.backing = parseStream(in, cs, caseSensitive);
+  }
+
+  /**
+   * Creates an immutable filter from a defensive copy of {@code source}.
+   * Subsequent mutation of {@code source} does not affect this filter.
+   *
+   * @param source The dictionary whose contents seed the filter. Must not
+   *     be {@code null}.
+   * @throws IllegalArgumentException if {@code source} is {@code null}.
+   */
+  public DictionaryStopwordFilter(final Dictionary source) {
+    if (source == null) {
+      throw new IllegalArgumentException("source must not be null");
+    }
+    final Dictionary copy = new Dictionary(source.isCaseSensitive());
+    for (final StringList entry : source) {
+      copy.put(entry);
+    }
+    this.backing = copy;
+  }
+
+  private DictionaryStopwordFilter(final Dictionary internal, final boolean owned) {

Review Comment:
   Small cleanup: the private constructor takes a `boolean owned` parameter
that is never used. We should drop it.
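
   One caveat, offered as an assumption since the constructor body is not
visible in this hunk: if the flag exists only to distinguish this overload from
the public `DictionaryStopwordFilter(Dictionary)` constructor, removing it
outright would make the two signatures clash. A private marker type is one way
to keep the no-copy path without a dead parameter. The sketch below uses a
plain `Set<String>` in place of `Dictionary`; `WordFilterSketch`, `Owned`, and
`adopt` are hypothetical names.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; a Set<String> stands in for the backing Dictionary.
final class WordFilterSketch {

  // Private wrapper marking a set that this class already owns.
  private static final class Owned {
    final Set<String> words;

    Owned(final Set<String> words) {
      this.words = words;
    }
  }

  private final Set<String> backing;

  // Public path: takes a defensive copy, so later changes to source are not seen.
  public WordFilterSketch(final Set<String> source) {
    this(new Owned(new HashSet<>(source)));
  }

  // Private path: adopts the set without copying; the wrapper type replaces the flag.
  private WordFilterSketch(final Owned owned) {
    this.backing = Collections.unmodifiableSet(owned.words);
  }

  // Internal entry point (e.g. for a builder) that hands over an already-private set.
  static WordFilterSketch adopt(final Set<String> internal) {
    return new WordFilterSketch(new Owned(internal));
  }

  boolean isStopword(final String token) {
    return backing.contains(token);
  }
}
```

A private static factory would work as well; the wrapper just keeps both
constructors in place.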



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

