Re: [PR] OPENNLP-53: Add Parse.createFromTokens() for convenient tokenized input (opennlp)

via GitHub Mon, 06 Apr 2026 21:35:50 -0700


mawiesne commented on code in PR #1012:
URL: https://github.com/apache/opennlp/pull/1012#discussion_r3042868169



##########
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/parser/ParseTest.java:
##########
@@ -127,4 +129,51 @@ void testGetTagNodes() {
     Assertions.assertEquals("NN", tags[16].getType());
     Assertions.assertEquals(".", tags[17].getType());
   }
+
+  @Test
+  void testCreateFromTokens() {
+    String[] tokens = {"The", "cat", "sat", "on", "the", "mat"};
+    Parse p = Parse.createFromTokens(tokens);
+
+    // Verify text is space-joined
+    Assertions.assertEquals("The cat sat on the mat", p.getText());
+
+    // Verify root span covers full text
+    Assertions.assertEquals(new Span(0, 22), p.getSpan());
+
+    // Verify root type is INC
+    Assertions.assertEquals(Parser.INC_NODE, p.getType());
+
+    // Verify token children
+    Parse[] children = p.getChildren();
+    Assertions.assertEquals(tokens.length, children.length);
+
+    int start = 0;
+    for (int i = 0; i < tokens.length; i++) {
+      Assertions.assertEquals(Parser.TOK_NODE, children[i].getType());
+      Assertions.assertEquals(new Span(start, start + tokens[i].length()), 
children[i].getSpan());
+      Assertions.assertEquals(tokens[i], children[i].getCoveredText());
+      start += tokens[i].length() + 1;
+    }
+  }
+
+  @Test
+  void testCreateFromTokensNullThrows() {
+    Assertions.assertThrows(NullPointerException.class, () -> 
Parse.createFromTokens(null));

Review Comment:
   This needs adjustment to catch IllegalArgumentException instead.



##########
opennlp-api/src/main/java/opennlp/tools/parser/Parse.java:
##########
@@ -803,6 +803,42 @@ public static void fixPossesives(Parse parse) {
   }
 
 
+  /**
+   * Creates a {@link Parse} structure from an array of
+   * pre-tokenized strings.
+   * <p>
+   * This is a convenience factory method for cases
+   * where the input sentence is already tokenized
+   * (e.g., as a {@code String[]}). It joins the
+   * tokens with whitespace, computes character offset
+   * {@link Span spans} for each token, and builds the
+   * initial flat parse tree expected by
+   * {@link Parser#parse(Parse)}.
+   *
+   * @param tokens The tokens of the sentence.
+   * @return A flat {@link Parse} structure with token
+   *         nodes ready for a {@link Parser}.
+   * @throws IllegalArgumentException if {@code tokens}
+   *         is {@code null} or empty.
+   */
+  public static Parse createFromTokens(final String[] tokens) {
+    Objects.requireNonNull(tokens, "tokens must not be null");
+    if (tokens.length == 0) {
+      throw new IllegalArgumentException("tokens must not be empty");
+    }
+    String text = String.join(" ", tokens);
+    Parse p = new Parse(text,

Review Comment:
   Pls declare `Parse p` final here.



##########
opennlp-core/opennlp-cli/src/main/java/opennlp/tools/cmdline/parser/ParserTool.java:
##########
@@ -78,17 +74,25 @@ public static Parse[] parseLine(String line, Parser parser, 
Tokenizer tokenizer,
     line = UNTOKENIZED_PAREN_PATTERN_2.matcher(line).replaceAll("$1 $2");
 
     // tokenize
-    List<String> tokens = Arrays.asList( tokenizer.tokenize(line));
-    String text = String.join(" ", tokens);
-
-    Parse p = new Parse(text, new Span(0, text.length()), 
AbstractBottomUpParser.INC_NODE, 0, 0);
-    int start = 0;
-    int i = 0;
-    for (Iterator<String> ti = tokens.iterator(); ti.hasNext(); i++) {
-      String tok = ti.next();
-      p.insert(new Parse(text, new Span(start, start + tok.length()), 
AbstractBottomUpParser.TOK_NODE, 0, i));
-      start += tok.length() + 1;
-    }
+    String[] tokens = tokenizer.tokenize(line);
+    return parseLine(tokens, parser, numParses);
+  }
+
+  /**
+   * Parses the specified pre-tokenized sentence and returns the requested 
number of parses
+   * or fewer.
+   * <p>
+   * This is a convenience method for cases where the input has already been 
tokenized
+   * into individual tokens. It avoids re-tokenizing and the need to manually 
construct
+   * the whitespace-separated text and compute character offsets.
+   *
+   * @param tokens    The tokens of the sentence to parse.
+   * @param parser    The {@link Parser} to use.
+   * @param numParses The number of parses desired.
+   * @return The specified number of {@link Parse parses} for the given tokens.
+   */
+  public static Parse[] parseLine(String[] tokens, Parser parser, int 
numParses) {
+    Parse p = Parse.createFromTokens(tokens);

Review Comment:
   This line can throw IllegalArgumentException if tokens array is null or 
empty. Therefore, the javadoc should inform calllers of that fact. Please 
adjust the Javadoc to include `@ throws ...` accordingly.



##########
opennlp-api/src/main/java/opennlp/tools/parser/Parse.java:
##########
@@ -803,6 +803,42 @@ public static void fixPossesives(Parse parse) {
   }
 
 
+  /**
+   * Creates a {@link Parse} structure from an array of
+   * pre-tokenized strings.
+   * <p>
+   * This is a convenience factory method for cases
+   * where the input sentence is already tokenized
+   * (e.g., as a {@code String[]}). It joins the
+   * tokens with whitespace, computes character offset
+   * {@link Span spans} for each token, and builds the
+   * initial flat parse tree expected by
+   * {@link Parser#parse(Parse)}.
+   *
+   * @param tokens The tokens of the sentence.
+   * @return A flat {@link Parse} structure with token
+   *         nodes ready for a {@link Parser}.
+   * @throws IllegalArgumentException if {@code tokens}
+   *         is {@code null} or empty.
+   */
+  public static Parse createFromTokens(final String[] tokens) {
+    Objects.requireNonNull(tokens, "tokens must not be null");

Review Comment:
   This is not consistent with the Javadoc, as if a specified object reference 
is `null` and this call throws a `NullPointerException`. Hence this should be 
reworked to throw IllegalArgumentException as stated for callers of that method.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] OPENNLP-53: Add Parse.createFromTokens() for convenient tokenized input (opennlp)

Reply via email to