Re: [PR] OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) (opennlp)

via GitHub Sat, 20 Jun 2026 08:02:32 -0700


Copilot commented on code in PR #1106:
URL: https://github.com/apache/opennlp/pull/1106#discussion_r3446750204



##########
opennlp-docs/src/docbkx/doccat.xml:
##########
@@ -171,6 +171,24 @@ String category = myCategorizer.getBestCategory(outcomes);]]>
                                </programlisting>
                                For additional examples, refer to the <code>DocumentCategorizerDLEval</code> class.
                        </para>
+                       <para>
+                               Like <code>NameFinderDL</code>, long input is split into overlapping chunks on the full
+                               Unicode <code>White_Space</code> set rather than Java's <code>\s</code>, so text copied
+                               from PDFs, the web, or multilingual sources tokenizes consistently. Optional
+                               preprocessing through <code>InferenceOptions</code> is off by default:
+                               <code>setNormalizeWhitespace(true)</code> maps each Unicode whitespace code point to an
+                               ASCII space, and <code>setNormalizeDashes(true)</code> maps Unicode dashes to the ASCII
+                               hyphen-minus. Both are one-to-one replacements that preserve character offsets. See
+                               <xref linkend="tools.normalizer"/> for the shared <code>CharClass</code> engine and the
+                               full normalization library.
+                       </para>
+                       <programlisting language="java">
+<![CDATA[InferenceOptions options = new InferenceOptions();
+options.setNormalizeWhitespace(true);
+options.setNormalizeDashes(true);
+DocumentCategorizerDL categorizer = new DocumentCategorizerDL(
+    model, vocab, categories, scoringStrategy, options);]]>

Review Comment:
   This added ONNX example uses `scoringStrategy` without defining it, so the 
snippet won’t compile. Define a concrete `ClassificationScoringStrategy` (e.g., 
`AverageClassificationScoringStrategy`) in the listing before constructing the 
`DocumentCategorizerDL`.
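
   As a side illustration of the paragraph quoted in this hunk, the difference 
between Java's default `\s` and the Unicode `White_Space` property can be 
sketched with the JDK alone. This is only a sketch: the class and method names 
are hypothetical, and it does not show how OpenNLP's chunking is actually 
implemented.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of OpenNLP.
final class WhiteSpaceSplitSketch {

    // Java's default \s matches only [ \t\n\x0B\f\r].
    static final Pattern ASCII_WS = Pattern.compile("\\s+");

    // \p{IsWhite_Space} matches the Unicode White_Space binary property,
    // which also covers NO-BREAK SPACE (U+00A0) and IDEOGRAPHIC SPACE (U+3000).
    static final Pattern UNICODE_WS = Pattern.compile("\\p{IsWhite_Space}+");

    static String[] splitAscii(String text) {
        return ASCII_WS.split(text);
    }

    static String[] splitUnicode(String text) {
        return UNICODE_WS.split(text);
    }

    public static void main(String[] args) {
        // U+00A0 and U+3000 separate the first three words; an ASCII space the last two.
        String text = "George\u00A0Washington\u3000was president";
        System.out.println(splitAscii(text).length);   // only the ASCII space splits
        System.out.println(splitUnicode(text).length); // all three separators split
    }
}
```

   With the sample string, the `\s` split leaves the first three words fused 
into one piece, while the `White_Space` split yields one piece per word.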



##########
opennlp-docs/src/docbkx/namefinder.xml:
##########
@@ -157,11 +157,36 @@ Span[] nameSpans = nameFinder.find(sentence);]]>
 File vocab = new File("/path/to/vocab.txt");
 Map<Integer, String> categories = new HashMap<>();
 String[] tokens = new String[]{"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
-NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, false, getIds2Labels());
+NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, getIds2Labels(), sentenceDetector);

Review Comment:
   The ONNX example declares a `categories` map that is unused and references 
`sentenceDetector` without defining it, so the snippet won’t compile. Since 
`NameFinderDL` requires both an `ids2Labels` map and a `SentenceDetector`, 
define those explicitly here (even if the map is a placeholder to be filled 
with the model’s BIO labels).



##########
opennlp-docs/src/docbkx/doccat.xml:
##########
@@ -171,6 +171,24 @@ String category = myCategorizer.getBestCategory(outcomes);]]>
                                </programlisting>
                                For additional examples, refer to the <code>DocumentCategorizerDLEval</code> class.
                        </para>
+                       <para>

Review Comment:
   The ONNX usage example just above still uses `new 
DocumentCategorizerDL(model, vocab, categories)`, but `DocumentCategorizerDL` 
no longer has a 3-arg constructor; it requires a 
`ClassificationScoringStrategy` and `InferenceOptions`. Updating the example 
avoids a non-compiling snippet in the docs.



##########
opennlp-docs/src/docbkx/namefinder.xml:
##########
@@ -157,11 +157,36 @@ Span[] nameSpans = nameFinder.find(sentence);]]>
 File vocab = new File("/path/to/vocab.txt");
 Map<Integer, String> categories = new HashMap<>();
 String[] tokens = new String[]{"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
-NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, false, getIds2Labels());
+NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, getIds2Labels(), sentenceDetector);
 Span[] spans = nameFinderDL.find(tokens);]]>
                                        </programlisting>
                                        For additional examples, refer to the <code>NameFinderDLEval</code> class.
                                </para>
+                               <para>
+                                       Long input text is split into overlapping chunks on the full Unicode
+                                       <code>White_Space</code> set before WordPiece tokenization, so spacing such as a
+                                       no-break space or the CJK ideographic space is recognized as a delimiter. After
+                                       inference, reconstructed entity text is matched back to the caller's original input
+                                       with a Unicode-aware cursor scan (not a regular expression), so
+                                       <code>Span#getCoveredText(...)</code> returns the source text even when WordPiece
+                                       rejoins sub-tokens with spaces or when the source uses non-ASCII whitespace between
+                                       tokens.
+                               </para>
+                               <para>
+                                       Optional preprocessing of the joined input text is available through
+                                       <code>InferenceOptions</code> and is off by default:
+                                       <code>setNormalizeWhitespace(true)</code> folds each Unicode whitespace character to
+                                       an ASCII space, and <code>setNormalizeDashes(true)</code> folds Unicode dashes to the
+                                       ASCII hyphen-minus. Both transforms are one code point to one character and preserve
+                                       offsets. Full details, the underlying <code>CharClass</code> engine, and the broader
+                                       normalization pipeline are documented in <xref linkend="tools.normalizer"/>.
+                               </para>
+                               <programlisting language="java">
+<![CDATA[InferenceOptions options = new InferenceOptions();
+options.setNormalizeWhitespace(true);
+options.setNormalizeDashes(true);
+NameFinderDL finder = new NameFinderDL(model, vocab, ids2Labels, options, sentenceDetector);]]>

Review Comment:
   This normalization example references `model`, `vocab`, `ids2Labels`, and 
`sentenceDetector` without defining them in the code block, which makes the 
snippet non-runnable as shown. Consider making the snippet self-contained by 
defining these values (or clearly marking them as placeholders) within the same 
listing.
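
   To make the "one code point to one character" claim in the quoted paragraph 
concrete, here is a minimal JDK-only sketch of an offset-preserving fold. It 
rests on stated assumptions: the class and method names are hypothetical, the 
dash set is a small illustrative subset rather than the full Unicode `Dash` 
property, and nothing here reflects how the `CharClass` engine is actually 
written.

```java
// Hypothetical illustration class; not OpenNLP's CharClass engine.
final class OffsetPreservingFoldSketch {

    // Code points with the Unicode White_Space property. All of them are in
    // the Basic Multilingual Plane, so each occupies exactly one Java char.
    static boolean isUnicodeWhiteSpace(int cp) {
        return (cp >= 0x0009 && cp <= 0x000D) || cp == 0x0020 || cp == 0x0085
            || cp == 0x00A0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A)
            || cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F
            || cp == 0x3000;
    }

    // Assumption: a small subset of dashes (U+2010..U+2015 and U+2212 MINUS SIGN),
    // chosen for illustration only.
    static boolean isCommonDash(int cp) {
        return (cp >= 0x2010 && cp <= 0x2015) || cp == 0x2212;
    }

    // Replaces each matching char in place, so the output has the same length
    // as the input and every index still points at the same position.
    static String fold(String text) {
        char[] out = text.toCharArray();
        for (int i = 0; i < out.length; i++) {
            if (isUnicodeWhiteSpace(out[i])) {
                out[i] = ' ';
            } else if (isCommonDash(out[i])) {
                out[i] = '-';
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        String source = "New\u00A0York\u2014based";
        String folded = fold(source);
        System.out.println(folded);
        System.out.println(source.length() == folded.length());
    }
}
```

   Because the fold never changes the string length, a span computed on the 
folded text can be applied to the original input without any offset mapping.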



