Re: [PR] OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) (opennlp)

via GitHub Sat, 20 Jun 2026 11:34:56 -0700


krickert commented on code in PR #1106:
URL: https://github.com/apache/opennlp/pull/1106#discussion_r3447269619



##########
opennlp-docs/src/docbkx/namefinder.xml:
##########
@@ -157,11 +157,36 @@ Span[] nameSpans = nameFinder.find(sentence);]]>
 File vocab = new File("/path/to/vocab.txt");
 Map<Integer, String> categories = new HashMap<>();
 String[] tokens = new String[]{"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
-NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, false, getIds2Labels());
+NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, getIds2Labels(), sentenceDetector);

Review Comment:
   Fixed. The example now defines `ids2Labels` and the `SentenceDetector`, 
drops the unused `categories` map, and uses `findInOriginal(...)`, the 
offset-safe method.
   



##########
opennlp-docs/src/docbkx/namefinder.xml:
##########
@@ -157,11 +157,36 @@ Span[] nameSpans = nameFinder.find(sentence);]]>
 File vocab = new File("/path/to/vocab.txt");
 Map<Integer, String> categories = new HashMap<>();
 String[] tokens = new String[]{"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
-NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, false, getIds2Labels());
+NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, getIds2Labels(), sentenceDetector);
 Span[] spans = nameFinderDL.find(tokens);]]>
                                        </programlisting>
                                         For additional examples, refer to the <code>NameFinderDLEval</code> class.
                                </para>
+                               <para>
+                                       Long input text is split into overlapping chunks on the full Unicode
+                                       <code>White_Space</code> set before WordPiece tokenization, so spacing such as a
+                                       no-break space or the CJK ideographic space is recognized as a delimiter. After
+                                       inference, reconstructed entity text is matched back to the caller's original input
+                                       with a Unicode-aware cursor scan (not a regular expression), so
+                                       <code>Span#getCoveredText(...)</code> returns the source text even when WordPiece
+                                       rejoins sub-tokens with spaces or when the source uses non-ASCII whitespace between
+                                       tokens.
+                               </para>
+                               <para>
+                                       Optional preprocessing of the joined input text is available through
+                                       <code>InferenceOptions</code> and is off by default:
+                                       <code>setNormalizeWhitespace(true)</code> folds each Unicode whitespace character to
+                                       an ASCII space, and <code>setNormalizeDashes(true)</code> folds Unicode dashes to the
+                                       ASCII hyphen-minus. Both transforms are one code point to one character and preserve
+                                       offsets. Full details, the underlying <code>CharClass</code> engine, and the broader
+                                       normalization pipeline are documented in <xref linkend="tools.normalizer"/>.
+                               </para>
+                               <programlisting language="java">
+<![CDATA[InferenceOptions options = new InferenceOptions();
+options.setNormalizeWhitespace(true);
+options.setNormalizeDashes(true);
+NameFinderDL finder = new NameFinderDL(model, vocab, ids2Labels, options, sentenceDetector);]]>

Review Comment:
   Fixed. The normalization example is now self-contained: it defines the 
model, vocab, `ids2Labels`, `SentenceDetector`, and tokens, and uses 
`findInOriginal(...)`, which maps spans back to original coordinates even when 
a fold changes the input length.
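
For readers of this thread, the offset-preserving claim in the hunk above can be illustrated with a minimal plain-Java sketch. It is not the OpenNLP implementation and does not use `InferenceOptions` or `CharClass`; the class and method names (`FoldSketch`, `foldWhitespace`, `foldDashes`) are hypothetical. It assumes `java.util.regex` as the matching engine, and it approximates "Unicode dashes" with the general category `Pd`, which may not be the exact set OpenNLP folds.

```java
// Hypothetical illustration only; not OpenNLP code.
class FoldSketch {

    // Replace every code point with the Unicode White_Space property by an ASCII space.
    static String foldWhitespace(String text) {
        return text.replaceAll("\\p{IsWhite_Space}", " ");
    }

    // Replace every code point in general category Pd (dash punctuation) by '-'.
    // Assumption: Pd stands in for whatever dash set the library actually uses.
    static String foldDashes(String text) {
        return text.replaceAll("\\p{Pd}", "-");
    }

    public static void main(String[] args) {
        // No-break space, en dash, and ideographic space between the words.
        String original = "New\u00A0York\u2013Boston\u3000line";
        String folded = foldDashes(foldWhitespace(original));

        System.out.println(folded);
        // Every replaced character here is in the BMP, so the length is unchanged.
        System.out.println(folded.length() == original.length());
    }
}
```

Because each replacement in this input swaps one `char` for one `char`, an index into the folded string points at the same position in the original. The sketch does not cover inputs where a fold changes the length, which is the case `findInOriginal(...)` is described as handling.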
   



##########
opennlp-docs/src/docbkx/doccat.xml:
##########
@@ -171,6 +171,24 @@ String category = myCategorizer.getBestCategory(outcomes);]]>
                                 </programlisting>
                                 For additional examples, refer to the <code>DocumentCategorizerDLEval</code> class.
                        </para>
+                       <para>
+                               Like <code>NameFinderDL</code>, long input is split into overlapping chunks on the full
+                               Unicode <code>White_Space</code> set rather than Java's <code>\s</code>, so text copied
+                               from PDFs, the web, or multilingual sources tokenizes consistently. Optional
+                               preprocessing through <code>InferenceOptions</code> is off by default:
+                               <code>setNormalizeWhitespace(true)</code> maps each Unicode whitespace code point to an
+                               ASCII space, and <code>setNormalizeDashes(true)</code> maps Unicode dashes to the ASCII
+                               hyphen-minus. Both are one-to-one replacements that preserve character offsets. See
+                               <xref linkend="tools.normalizer"/> for the shared <code>CharClass</code> engine and the
+                               full normalization library.
+                       </para>
+                       <programlisting language="java">
+<![CDATA[InferenceOptions options = new InferenceOptions();
+options.setNormalizeWhitespace(true);
+options.setNormalizeDashes(true);
+DocumentCategorizerDL categorizer = new DocumentCategorizerDL(
+    model, vocab, categories, scoringStrategy, options);]]>

Review Comment:
   Fixed. The snippet constructs `new AverageClassificationScoringStrategy()` 
inline instead of referencing an undefined `scoringStrategy`.
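
As background for the `White_Space` versus `\s` wording in the hunk above, here is a small plain-Java sketch of the difference. It is only an illustration under assumptions: the names `SplitSketch`, `splitAscii`, and `splitUnicode` are hypothetical, and it uses regular expressions for brevity, which is not necessarily how OpenNLP performs its chunking.

```java
// Hypothetical illustration only; not OpenNLP code.
class SplitSketch {

    // Java's default \s covers only space, tab, newline, vertical tab, form feed, and carriage return.
    static String[] splitAscii(String text) {
        return text.split("\\s+");
    }

    // \p{IsWhite_Space} covers the full Unicode White_Space property,
    // including the no-break space (U+00A0) and the ideographic space (U+3000).
    static String[] splitUnicode(String text) {
        return text.split("\\p{IsWhite_Space}+");
    }

    public static void main(String[] args) {
        // Separators: no-break space, ideographic space, ASCII space.
        String text = "George\u00A0Washington\u3000was president";

        System.out.println(splitAscii(text).length);   // 2
        System.out.println(splitUnicode(text).length); // 4
    }
}
```

With the default `\s`, the first three words stay glued together because only the ASCII space is treated as a delimiter.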
   



##########
opennlp-docs/src/docbkx/doccat.xml:
##########
@@ -171,6 +171,24 @@ String category = myCategorizer.getBestCategory(outcomes);]]>
                                 </programlisting>
                                 For additional examples, refer to the <code>DocumentCategorizerDLEval</code> class.
                        </para>
+                       <para>

Review Comment:
   Fixed. Updated to the current constructor: `new DocumentCategorizerDL(model, 
vocab, categories, new AverageClassificationScoringStrategy(), new 
InferenceOptions())`.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
