krickert commented on code in PR #1106:
URL: https://github.com/apache/opennlp/pull/1106#discussion_r3447269619
##########
opennlp-docs/src/docbkx/namefinder.xml:
##########
@@ -157,11 +157,36 @@ Span[] nameSpans = nameFinder.find(sentence);]]>
File vocab = new File("/path/to/vocab.txt");
Map<Integer, String> categories = new HashMap<>();
String[] tokens = new String[]{"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
-NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, false, getIds2Labels());
+NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, getIds2Labels(), sentenceDetector);
Review Comment:
Fixed. The example now defines `ids2Labels` and the `SentenceDetector`,
drops the unused `categories` map, and uses `findInOriginal(...)`, the
offset-safe method.
##########
opennlp-docs/src/docbkx/namefinder.xml:
##########
@@ -157,11 +157,36 @@ Span[] nameSpans = nameFinder.find(sentence);]]>
File vocab = new File("/path/to/vocab.txt");
Map<Integer, String> categories = new HashMap<>();
String[] tokens = new String[]{"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
-NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, false, getIds2Labels());
+NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, getIds2Labels(), sentenceDetector);
Span[] spans = nameFinderDL.find(tokens);]]>
</programlisting>
For additional examples, refer to the <code>NameFinderDLEval</code> class.
</para>
+ <para>
+ Long input text is split into overlapping chunks on the full Unicode
+ <code>White_Space</code> set before WordPiece tokenization, so spacing such as a
+ no-break space or the CJK ideographic space is recognized as a delimiter. After
+ inference, reconstructed entity text is matched back to the caller's original input
+ with a Unicode-aware cursor scan (not a regular expression), so
+ <code>Span#getCoveredText(...)</code> returns the source text even when WordPiece
+ rejoins sub-tokens with spaces or when the source uses non-ASCII whitespace between
+ tokens.
+ </para>
+ <para>
+ Optional preprocessing of the joined input text is available through
+ <code>InferenceOptions</code> and is off by default:
+ <code>setNormalizeWhitespace(true)</code> folds each Unicode whitespace character to
+ an ASCII space, and <code>setNormalizeDashes(true)</code> folds Unicode dashes to the
+ ASCII hyphen-minus. Both transforms are one code point to one character and preserve
+ offsets. Full details, the underlying <code>CharClass</code> engine, and the broader
+ normalization pipeline are documented in <xref linkend="tools.normalizer"/>.
+ </para>
+ <programlisting language="java">
+<![CDATA[InferenceOptions options = new InferenceOptions();
+options.setNormalizeWhitespace(true);
+options.setNormalizeDashes(true);
+NameFinderDL finder = new NameFinderDL(model, vocab, ids2Labels, options, sentenceDetector);]]>
Review Comment:
Fixed. The normalization example is now self-contained: it defines the
model, vocab, `ids2Labels`, `SentenceDetector`, and tokens, and uses
`findInOriginal(...)`, which maps spans back to original coordinates even when
a fold changes the input length.
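
As background for the whitespace fold discussed in this thread, here is a minimal JDK-only sketch. It is not the OpenNLP implementation: it assumes the fold can be modeled with the JDK regex property `\p{IsWhite_Space}`, and `WhitespaceFoldSketch` and `foldWhitespace` are hypothetical names used only for illustration. It covers just the whitespace case, where the UTF-16 length stays the same because every `White_Space` code point lies in the Basic Multilingual Plane.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of OpenNLP.
final class WhitespaceFoldSketch {

    // \p{IsWhite_Space} is the JDK regex name for the Unicode White_Space property.
    private static final Pattern UNICODE_WS = Pattern.compile("\\p{IsWhite_Space}");

    // Replace each White_Space code point with one ASCII space.
    static String foldWhitespace(String text) {
        return UNICODE_WS.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // No-break space (U+00A0) and ideographic space (U+3000) as separators.
        String original = "George\u00A0Washington\u3000was president";
        String folded = foldWhitespace(original);
        System.out.println(folded);
        // One char in, one char out, so offsets into the folded text still
        // line up with the original for this particular fold.
        System.out.println(original.length() == folded.length());
    }
}
```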
##########
opennlp-docs/src/docbkx/doccat.xml:
##########
@@ -171,6 +171,24 @@ String category = myCategorizer.getBestCategory(outcomes);]]>
</programlisting>
For additional examples, refer to the <code>DocumentCategorizerDLEval</code> class.
</para>
+ <para>
+ Like <code>NameFinderDL</code>, long input is split into overlapping chunks on the full
+ Unicode <code>White_Space</code> set rather than Java's <code>\s</code>, so text copied
+ from PDFs, the web, or multilingual sources tokenizes consistently. Optional
+ preprocessing through <code>InferenceOptions</code> is off by default:
+ <code>setNormalizeWhitespace(true)</code> maps each Unicode whitespace code point to an
+ ASCII space, and <code>setNormalizeDashes(true)</code> maps Unicode dashes to the ASCII
+ hyphen-minus. Both are one-to-one replacements that preserve character offsets. See
+ <xref linkend="tools.normalizer"/> for the shared <code>CharClass</code> engine and the
+ full normalization library.
+ </para>
+ <programlisting language="java">
+<![CDATA[InferenceOptions options = new InferenceOptions();
+options.setNormalizeWhitespace(true);
+options.setNormalizeDashes(true);
+DocumentCategorizerDL categorizer = new DocumentCategorizerDL(
+ model, vocab, categories, scoringStrategy, options);]]>
Review Comment:
Fixed. The snippet constructs `new AverageClassificationScoringStrategy()`
inline instead of referencing an undefined `scoringStrategy`.
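
Separately, the difference between the `White_Space` set and Java's `\s` that the paragraph above relies on can be checked with the JDK alone. This is a rough sketch, not how OpenNLP splits chunks internally (that code is not shown in this thread); `WhiteSpaceSplitSketch`, `splitAscii`, and `splitUnicode` are hypothetical names for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of OpenNLP.
final class WhiteSpaceSplitSketch {

    // Without UNICODE_CHARACTER_CLASS, \s matches only [ \t\n\x0B\f\r].
    private static final Pattern ASCII_WS = Pattern.compile("\\s+");

    // \p{IsWhite_Space} matches the full Unicode White_Space property.
    private static final Pattern UNICODE_WS = Pattern.compile("\\p{IsWhite_Space}+");

    static String[] splitAscii(String text) {
        return ASCII_WS.split(text);
    }

    static String[] splitUnicode(String text) {
        return UNICODE_WS.split(text);
    }

    public static void main(String[] args) {
        // Words separated by a no-break space (U+00A0) and an ideographic space (U+3000).
        String text = "price\u00A0list\u3000total";
        // \s finds no delimiter here, so the text stays in one piece.
        System.out.println(splitAscii(text).length);
        // The White_Space property sees both separators.
        System.out.println(Arrays.toString(splitUnicode(text)));
    }
}
```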
##########
opennlp-docs/src/docbkx/doccat.xml:
##########
@@ -171,6 +171,24 @@ String category = myCategorizer.getBestCategory(outcomes);]]>
</programlisting>
For additional examples, refer to the <code>DocumentCategorizerDLEval</code> class.
</para>
+ <para>
Review Comment:
Fixed. Updated to the current constructor: `new DocumentCategorizerDL(model,
vocab, categories, new AverageClassificationScoringStrategy(), new
InferenceOptions())`.