This is an automated email from the ASF dual-hosted git repository.
jzemerick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/opennlp.git
The following commit(s) were added to refs/heads/master by this push:
new 22ba38f OPENNLP-1189: Updating tokenizer input description. (#307)
22ba38f is described below
commit 22ba38f8a902de29e9bc4d40ec5da14fe31cdc8e
Author: Jeff Zemerick <[email protected]>
AuthorDate: Fri May 18 06:37:47 2018 -0400
OPENNLP-1189: Updating tokenizer input description. (#307)
---
opennlp-docs/src/docbkx/tokenizer.xml | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/opennlp-docs/src/docbkx/tokenizer.xml b/opennlp-docs/src/docbkx/tokenizer.xml
index 3fb4519..f1fae19 100644
--- a/opennlp-docs/src/docbkx/tokenizer.xml
+++ b/opennlp-docs/src/docbkx/tokenizer.xml
@@ -215,7 +215,8 @@ double tokenProbs[] = tokenizer.getTokenProbabilities();]]>
available from the model download page on various corpora. The data
can be converted to the OpenNLP Tokenizer training format or used directly.
The OpenNLP format contains one sentence per line. Tokens are either separated by a
- whitespace or by a special <SPLIT> tag.
+ whitespace or by a special <SPLIT> tag. Tokens are split automatically on whitespace
+ and at least one <SPLIT> tag must be present in the training text.
The following sample shows the sample from above in the correct format.
<screen>
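As an illustration of the training format described in this change, the following is a minimal Java sketch, not OpenNLP code. The class TrainingLineSketch, its methods, and the sample line are hypothetical names introduced here. It assumes only what the text states: one sentence per line, tokens separated by whitespace or by a <SPLIT> tag, and at least one <SPLIT> tag somewhere in the training text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the OpenNLP API.
class TrainingLineSketch {

    static final String SPLIT = "<SPLIT>";

    // Returns the tokens encoded in one training line (one sentence per line).
    static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        // Whitespace separates tokens implicitly.
        for (String chunk : line.trim().split("\\s+")) {
            // <SPLIT> marks a token boundary where the raw text has no whitespace.
            for (String token : chunk.split(Pattern.quote(SPLIT))) {
                if (!token.isEmpty()) {
                    result.add(token);
                }
            }
        }
        return result;
    }

    // Checks the requirement that at least one <SPLIT> tag appears in the training text.
    static boolean hasSplitTag(List<String> lines) {
        for (String line : lines) {
            if (line.contains(SPLIT)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical sample line, not taken from the OpenNLP documentation.
        String line = "Hello<SPLIT>, world<SPLIT>!";
        System.out.println(tokens(line));
        System.out.println(hasSplitTag(List.of(line)));
    }
}
```

A check like hasSplitTag could be run over a training file before training, since data tokenized purely by whitespace would not satisfy the requirement stated above.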