This is an automated email from the ASF dual-hosted git repository.
mawiesne pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/opennlp.git
The following commit(s) were added to refs/heads/main by this push:
new 1d722007 OPENNLP-1660: Switch to pre-trained UD models in Dev Manual (#702)
1d722007 is described below
commit 1d7220073fe5e1c92fa27faeeee3180cf5c7fd51
Author: Martin Wiesner <[email protected]>
AuthorDate: Tue Dec 3 05:00:58 2024 +0100
OPENNLP-1660: Switch to pre-trained UD models in Dev Manual (#702)
---
opennlp-docs/src/docbkx/langdetect.xml | 4 +-
opennlp-docs/src/docbkx/lemmatizer.xml | 116 ++++++++++++++++-----------------
opennlp-docs/src/docbkx/postagger.xml | 21 +++---
opennlp-docs/src/docbkx/sentdetect.xml | 17 +++--
opennlp-docs/src/docbkx/tokenizer.xml | 22 +++----
5 files changed, 88 insertions(+), 92 deletions(-)
diff --git a/opennlp-docs/src/docbkx/langdetect.xml b/opennlp-docs/src/docbkx/langdetect.xml
index 865f5f94..7e901714 100644
--- a/opennlp-docs/src/docbkx/langdetect.xml
+++ b/opennlp-docs/src/docbkx/langdetect.xml
@@ -147,7 +147,7 @@ lav Egija Tri-Active procedūru īpaši iesaka izmantot siltākajos gadalaik
<section id="tools.langdetect.training.tool">
<title>Training Tool</title>
<para>
-    The following command will train the language detector and write the model to langdetect.bin:
+    The following command will train the language detector and write the model to langdetect-custom.bin:
<screen>
<![CDATA[
$ bin/opennlp LanguageDetectorTrainer[.leipzig] -model modelFile [-params paramsFile] \
@@ -214,7 +214,7 @@ params.put(TrainingParameters.CUTOFF_PARAM, 0);
LanguageDetectorFactory factory = new LanguageDetectorFactory();
LanguageDetectorModel model = LanguageDetectorME.train(sampleStream, params, factory);
-model.serialize(new File("langdetect.bin"));]]>
+model.serialize(new File("langdetect-custom.bin"));]]>
</programlisting>
</para>
</section>
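For readers reproducing this outside the manual: the trained model can be loaded back and queried through the LanguageDetector API. A minimal sketch (the file name matches the snippet above; the sample text and imports are illustrative, not part of the commit):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.langdetect.*;

    // Load the model serialized above and predict the language of a sample text.
    try (InputStream modelIn = new FileInputStream("langdetect-custom.bin")) {
        LanguageDetectorModel model = new LanguageDetectorModel(modelIn);
        LanguageDetector detector = new LanguageDetectorME(model);
        Language best = detector.predictLanguage("Estava em mau estado .");  // illustrative input
        System.out.println(best.getLang() + " " + best.getConfidence());
    }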
diff --git a/opennlp-docs/src/docbkx/lemmatizer.xml b/opennlp-docs/src/docbkx/lemmatizer.xml
index 44356e04..4c07b66a 100644
--- a/opennlp-docs/src/docbkx/lemmatizer.xml
+++ b/opennlp-docs/src/docbkx/lemmatizer.xml
@@ -41,31 +41,31 @@
<para>
<screen>
<![CDATA[
-$ opennlp LemmatizerME en-lemmatizer.bin < sentences]]>
+$ opennlp LemmatizerME opennlp-en-ud-ewt-lemmas-1.2-2.5.0.bin < sentences]]>
</screen>
The Lemmatizer now reads one POS-tagged sentence per line from standard input.
For example, you can copy this sentence to the console:
<screen>
<![CDATA[
-Rockwell_NNP International_NNP Corp._NNP 's_POS Tulsa_NNP unit_NN said_VBD it_PRP
-signed_VBD a_DT tentative_JJ agreement_NN extending_VBG its_PRP$ contract_NN with_IN
-Boeing_NNP Co._NNP to_TO provide_VB structural_JJ parts_NNS for_IN Boeing_NNP 's_POS
-747_CD jetliners_NNS ._.
+Rockwell_PROPN International_ADJ Corp_NOUN 's_PUNCT Tulsa_PROPN unit_NOUN said_VERB it_PRON
+signed_VERB a_DET tentative_NOUN agreement_NOUN extending_VERB its_PRON contract_NOUN
+with_ADP Boeing_PROPN Co._NOUN to_PART provide_VERB structural_ADJ parts_NOUN for_ADP
+Boeing_PROPN 's_PUNCT 747_NUM jetliners_NOUN ._PUNCT
</screen>
The Lemmatizer will now echo the lemmas for each word/POS-tag pair to the console:
<screen>
<![CDATA[
-Rockwell NNP rockwell
-International NNP international
-Corp. NNP corp.
-'s POS 's
-Tulsa NNP tulsa
-unit NN unit
-said VBD say
-it PRP it
-signed VBD sign
+Rockwell PROPN rockwell
+International ADJ international
+Corp NOUN corp
+'s PUNCT 's
+Tulsa PROPN tulsa
+unit NOUN unit
+said VERB say
+it PRON it
+signed VERB sign
...
]]>
</screen>
@@ -89,7 +89,7 @@ signed VBD sign
<programlisting language="java">
<![CDATA[
LemmatizerModel model = null;
-try (InputStream modelIn = new FileInputStream("en-lemmatizer.bin"))) {
+try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-lemmas-1.2-2.5.0.bin")) {
model = new LemmatizerModel(modelIn);
}
]]>
@@ -116,10 +116,10 @@ String[] tokens = new String[] { "Rockwell", "International", "Corp.", "'s",
"provide", "structural", "parts", "for", "Boeing", "'s", "747",
"jetliners", "." };
-String[] postags = new String[] { "NNP", "NNP", "NNP", "POS", "NNP", "NN",
- "VBD", "PRP", "VBD", "DT", "JJ", "NN", "VBG", "PRP$", "NN", "IN",
- "NNP", "NNP", "TO", "VB", "JJ", "NNS", "IN", "NNP", "POS", "CD", "NNS",
- "." };
+String[] postags = new String[] { "PROPN", "ADJ", "NOUN", "PUNCT", "PROPN", "NOUN",
+  "VERB", "PRON", "VERB", "DET", "NOUN", "NOUN", "VERB", "PRON", "NOUN", "ADP",
+  "PROPN", "NOUN", "PART", "VERB", "ADJ", "NOUN", "ADP", "PROPN", "PUNCT", "NUM", "NOUN",
+ "PUNCT" };
String[] lemmas = lemmatizer.lemmatize(tokens, postags);]]>
</programlisting>
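Putting the model-loading and lemmatize snippets together, a minimal end-to-end sketch (the model file name matches the diff; the shortened token/tag arrays are illustrative):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.lemmatizer.LemmatizerME;
    import opennlp.tools.lemmatizer.LemmatizerModel;

    try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-lemmas-1.2-2.5.0.bin")) {
        LemmatizerModel model = new LemmatizerModel(modelIn);
        LemmatizerME lemmatizer = new LemmatizerME(model);
        // UD tags as in the example above; the manual shows "rockwell", "international" as output
        String[] lemmas = lemmatizer.lemmatize(
            new String[] { "Rockwell", "International" },
            new String[] { "PROPN", "ADJ" });
    }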
@@ -136,20 +136,20 @@ String[] lemmas = lemmatizer.lemmatize(tokens, postags);]]>
corresponding lemma, each column separated by a tab character.
<screen>
<![CDATA[
-show NN show
-showcase NN showcase
-showcases NNS showcase
-showdown NN showdown
-showdowns NNS showdown
-shower NN shower
-showers NNS shower
-showman NN showman
-showmanship NN showmanship
-showmen NNS showman
-showroom NN showroom
-showrooms NNS showroom
-shows NNS show
-shrapnel NN shrapnel
+show NOUN show
+showcase NOUN showcase
+showcases NOUN showcase
+showdown NOUN showdown
+showdowns NOUN showdown
+shower NOUN shower
+showers NOUN shower
+showman NOUN showman
+showmanship NOUN showmanship
+showmen NOUN showman
+showroom NOUN showroom
+showrooms NOUN showroom
+shows NOUN show
+shrapnel NOUN shrapnel
]]>
</screen>
Alternatively, if a (word,postag) pair can output multiple lemmas, the
@@ -157,10 +157,10 @@ shrapnel NN shrapnel
each row, a word, its postag and the corresponding lemmas separated by "#":
<screen>
<![CDATA[
-muestras NN muestra
-cantaba V cantar
-fue V ir#ser
-entramos V entrar
+muestras NOUN muestra
+cantaba VERB cantar
+fue VERB ir#ser
+entramos VERB entrar
]]>
</screen>
First the dictionary must be loaded into memory from disk or another
@@ -170,7 +170,7 @@ entramos V entrar
<![CDATA[
InputStream dictLemmatizer = null;
-try (dictLemmatizer = new FileInputStream("english-lemmatizer.txt")) {
+try (dictLemmatizer = new FileInputStream("english-dict-lemmatizer.txt")) {
}
]]>
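Note that the try block above does not compile as written (try-with-resources needs a resource declaration). A corrected, minimal sketch of loading and using the dictionary, assuming the file name from the diff:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.lemmatizer.DictionaryLemmatizer;

    try (InputStream dictIn = new FileInputStream("english-dict-lemmatizer.txt")) {
        DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(dictIn);
        // One (word, postag) pair taken from the dictionary excerpt above
        String[] lemmas = lemmatizer.lemmatize(
            new String[] { "showcases" }, new String[] { "NOUN" });
    }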
@@ -217,22 +217,22 @@ String[] lemmas = lemmatizer.lemmatize(tokens, postags);
Sample sentence of the training data:
<screen>
<![CDATA[
-He PRP he
-reckons VBZ reckon
-the DT the
-current JJ current
-accounts NNS account
-deficit NN deficit
-will MD will
-narrow VB narrow
-to TO to
-only RB only
+He PRON he
+reckons VERB reckon
+the DET the
+current ADJ current
+accounts NOUN account
+deficit NOUN deficit
+will AUX will
+narrow VERB narrow
+to PART to
+only ADV only
# # #
-1.8 CD 1.8
-millions CD million
-in IN in
-September NNP september
-. . O]]>
+1.8 NUM 1.8
+millions NOUN million
+in ADP in
+September PROPN september
+. PUNCT O]]>
</screen>
The Universal Dependencies Treebank and the CoNLL 2009 datasets
distribute training data for many languages.
@@ -267,11 +267,11 @@ Arguments description:
</screen>
It is now assumed that the English lemmatizer model should be trained
from a file called
-    'en-lemmatizer.train' which is encoded as UTF-8. The following command will train the
-    lemmatizer and write the model to en-lemmatizer.bin:
+    'en-custom-lemmatizer.train' which is encoded as UTF-8. The following command will train the
+    lemmatizer and write the model to en-custom-lemmatizer.bin:
<screen>
<![CDATA[
-$ opennlp LemmatizerTrainerME -model en-lemmatizer.bin -params PerceptronTrainerParams.txt -lang en -data en-lemmatizer.train -encoding UTF-8]]>
+$ opennlp LemmatizerTrainerME -model en-custom-lemmatizer.bin -params PerceptronTrainerParams.txt -lang en -data en-custom-lemmatizer.train -encoding UTF-8]]>
</screen>
</para>
</section>
@@ -294,7 +294,7 @@ $ opennlp LemmatizerTrainerME -model en-lemmatizer.bin -params PerceptronTrainer
InputStreamFactory inputStreamFactory = null;
try {
inputStreamFactory = new MarkableFileInputStreamFactory(
-      new File(en-lemmatizer.train));
+      new File("en-custom-lemmatizer.train"));
} catch (FileNotFoundException e) {
e.printStackTrace();
}
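The snippet stops short of the actual training call; a hedged completion (the stream and factory classes are the standard opennlp-tools ones; default training parameters are an assumption, not commit content):

    import java.nio.charset.StandardCharsets;
    import opennlp.tools.lemmatizer.*;
    import opennlp.tools.util.*;

    ObjectStream<String> lineStream =
        new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
    ObjectStream<LemmaSample> samples = new LemmaSampleStream(lineStream);
    // Train with default parameters; serialize via model.serialize(...) as needed
    LemmatizerModel model = LemmatizerME.train("en", samples,
        TrainingParameters.defaultParams(), new LemmatizerFactory());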
@@ -345,7 +345,7 @@ InputStreamFactory inputStreamFactory = null;
The following command shows how the tool can be run:
<screen>
<![CDATA[
-$ opennlp LemmatizerEvaluator -model en-lemmatizer.bin -data en-lemmatizer.test -encoding utf-8]]>
+$ opennlp LemmatizerEvaluator -model en-custom-lemmatizer.bin -data en-custom-lemmatizer.test -encoding utf-8]]>
</screen>
This will display the resulting accuracy score, e.g.:
<screen>
diff --git a/opennlp-docs/src/docbkx/postagger.xml b/opennlp-docs/src/docbkx/postagger.xml
index 5f045e4f..71a0fc38 100644
--- a/opennlp-docs/src/docbkx/postagger.xml
+++ b/opennlp-docs/src/docbkx/postagger.xml
@@ -41,7 +41,7 @@ under the License.
Download the English maxent pos model and start the POS Tagger Tool with this command:
<screen>
<![CDATA[
-$ opennlp POSTagger en-pos-maxent.bin]]>
+$ opennlp POSTagger opennlp-en-ud-ewt-pos-1.2-2.5.0.bin]]>
</screen>
The POS Tagger now reads a tokenized sentence per line from stdin.
Copy these two sentences to the console:
@@ -53,9 +53,9 @@ Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .]]>
The POS Tagger will now echo the sentences with pos tags to the console:
<screen>
<![CDATA[
-Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB the_DT board_NN as_IN
- a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._.
-Mr._NNP Vinken_NNP is_VBZ chairman_NN of_IN Elsevier_NNP N.V._NNP ,_, the_DT Dutch_NNP publishing_VBG group_NN]]>
+Pierre_PROPN Vinken_PROPN ,_PUNCT 61_NUM years_NOUN old_ADJ ,_PUNCT will_AUX join_VERB the_DET board_NOUN as_ADP
+ a_DET nonexecutive_ADJ director_NOUN Nov._PROPN 29_NUM ._PUNCT
+Mr._PROPN Vinken_PROPN is_AUX chairman_NOUN of_ADP Elsevier_ADJ N.V._PROPN ,_PUNCT the_DET Dutch_PROPN publishing_VERB group_NOUN ._PUNCT]]>
</screen>
The tag set used by the English pos model is the <ulink url="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">Penn Treebank tag set</ulink>.
</para>
@@ -69,7 +69,7 @@ Mr._NNP Vinken_NNP is_VBZ chairman_NN of_IN Elsevier_NNP N.V._NNP ,_, the_DT Dut
In the sample below it is loaded from disk.
<programlisting language="java">
<![CDATA[
-try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin"){
+try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-pos-1.2-2.5.0.bin")) {
POSModel model = new POSModel(modelIn);
}]]>
</programlisting>
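With the model loaded, tagging is a single call; a minimal sketch (the three-token input is illustrative):

    import opennlp.tools.postag.POSTaggerME;

    POSTaggerME tagger = new POSTaggerME(model);
    // Returns one UD tag per input token, e.g. PROPN PROPN PUNCT for the tokens below
    String[] tags = tagger.tag(new String[] { "Pierre", "Vinken", "," });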
@@ -125,8 +125,8 @@ Sequence[] topSequences = tagger.topKSequences(sent);]]>
The native POS Tagger training material looks like this:
<screen>
<![CDATA[
-About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._.
-That_DT sounds_VBZ good_JJ ._.]]>
+About_ADV 10_NUM Euro_PROPN ,_PUNCT I_PRON reckon_VERB ._PUNCT
+That_PRON sounds_VERB good_ADJ ._PUNCT]]>
</screen>
Each sentence must be in one line. The token/tag pairs are combined with "_".
The token/tag pairs are whitespace separated. The data format does not
@@ -180,8 +180,8 @@ Arguments description:
The following command illustrates how an English part-of-speech model can be trained:
<screen>
<![CDATA[
-$ opennlp POSTaggerTrainer -type maxent -model en-pos-maxent.bin \
- -lang en -data en-pos.train -encoding UTF-8]]>
+$ opennlp POSTaggerTrainer -type maxent -model en-custom-pos-maxent.bin \
+    -lang en -data en-custom-pos.train -encoding UTF-8]]>
</screen>
</para>
</section>
@@ -207,7 +207,8 @@ $ opennlp POSTaggerTrainer -type maxent -model en-pos-maxent.bin \
POSModel model = null;
try {
-  ObjectStream<String> lineStream = new PlainTextByLineStream(new MarkableFileInputStreamFactory(new File("en-pos.train")), StandardCharsets.UTF_8);
+ ObjectStream<String> lineStream = new PlainTextByLineStream(
+      new MarkableFileInputStreamFactory(new File("en-custom-pos.train")), StandardCharsets.UTF_8);
ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
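For completeness, the training call that typically follows this stream setup; a hedged sketch (default parameters and a plain POSTaggerFactory are assumptions, not commit content):

    import java.io.*;
    import opennlp.tools.postag.*;
    import opennlp.tools.util.TrainingParameters;

    model = POSTaggerME.train("en", sampleStream,
        TrainingParameters.defaultParams(), new POSTaggerFactory());
    // Persist the trained model under the name used by the CLI example above
    try (OutputStream modelOut = new BufferedOutputStream(
            new FileOutputStream("en-custom-pos-maxent.bin"))) {
        model.serialize(modelOut);
    }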
diff --git a/opennlp-docs/src/docbkx/sentdetect.xml b/opennlp-docs/src/docbkx/sentdetect.xml
index 4e3a1db6..11b047d3 100644
--- a/opennlp-docs/src/docbkx/sentdetect.xml
+++ b/opennlp-docs/src/docbkx/sentdetect.xml
@@ -63,13 +63,13 @@ Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC,
Download the English sentence detector model and start the Sentence Detector Tool with this command:
<screen>
<![CDATA[
-$ opennlp SentenceDetector en-sent.bin]]>
+$ opennlp SentenceDetector opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin]]>
</screen>
Just copy the sample text from above to the console. The Sentence Detector will read it and echo one sentence per line to the console.
Usually the input is read from a file and the output is redirected to another file. This can be achieved with the following command.
<screen>
<![CDATA[
-$ opennlp SentenceDetector en-sent.bin < input.txt > output.txt]]>
+$ opennlp SentenceDetector opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin < input.txt > output.txt]]>
</screen>
For the English sentence model from the website, the input text should not be tokenized.
@@ -81,8 +81,7 @@ $ opennlp SentenceDetector en-sent.bin < input.txt > output.txt]]>
To instantiate the Sentence Detector the sentence model must be loaded first.
<programlisting language="java">
<![CDATA[
-
-try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
+try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin")) {
SentenceModel model = new SentenceModel(modelIn);
}]]>
</programlisting>
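Once loaded, the detector splits raw text in one call; a minimal sketch (the input string is illustrative):

    import opennlp.tools.sentdetect.SentenceDetectorME;

    SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
    // Returns one string per detected sentence
    String[] sentences = sentenceDetector.sentDetect("First sentence. Second sentence.");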
@@ -148,7 +147,7 @@ Arguments description:
To train an English sentence detector use the following command:
<screen>
<![CDATA[
-$ opennlp SentenceDetectorTrainer -model en-sent.bin -lang en -data en-sent.train -encoding UTF-8
+$ opennlp SentenceDetectorTrainer -model en-custom-sent.bin -lang en -data en-custom-sent.train -encoding UTF-8
]]>
</screen>
It should produce the following output:
@@ -183,7 +182,7 @@ Performing 100 iterations.
99: .. loglikelihood=-284.24296917223916 0.9834118369854598
100: .. loglikelihood=-283.2785335773966 0.9834118369854598
Wrote sentence detector model.
-Path: en-sent.bin
+Path: en-custom-sent.bin
]]>
</screen>
</para>
@@ -209,7 +208,7 @@ Path: en-sent.bin
<![CDATA[
ObjectStream<String> lineStream =
-    new PlainTextByLineStream(new MarkableFileInputStreamFactory(new File("en-sent.train")), StandardCharsets.UTF_8);
+    new PlainTextByLineStream(new MarkableFileInputStreamFactory(new File("en-custom-sent.train")), StandardCharsets.UTF_8);
SentenceModel model;
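The hunk ends before the training call itself; a hedged completion (the factory arguments shown are library defaults, not taken from the commit):

    import opennlp.tools.sentdetect.*;
    import opennlp.tools.util.*;

    ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);
    // "en", useTokenEnd=true, no abbreviation dictionary, default EOS characters
    SentenceDetectorFactory factory = new SentenceDetectorFactory("en", true, null, null);
    model = SentenceDetectorME.train("en", sampleStream, factory,
        TrainingParameters.defaultParams());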
@@ -235,7 +234,7 @@ try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(model
The command shows how the evaluator tool can be run:
<screen>
<![CDATA[
-$ opennlp SentenceDetectorEvaluator -model en-sent.bin -data en-sent.eval -encoding UTF-8
+$ opennlp SentenceDetectorEvaluator -model en-custom-sent.bin -data en-custom-sent.eval -encoding UTF-8
Loading model ... done
Evaluating ... done
@@ -244,7 +243,7 @@ Precision: 0.9465737514518002
Recall: 0.9095982142857143
F-Measure: 0.9277177006260672]]>
</screen>
- The en-sent.eval file has the same format as the training data.
+    The en-custom-sent.eval file has the same format as the training data.
</para>
</section>
</section>
diff --git a/opennlp-docs/src/docbkx/tokenizer.xml b/opennlp-docs/src/docbkx/tokenizer.xml
index 3627d825..c68b5ced 100644
--- a/opennlp-docs/src/docbkx/tokenizer.xml
+++ b/opennlp-docs/src/docbkx/tokenizer.xml
@@ -66,18 +66,15 @@ A form of asbestos once used to make Kent cigarette filters has caused a high
Most part-of-speech taggers, parsers and so on, work with text
tokenized in this manner. It is important to ensure that your
- tokenizer
- produces tokens of the type expected by your later text
- processing
- components.
+    tokenizer produces tokens of the type expected by your later text
+ processing components.
</para>
<para>
With OpenNLP (as with many systems), tokenization is a two-stage
process:
first, sentence boundaries are identified, then tokens within
- each
- sentence are identified.
+ each sentence are identified.
</para>
<section id="tools.tokenizer.cmdline">
@@ -100,7 +97,7 @@ $ opennlp SimpleTokenizer]]>
our website.
<screen>
<![CDATA[
-$ opennlp TokenizerME en-token.bin]]>
+$ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.2-2.5.0.bin]]>
</screen>
To test the tokenizer copy the sample from above to the console. The
whitespace separated tokens will be written back to the
@@ -110,7 +107,7 @@ $ opennlp TokenizerME en-token.bin]]>
Usually the input is read from a file and written to a file.
<screen>
<![CDATA[
-$ opennlp TokenizerME en-token.bin < article.txt > article-tokenized.txt]]>
+$ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.2-2.5.0.bin < article.txt > article-tokenized.txt]]>
</screen>
It can be done in the same way for the Simple Tokenizer.
</para>
@@ -154,8 +151,7 @@ London share prices were bolstered largely by continued gains on Wall Street and
can be loaded.
<programlisting language="java">
<![CDATA[
-
-try (InputStream modelIn = new FileInputStream("en-token.bin")) {
+try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-tokens-1.2-2.5.0.bin")) {
TokenizerModel model = new TokenizerModel(modelIn);
}]]>
</programlisting>
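With the model in hand, tokenization is one call; a minimal sketch (the input string is illustrative):

    import opennlp.tools.tokenize.TokenizerME;

    TokenizerME tokenizer = new TokenizerME(model);
    // Returns the detected tokens, with punctuation split off as separate tokens
    String[] tokens = tokenizer.tokenize("An input sample sentence.");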
@@ -258,7 +254,7 @@ Arguments description:
To train the English tokenizer use the following command:
<screen>
<![CDATA[
-$ opennlp TokenizerTrainer -model en-token.bin -alphaNumOpt true -lang en -data en-token.train -encoding UTF-8
+$ opennlp TokenizerTrainer -model en-custom-token.bin -alphaNumOpt true -lang en -data en-custom-token.train -encoding UTF-8
Indexing events with TwoPass using cutoff of 5
@@ -291,7 +287,7 @@ Performing 100 iterations.
Writing tokenizer model ... done (0,044s)
Wrote tokenizer model to
-Path: en-token.bin]]>
+Path: en-custom-token.bin]]>
</screen>
</para>
</section>
@@ -314,7 +310,7 @@ Path: en-token.bin]]>
The following sample code illustrates these steps:
<programlisting language="java">
<![CDATA[
-ObjectStream<String> lineStream = new PlainTextByLineStream(new MarkableFileInputStreamFactory(new File("en-sent.train")),
+ObjectStream<String> lineStream = new PlainTextByLineStream(new MarkableFileInputStreamFactory(new File("en-custom-token.train")),
StandardCharsets.UTF_8);
ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);
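The sample stops at the sample stream; a hedged sketch of the training and serialization steps that typically follow (the TokenizerFactory arguments mirror the -alphaNumOpt CLI flag above; the output file name matches the trainer example):

    import java.io.*;
    import opennlp.tools.tokenize.*;
    import opennlp.tools.util.TrainingParameters;

    // "en", no abbreviation dictionary, alphanumeric optimization on (as -alphaNumOpt true)
    TokenizerFactory factory = new TokenizerFactory("en", null, true, null);
    TokenizerModel model = TokenizerME.train(sampleStream, factory,
        TrainingParameters.defaultParams());
    try (OutputStream modelOut = new BufferedOutputStream(
            new FileOutputStream("en-custom-token.bin"))) {
        model.serialize(modelOut);
    }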