This is an automated email from the ASF dual-hosted git repository.

mawiesne pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/opennlp.git


The following commit(s) were added to refs/heads/main by this push:
     new 4e2e7f26 OPENNLP-287: Extend POS Tagger documentation with more information about the tag dictionary (#754)
4e2e7f26 is described below

commit 4e2e7f267c358d1ac8039ffc581331b454f5e6cd
Author: NishantShri4 <[email protected]>
AuthorDate: Thu Mar 20 07:31:02 2025 +0000

    OPENNLP-287: Extend POS Tagger documentation with more information about the tag dictionary (#754)
---
 opennlp-docs/src/docbkx/postagger.xml | 111 ++++++++++++++++++++++++++++++++--
 1 file changed, 107 insertions(+), 4 deletions(-)

diff --git a/opennlp-docs/src/docbkx/postagger.xml b/opennlp-docs/src/docbkx/postagger.xml
index 71a0fc38..79e6ce5d 100644
--- a/opennlp-docs/src/docbkx/postagger.xml
+++ b/opennlp-docs/src/docbkx/postagger.xml
@@ -237,11 +237,114 @@ try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(model
                </para>
                <para>
                The dictionary is defined in a xml format and can be created and stored with the POSDictionary class.
-               Please for now checkout the javadoc and source code of that class.
+               Below is an example to train a custom model using a tag dictionary.
                </para>
-               <para>Note: The format should be documented and sample code should show how to use the dictionary.
-                         Any contributions are very welcome. If you want to contribute please contact us on the mailing list
-                         or comment on the jira issue <ulink url="https://issues.apache.org/jira/browse/OPENNLP-287">OPENNLP-287</ulink>.
+               <para>
+               Sample POS Training material (file : en-custom-pos.train)
+                       <screen>
+                               <![CDATA[
+It_PRON is_OTHER spring_PROPN season_NOUN ._PUNCT The_DET flowers_NOUN are_OTHER red_ADJ and_CCONJ yellow_ADJ ._PUNCT
+Red_NOUN is_OTHER my_DET favourite_ADJ colour_NOUN ._PUNCT]]>
+                       </screen>
+               </para>
+               <para>
+               Sample Tag Dictionary (file : dictionary.xml)
+                       <programlisting language="xml">
+                               <![CDATA[
+<?xml version="1.0" encoding="UTF-8"?>
+ <dictionary case_sensitive="false">
+  <entry tags="PRON">
+    <token>It</token>
+  </entry>
+  <entry tags="OTHER">
+    <token>is</token>
+  </entry>
+  <entry tags="PROPN">
+    <token>Spring</token>
+  </entry>
+  <entry tags="NOUN">
+    <token>season</token>
+  </entry>
+  <entry tags="DET">
+    <token>the</token>
+  </entry>
+  <entry tags="NOUN">
+    <token>flowers</token>
+  </entry>
+  <entry tags="OTHER">
+    <token>are</token>
+  </entry>
+  <entry tags="NOUN">
+    <token>red</token>
+  </entry>
+  <entry tags="CCONJ">
+    <token>and</token>
+  </entry>
+  <entry tags="NOUN">
+    <token>yellow</token>
+  </entry>
+  <entry tags="PRON">
+    <token>my</token>
+  </entry>
+  <entry tags="ADJ">
+    <token>favourite</token>
+  </entry>
+  <entry tags="NOUN">
+    <token>colour</token>
+  </entry>
+  <entry tags="PUNCT">
+    <token>.</token>
+  </entry>
+</dictionary>]]>
+                       </programlisting>
+               </para>
+               <para>Sample code to train a model using above tag dictionary
+                       <programlisting language="java">
+                       <![CDATA[
+POSModel model = null;
+       try {
+               ObjectStream<String> lineStream = new PlainTextByLineStream(
+                               new MarkableFileInputStreamFactory(new File("en-custom-pos.train")), StandardCharsets.UTF_8);
+
+               ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
+
+               TrainingParameters params = ModelUtil.createDefaultTrainingParameters();
+               params.put(TrainingParameters.CUTOFF_PARAM, 0);
+
+               POSTaggerFactory factory = new POSTaggerFactory();
+               TagDictionary dict = factory.createTagDictionary(new File("dictionary.xml"));
+               factory.setTagDictionary(dict);
+
+               model = POSTaggerME.train("eng", sampleStream, params, factory);
+
+               OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("en-custom-pos-maxent.bin"));
+               model.serialize(modelOut);
+
+       } catch (IOException e) {
+               e.printStackTrace();
+       }]]>
+                       </programlisting>
+               </para>
+               <para>
+               The custom model is then used to tag a sequence.
+               <programlisting language="java">
+                       <![CDATA[
+String[] sent = new String[]{"Spring", "is", "my", "favourite", "season", "."};
+String[] tags = new POSTaggerME(model).tag(sent);
+Arrays.stream(tags).forEach(k -> System.out.print(k + " "));]]>
+               </programlisting>
+               </para>
+               <para>
+                       <literallayout>
+                               Input
+                                   Sentence:   Spring is my favourite season.
+
+                               Output
+                                   POS Tags using the custom model (en-custom-pos-maxent.bin): PROPN OTHER PRON ADJ NOUN PUNCT
+
+                               Output with the default model
+                                   POS Tags using the default model (opennlp-en-ud-ewt-pos-1.2-2.5.0.bin):     NOUN AUX PRON ADJ NOUN PUNCT
+                       </literallayout>
                </para>
                </section>
                </section>
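
Note on the sample dictionary in the diff above: a file in that layout can also be generated from word_TAG training lines instead of being written by hand. The following is a minimal sketch using only the Java standard library. The class TagDictionarySketch and its methods are hypothetical helpers, not part of OpenNLP or of this commit. Lower-casing tokens mirrors case_sensitive="false" from the sample, and joining several tags of one token with a single space inside the tags attribute is an assumption about the dictionary format that should be checked against the POSDictionary javadoc.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of OpenNLP or of the commit.
class TagDictionarySketch {

    // Collects every tag seen for each lower-cased token in word_TAG lines.
    static Map<String, Set<String>> collectTags(List<String> lines) {
        Map<String, Set<String>> tagsByToken = new TreeMap<>();
        for (String line : lines) {
            for (String pair : line.trim().split("\\s+")) {
                // The last underscore separates the token from its tag.
                int sep = pair.lastIndexOf('_');
                if (sep <= 0 || sep == pair.length() - 1) {
                    continue; // skip empty or malformed pairs
                }
                String token = pair.substring(0, sep).toLowerCase(Locale.ROOT);
                String tag = pair.substring(sep + 1);
                tagsByToken.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(tag);
            }
        }
        return tagsByToken;
    }

    // Escapes the characters that are unsafe in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Renders the collected tags in the layout of the sample dictionary.xml.
    // Assumption: multiple tags for one token are separated by a space.
    static String toXml(Map<String, Set<String>> tagsByToken) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<dictionary case_sensitive=\"false\">\n");
        for (Map.Entry<String, Set<String>> e : tagsByToken.entrySet()) {
            xml.append("  <entry tags=\"")
               .append(escape(String.join(" ", e.getValue())))
               .append("\">\n");
            xml.append("    <token>").append(escape(e.getKey())).append("</token>\n");
            xml.append("  </entry>\n");
        }
        xml.append("</dictionary>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "It_PRON is_OTHER spring_PROPN season_NOUN ._PUNCT",
                "Red_NOUN is_OTHER my_DET favourite_ADJ colour_NOUN ._PUNCT");
        System.out.print(toXml(collectTags(lines)));
    }
}
```

A dictionary generated this way lists every tag observed for a token, so it restricts the tagger less than the hand-written sample, which deliberately maps some tokens (for example "my") to a tag that differs from the training data.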
