Author: rwesten
Date: Wed Mar 20 15:25:11 2013
New Revision: 1458884
URL: http://svn.apache.org/r1458884
Log:
Added documentation for the TextAnnotation new Model Engine (STANBOL-953) as
well as the Kuromoji NLP engine for Japanese (STANBOL-980)
Added:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/kuromojinlp.mdtext
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/textannotationnewmodel.mdtext
Added:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/kuromojinlp.mdtext
URL:
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/kuromojinlp.mdtext?rev=1458884&view=auto
==============================================================================
---
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/kuromojinlp.mdtext
(added)
+++
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/kuromojinlp.mdtext
Wed Mar 20 15:25:11 2013
@@ -0,0 +1,25 @@
+title: Kuromoji NLP Engine for Japanese
+
+[Kuromoji](http://www.atilika.org/) is an NLP framework contributed to [Apache
Lucene](http://lucene.apache.org). It is available starting with versions 3.6.2
and 4.1 of Solr/Lucene. In Stanbol it requires a version newer than
[revision 1458703](http://svn.apache.org/r1458703), as it only works with the
stanbol.commons.solr modules compatible with Solr 4.1.
+
+
+## Consumed information
+
+* __Language__ (required): The language of the text needs to be available. It
is read as specified by
[STANBOL-613](https://issues.apache.org/jira/browse/STANBOL-613) from the
metadata of the ContentItem. Effectively, this means that a Stanbol Language
Detection engine needs to be executed before the Kuromoji NLP Engine.
+
+## Supported modules
+
+* __Sentences__: Kuromoji itself does not provide sentence detection.
Sentences are therefore detected using the POS tagging results: the POS tag
'記号-句点' (symbol-period) is used for splitting sentences. It is further
assumed that each text starts and ends with a complete sentence.
+* __Tokens__: Kuromoji is configured to provide tokens for all words and
punctuation. This is done by configuring an empty stop tag list and by
setting the 'discardPunctuation' property to <code>false</code>.
+* __POS tagging__: The POS tag set used by Kuromoji was mapped to the
LexicalCategories and POS types as defined by the Stanbol NLP processing
module. For the String tags the Japanese name is used (e.g.
'名詞-代名詞-縮約' := Pos.Pronoun, Pos.Participle; description:
noun-pronoun-contraction: spoken language contraction made by combining a
pronoun and the particle 'wa', e.g. ありゃ, こりゃ, こりゃあ, そりゃ,
そりゃあ).
+ POS tags are represented by adding _NlpAnnotations#POS_ANNOTATION_'s to
the _Tokens_ of the _AnalyzedText_ content part. Kuromoji provides only a
single POS tag per Token.
+* __NER detection__: The POS tag set used by Kuromoji defines POS tags
describing named entities. Those POS tags are then combined into chunks and
interpreted as named entities (e.g. '名詞-固有名詞-人名-姓'
noun-proper-person-surname; '名詞-固有名詞-人名-名'
noun-proper-person-given_name).
+ Named Entities are represented by adding _NlpAnnotations#NER_ANNOTATION_'s
to the _Tokens_ of the _AnalyzedText_ content part. In addition,
'fise:TextAnnotations' are added to the metadata of the ContentItem.
+
+### Confidence
+
+Kuromoji does not provide confidence values for results.
+
+## Configuration
+
+The engine does not provide any custom configuration. However, it supports
the configuration of the engine name.
\ No newline at end of file
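
The sentence detection rule described under "Supported modules" in kuromojinlp.mdtext can be illustrated with a minimal, self-contained sketch. It is not the engine's actual implementation: the class `PosSentenceSplitter`, its `Token` holder and the placeholder tag `OTHER` are hypothetical names introduced for illustration. The only thing taken from the documentation is the rule itself, namely that a token carrying the symbol-period POS tag (記号-句点, written with Unicode escapes in the code) ends a sentence and that the text is assumed to end with a complete sentence.

```java
import java.util.ArrayList;
import java.util.List;

public class PosSentenceSplitter {

    /** Kuromoji POS tag "symbol-period", written with Unicode escapes. */
    public static final String SENTENCE_END_TAG = "\u8a18\u53f7-\u53e5\u70b9";

    /** Hypothetical token holder: surface text plus its single POS tag. */
    public static final class Token {
        public final String text;
        public final String posTag;

        public Token(String text, String posTag) {
            this.text = text;
            this.posTag = posTag;
        }
    }

    /** Groups tokens into sentences, closing a sentence after each period token. */
    public static List<List<Token>> split(List<Token> tokens) {
        List<List<Token>> sentences = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (Token token : tokens) {
            current.add(token);
            if (SENTENCE_END_TAG.equals(token.posTag)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        // Trailing tokens without a closing period still form a sentence,
        // because the text is assumed to end with a complete sentence.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("w1", "OTHER"));          // placeholder word token
        tokens.add(new Token("p1", SENTENCE_END_TAG)); // sentence-ending period
        tokens.add(new Token("w2", "OTHER"));
        tokens.add(new Token("p2", SENTENCE_END_TAG));
        System.out.println("sentences: " + split(tokens).size());
    }
}
```

In the real engine the tokens and their POS tags would come from Kuromoji, configured as described above with an empty stop tag list and 'discardPunctuation' set to false, so that the period tokens are not dropped before the split.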
Added:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/textannotationnewmodel.mdtext
URL:
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/textannotationnewmodel.mdtext?rev=1458884&view=auto
==============================================================================
---
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/textannotationnewmodel.mdtext
(added)
+++
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/textannotationnewmodel.mdtext
Wed Mar 20 15:25:11 2013
@@ -0,0 +1,11 @@
+Title: TextAnnotation new Model Converter Engine
+
+This Engine converts
'[fise:TextAnnotation](../enhancementstructure#fisetextannotation)' to include
the 'fise:selection-prefix' and 'fise:selection-suffix' properties as
introduced by [STANBOL-987](https://issues.apache.org/jira/browse/STANBOL-987).
+
+It processes all 'fise:TextAnnotation's that select a specific part of the
text, meaning that they define a 'fise:start' and a 'fise:end' property.
'fise:TextAnnotation's that already define 'fise:selection-prefix' or
'fise:selection-suffix' properties are skipped.
+
+## Configuration
+
+Other than the configuration of the engine's name and ranking, this engine
supports the following custom property:
+
+* __Prefix Suffix Length__
_(enhancer.engines.textannotationnewmodel.prefixSuffixSize)_: Sets the length,
in characters, of prefixes and suffixes. The default is <code>10</code>. The
minimum allowed value is <code>3</code>.
\ No newline at end of file
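
The effect of the prefixSuffixSize property described in textannotationnewmodel.mdtext can be sketched as follows. This is an illustrative sketch only, not the engine's code. It assumes that 'fise:selection-prefix' holds the characters immediately before 'fise:start', that 'fise:selection-suffix' holds the characters immediately after 'fise:end', and that 'fise:end' is an exclusive offset. The class `SelectionContext` and its methods are hypothetical names; only the default of 10 and the minimum of 3 are taken from the documentation.

```java
public class SelectionContext {

    /** Default prefix/suffix length as documented for the engine. */
    public static final int DEFAULT_SIZE = 10;
    /** Minimum allowed prefix/suffix length as documented for the engine. */
    public static final int MIN_SIZE = 3;

    /** Returns up to 'size' characters directly before the selection start. */
    public static String prefix(String text, int start, int size) {
        checkSize(size);
        return text.substring(Math.max(0, start - size), start);
    }

    /** Returns up to 'size' characters directly after the (exclusive) selection end. */
    public static String suffix(String text, int end, int size) {
        checkSize(size);
        return text.substring(end, Math.min(text.length(), end + size));
    }

    private static void checkSize(int size) {
        if (size < MIN_SIZE) {
            throw new IllegalArgumentException(
                "prefix/suffix size must be at least " + MIN_SIZE + ", got " + size);
        }
    }

    public static void main(String[] args) {
        String text = "The Stanbol Enhancer analyses text.";
        int start = 4;  // offset of "Stanbol"
        int end = 11;   // exclusive end offset of "Stanbol"
        System.out.println("[" + prefix(text, start, DEFAULT_SIZE) + "]");
        System.out.println("[" + suffix(text, end, DEFAULT_SIZE) + "]");
    }
}
```

Near the start or end of the text, fewer characters than the configured size are available, so the sketch simply clips the prefix or suffix to the text boundaries.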