engines: kuromojinlp.mdtext textannotationnewmodel.mdtext

rwesten Wed, 20 Mar 2013 08:25:37 -0700

Author: rwesten
Date: Wed Mar 20 15:25:11 2013
New Revision: 1458884

URL: http://svn.apache.org/r1458884
Log:
Added documentation for the TextAnnotation new Model Engine (STANBOL-953) as 
well as the Kuromoji NLP engine for Japanese (STANBOL-980)


Added:
    
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/kuromojinlp.mdtext
    
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/textannotationnewmodel.mdtext

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/kuromojinlp.mdtext
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/kuromojinlp.mdtext?rev=1458884&view=auto
==============================================================================
--- 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/kuromojinlp.mdtext
 (added)
+++ 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/kuromojinlp.mdtext
 Wed Mar 20 15:25:11 2013
@@ -0,0 +1,25 @@
+title: Kuromoji NLP Engine for Japanese
+
+[Kuromoji](http://www.atilika.org/) is an NLP framework contributed to [Apache 
Lucene](http://lucene.apache.org). It is available starting with versions 3.6.2 
and 4.1 of Solr/Lucene. In Stanbol it requires a version newer than 
[revision 1458703](http://svn.apache.org/r1458703), as it only works with the 
stanbol.commons.solr modules compatible with Solr 4.1.
+
+
+## Consumed information
+
+* __Language__ (required): The language of the text needs to be available. It 
is read as specified by 
[STANBOL-613](https://issues.apache.org/jira/browse/STANBOL-613) from the 
metadata of the ContentItem. Effectively this means that a Stanbol Language 
Detection engine needs to be executed before the Kuromoji NLP Engine.
+
+## Supported modules
+
+* __Sentences__: Kuromoji itself does not provide sentence detection. 
Sentences are therefore detected from the POS tagging results: the 
POS tag '記号-句点' (symbol-period) is used for splitting sentences. It is further 
assumed that each text starts and ends with a complete sentence.
+* __Tokens__: Kuromoji is configured to provide tokens for all words and 
punctuation. This is done by configuring an empty stop tag list and 
setting the 'discardPunctuation' property to <code>false</code>.
+* __POS tagging__: The POS tag set used by Kuromoji was mapped to the 
LexicalCategories and POS types as defined by the Stanbol NLP processing 
module. For the String tags the Japanese name is used (e.g. 
'名詞-代名詞-縮約' := Pos.Pronoun,Pos.Participle, description: 
noun-pronoun-contraction: Spoken language contraction made by combining a 
pronoun and the particle 'wa'. e.g. ありゃ, こりゃ, こりゃあ, 
そりゃ, そりゃあ)
+    POS tags are represented by adding _NlpAnnotations#POS_ANNOTATION_'s to 
the _Tokens_ of the _AnalyzedText_ content part. Kuromoji provides only a 
single POS tag per Token.
+* __NER detection__: The POS tag set used by Kuromoji defines POS tags 
describing named entities. Those POS tags are then combined into chunks and 
interpreted as named entities (e.g. '名詞-固有名詞-人名-姓' 
noun-proper-person-surname; '名詞-固有名詞-人名-名' 
noun-proper-person-given_name).
+    Named Entities are represented by adding _NlpAnnotations#NER_ANNOTATION_'s 
to the _Tokens_ of the _AnalyzedText_ content part. In addition, 
'fise:TextAnnotations' are added to the metadata of the ContentItem.
+
+### Confidence
+
+Kuromoji does not provide confidence values for results.
+
+## Configuration
+
+The engine does not provide any custom configuration. However, it supports the 
configuration of the engine name.
\ No newline at end of file
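
The sentence and NER rules described in the kuromojinlp.mdtext diff above can be illustrated with a short, language-neutral sketch. This is not Stanbol or Kuromoji code: the function names, the `(surface, pos_tag)` tuple layout, and the sample tokens are assumptions made for illustration, and the prefix match on person-name tags is only one plausible way to combine such tags into chunks.

```python
# Illustrative sketch only; not the engine's implementation.
SENTENCE_END_TAG = "記号-句点"            # symbol-period
PERSON_TAG_PREFIX = "名詞-固有名詞-人名"  # noun-proper-person (surname / given name)


def split_sentences(tokens):
    """Group (surface, pos_tag) pairs into sentences, closing one at each period tag."""
    sentences, current = [], []
    for surface, tag in tokens:
        current.append((surface, tag))
        if tag == SENTENCE_END_TAG:
            sentences.append(current)
            current = []
    if current:
        # The text is assumed to end with a complete sentence,
        # so any trailing tokens are kept as a final sentence.
        sentences.append(current)
    return sentences


def chunk_person_names(tokens):
    """Merge runs of adjacent person-name tokens into single chunks."""
    chunks, run = [], []
    for surface, tag in tokens:
        if tag.startswith(PERSON_TAG_PREFIX):
            run.append(surface)
        elif run:
            chunks.append("".join(run))
            run = []
    if run:
        chunks.append("".join(run))
    return chunks


# Hand-written sample tokens (not actual Kuromoji output).
SAMPLE = [
    ("山田", "名詞-固有名詞-人名-姓"),
    ("太郎", "名詞-固有名詞-人名-名"),
    ("は", "助詞-係助詞"),
    ("学生", "名詞-一般"),
    ("です", "助動詞"),
    ("。", "記号-句点"),
    ("元気", "名詞-一般"),
    ("です", "助動詞"),
    ("。", "記号-句点"),
]
```

Calling `split_sentences(SAMPLE)` yields two token groups, each ending in the period token, and `chunk_person_names(SAMPLE)` joins the surname and given-name tokens into one chunk.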

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/textannotationnewmodel.mdtext
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/textannotationnewmodel.mdtext?rev=1458884&view=auto
==============================================================================
--- 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/textannotationnewmodel.mdtext
 (added)
+++ 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/textannotationnewmodel.mdtext
 Wed Mar 20 15:25:11 2013
@@ -0,0 +1,11 @@
+Title: TextAnnotation new Model Converter Engine
+
+This Engine converts 
'[fise:TextAnnotation](../enhancementstructure#fisetextannotation)' to include 
the 'fise:selection-prefix' and 'fise:selection-suffix' properties as 
introduced by [STANBOL-987](https://issues.apache.org/jira/browse/STANBOL-987).
+
+It processes all 'fise:TextAnnotation's that select a specific part of the 
text, meaning that they define a 'fise:start' and a 'fise:end' property. 
'fise:TextAnnotation's that already define 'fise:selection-prefix' or 
'fise:selection-suffix' properties are skipped.
+
+## Configuration:
+
+Other than the configuration of the engine's name and ranking, this engine 
supports the following custom property:
+
+* __Prefix Suffix Length__ 
_(enhancer.engines.textannotationnewmodel.prefixSuffixSize)_: Sets the 
length, in characters, of prefixes and suffixes. The default is <code>10</code>. The 
minimum allowed value is <code>3</code>.
\ No newline at end of file
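
To make the prefix/suffix properties described in the textannotationnewmodel.mdtext diff above concrete, here is a minimal sketch of how a prefix and suffix of a configured size could be derived from a selection. It is not the engine's code: the function name `selection_context`, the treatment of the end offset as exclusive, and the truncation at the text boundaries are all assumptions. Only the default of 10 and the minimum of 3 come from the documentation.

```python
def selection_context(text, start, end, size=10):
    """Return (prefix, suffix) of up to `size` characters around text[start:end].

    Hypothetical helper: `end` is treated as exclusive, and the prefix/suffix
    are simply cut short when the selection is near the text boundaries.
    """
    if size < 3:
        raise ValueError("size must be at least 3")
    if not 0 <= start <= end <= len(text):
        raise ValueError("start/end must lie within the text")
    prefix = text[max(0, start - size):start]
    suffix = text[end:end + size]
    return prefix, suffix


TEXT = "The Stanbol Enhancer analyses text."
```

For the selection "Stanbol" (offsets 4 to 11 in `TEXT`), the prefix is cut short to "The " because the selection sits near the start of the text, while the suffix has the full 10 characters.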
