Author: rwesten
Date: Mon Sep 17 10:51:54 2012
New Revision: 1386548

URL: http://svn.apache.org/viewvc?rev=1386548&view=rev
Log:
added documentation for the language detection engine (STANBOL-707)

Added:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langdetectengine.mdtext
Modified:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langidengine.mdtext
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/list.mdtext

Added: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langdetectengine.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langdetectengine.mdtext?rev=1386548&view=auto
==============================================================================
--- 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langdetectengine.mdtext
 (added)
+++ 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langdetectengine.mdtext
 Mon Sep 17 10:51:54 2012
@@ -0,0 +1,34 @@
+Title: The Language Detection Engine: detect the language of a text
+
+The **LangDetect** engine determines the language of text.
+
+## Technical Description
+
+The provided engine is based on the language identifier of the 
[language-detection](http://code.google.com/p/language-detection/) project.
+
+The plain text needed for the detection is retrieved from the processed 
[ContentItem](../contentitem) by searching for a Blob with the media type 
"text/plain".
+
+The result of language identification is added as a 
[fise:TextAnnotation](../enhancementstructure.html#fisetextannotation) to the 
content item's metadata, with the language as the string value of the property
+
+    :::text
+    http://purl.org/dc/terms/language
+
+This RDF snippet illustrates the output:
+
+    :::xml
+    <fise:TextAnnotation 
rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
+        <dc:language>en</dc:language>
+        <fise:confidence>0.99987</fise:confidence>
+        <dc:type rdf:resource="http://purl.org/dc/terms/LinguisticSystem"/>
+        
<dc:creator>org.apache.stanbol.enhancer.engines.langdetect.LanguageDetectionEnhancementEngine</dc:creator>
+    </fise:TextAnnotation>
+
+
+The list of supported languages is available 
[here](http://code.google.com/p/language-detection/wiki/LanguageList).
+
+
+## Configuration options
+
+* <code>org.apache.stanbol.enhancer.engines.langdetect.probe-length</code>: an 
integer specifying how many characters will be used for identification. A value 
of 0 or below means to use the complete text. Otherwise only a substring of the 
specified length taken from the middle of the text will be used. **NOTE** that 
the underlying library already supports random selection of text parts, so 
typically the probe-length feature should not be activated.
+* <code>org.apache.stanbol.enhancer.engines.langdetect.max-suggested</code>: 
The underlying language detection library can suggest multiple 
languages. This property sets the maximum number of suggested languages.
+* <code>stanbol.enhancer.engine.name</code>: As with any EnhancementEngine 
this property can be used to change the name of the Engine. The default is 
"langdetect"
\ No newline at end of file
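
Note on the output format documented in langdetectengine.mdtext above: a client could read the detected language and its confidence from the engine's RDF/XML output roughly as sketched below. This is only an illustration, not part of the commit. The enclosing `rdf:RDF` element, the `fise` namespace URI, and the helper name `detected_languages` are assumptions added here so that the sample parses on its own; the `dc` namespace follows the property URI given in the documentation.

```python
import xml.etree.ElementTree as ET

# Namespace URIs. The fise URI is an assumption; check it against the real output.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/terms/",
    "fise": "http://fise.iks-project.eu/ontology/",
}

# The annotation from the documentation, wrapped in an rdf:RDF root (assumed)
# so that the namespace prefixes are declared.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/terms/"
    xmlns:fise="http://fise.iks-project.eu/ontology/">
  <fise:TextAnnotation rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
    <dc:language>en</dc:language>
    <fise:confidence>0.99987</fise:confidence>
  </fise:TextAnnotation>
</rdf:RDF>"""


def detected_languages(rdf_xml):
    """Return (language, confidence) pairs, highest confidence first."""
    root = ET.fromstring(rdf_xml)
    results = []
    for annotation in root.findall("fise:TextAnnotation", NS):
        language = annotation.findtext("dc:language", default=None, namespaces=NS)
        confidence = annotation.findtext("fise:confidence", default=None, namespaces=NS)
        if language is None:
            continue  # not a language annotation
        results.append((language, float(confidence) if confidence is not None else None))
    # Annotations without a confidence value sort last.
    return sorted(results, key=lambda pair: pair[1] or 0.0, reverse=True)


if __name__ == "__main__":
    for language, confidence in detected_languages(SAMPLE):
        print(language, confidence)
```

When max-suggested is greater than one, the same loop would presumably see one annotation per suggested language, which is why the result is sorted by confidence.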

Modified: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langidengine.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langidengine.mdtext?rev=1386548&r1=1386547&r2=1386548&view=diff
==============================================================================
--- 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langidengine.mdtext
 (original)
+++ 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langidengine.mdtext
 Mon Sep 17 10:51:54 2012
@@ -2,6 +2,8 @@ Title: The Language Identification Engin
 
 The **LangId** engine determines the language of text.
 
+*NOTE*: Users of this engine might want to consider using the 
[LangDetect](langdetectengine) engine instead, because its language detection 
library supports more languages and also delivers better results.
+
 ## Technical Description
 
 The provided engine is based on the language identifier of [Apache 
Tika](http://tika.apache.org/).
@@ -54,6 +56,7 @@ Additional language models can be create
 ## Configuration options
 
 * <code>org.apache.stanbol.enhancer.engines.langid.probe-length</code>: an 
integer specifying how many characters will be used for identification. A value 
of 0 or below means to use the complete text. Otherwise only a substring of the 
specified length taken from the middle of the text will be used. The default 
value is 400 characters.
+* <code>stanbol.enhancer.engine.name</code>: As with any EnhancementEngine 
this property can be used to change the name of the Engine. The default is 
"langid"
 
 ## Usage
 

Modified: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/list.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/list.mdtext?rev=1386548&r1=1386547&r2=1386548&view=diff
==============================================================================
--- 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/list.mdtext
 (original)
+++ 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/list.mdtext
 Mon Sep 17 10:51:54 2012
@@ -6,6 +6,9 @@ This provides an overview about all [Enh
 
 * __[Language Identification Engine](langidengine.html)__
        * language detection for textual content utilizing [Apache 
Tika](http://tika.apache.org/)
+
+* __[Language Detection Engine](langdetectengine.html)__
+       * language detection for textual content utilizing the 
[language-detection](http://code.google.com/p/language-detection/) project
        
 * __[Tika Engine](tikaengine.html)__ (based on [Apache 
Tika](http://tika.apache.org/))
        * content type detection

