Author: rwesten
Date: Mon Sep 17 10:51:54 2012
New Revision: 1386548
URL: http://svn.apache.org/viewvc?rev=1386548&view=rev
Log:
added documentation for the language detection engine (STANBOL-707)
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langdetectengine.mdtext
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langidengine.mdtext
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/list.mdtext
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langdetectengine.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langdetectengine.mdtext?rev=1386548&view=auto
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langdetectengine.mdtext
(added)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langdetectengine.mdtext
Mon Sep 17 10:51:54 2012
@@ -0,0 +1,34 @@
+Title: The Language Detection Engine: detect the language of a text
+
+The **LangDetect** engine determines the language of text.
+
+## Technical Description
+
+The provided engine is based on the language identifier of the
[language-detection](http://code.google.com/p/language-detection/) project.
+
+The plain text needed for the detection is retrieved from the processed
[ContentItem](../contentitem) by searching for a Blob with the media type
"text/plain".
+
+The result of language identification is added as a
[fise:TextAnnotation](../enhancementstructure.html#fisetextannotation) to the
content item's metadata. The detected language is the string value of the property
+
+    :::text
+    http://purl.org/dc/terms/language
+
+This RDF snippet illustrates the output:
+
+    :::xml
+    <fise:TextAnnotation rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
+        <dc:language>en</dc:language>
+        <fise:confidence>0.99987</fise:confidence>
+        <dc:type rdf:resource="http://purl.org/dc/terms/LinguisticSystem"/>
+        <dc:creator>org.apache.stanbol.enhancer.engines.langdetect.LanguageDetectionEnhancementEngine</dc:creator>
+    </fise:TextAnnotation>
+
+
+The list of supported languages is available
[here](http://code.google.com/p/language-detection/wiki/LanguageList).
+
+
+## Configuration options
+
+* <code>org.apache.stanbol.enhancer.engines.langdetect.probe-length</code>: an
integer specifying how many characters are used for identification. A value
of 0 or below means that the complete text is used. Otherwise only a substring of
the specified length, taken from the middle of the text, is used. **NOTE**: the
underlying library already supports random selection of text parts, so the
probe-length feature should typically not be activated.
+* <code>org.apache.stanbol.enhancer.engines.langdetect.max-suggested</code>:
the maximum number of suggested languages. The underlying language detection
library can annotate multiple languages; this property limits how many are suggested.
+* <code>stanbol.enhancer.engine.name</code>: As with any EnhancementEngine,
this property can be used to change the name of the engine. The default is
"langdetect".
\ No newline at end of file
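
The probe-length rule described in the configuration options above (use the whole text for values of 0 or below, otherwise a substring of the given length taken from the middle) can be illustrated with a minimal sketch. The class name `ProbeSketch` and the method `middleProbe` are hypothetical and are not taken from the engine's source. The exact rounding of the start offset, and returning the whole text when it is shorter than the probe length, are also assumptions.

```java
/**
 * Illustrative sketch only: ProbeSketch and middleProbe are hypothetical
 * names, not part of the Stanbol LangDetect engine.
 */
final class ProbeSketch {

    private ProbeSketch() {
    }

    /**
     * Returns the part of the text that would be used for detection.
     * A probeLength of 0 or below selects the complete text. Texts that
     * are not longer than probeLength are returned unchanged (assumption).
     */
    static String middleProbe(String text, int probeLength) {
        if (probeLength <= 0 || text.length() <= probeLength) {
            return text;
        }
        // Centre the window; integer division rounds the start offset down.
        int start = (text.length() - probeLength) / 2;
        return text.substring(start, start + probeLength);
    }

    public static void main(String[] args) {
        // A window of 4 characters taken from the middle of a 10 character text.
        System.out.println(middleProbe("abcdefghij", 4));
        // 0 disables the probe, so the complete text is used.
        System.out.println(middleProbe("abcdefghij", 0));
    }
}
```

As the note on probe-length says, the underlying library already samples text parts itself, so a sketch like this is mainly useful for understanding what a non-zero setting would discard.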
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langidengine.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langidengine.mdtext?rev=1386548&r1=1386547&r2=1386548&view=diff
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langidengine.mdtext
(original)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langidengine.mdtext
Mon Sep 17 10:51:54 2012
@@ -2,6 +2,8 @@ Title: The Language Identification Engin
The **LangId** engine determines the language of text.
+*NOTE*: Users of this engine might want to consider using the
[LangDetect](langdetectengine) engine instead, because the language detection
library used by LangDetect supports more languages and also delivers better results.
+
## Technical Description
The provided engine is based on the language identifier of [Apache
Tika](http://tika.apache.org/).
@@ -54,6 +56,7 @@ Additional language models can be create
## Configuration options
* <code>org.apache.stanbol.enhancer.engines.langid.probe-length</code>: an
integer specifying how many characters will be used for identification. A value
of 0 or below means to use the complete text. Otherwise only a substring of the
specified length taken from the middle of the text will be used. The default
value is 400 characters.
+* <code>stanbol.enhancer.engine.name</code>: As with any EnhancementEngine
this property can be used to change the name of the Engine. The default is
"langid"
## Usage
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/list.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/list.mdtext?rev=1386548&r1=1386547&r2=1386548&view=diff
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/list.mdtext
(original)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/components/enhancer/engines/list.mdtext
Mon Sep 17 10:51:54 2012
@@ -6,6 +6,9 @@ This provides an overview about all [Enh
* __[Language Identification Engine](langidengine.html)__
* language detection for textual content utilizing [Apache
Tika](http://tika.apache.org/)
+
+* __[Language Detection Engine](langdetectengine.html)__
+ * language detection for textual content utilizing the
[language-detection](http://code.google.com/p/language-detection/) project
* __[Tika Engine](tikaengine.html)__ (based on [Apache
Tika](http://tika.apache.org/))
* content type detection