engines: fstengine-config-fstfolder.png lucenefstlinking.mdtext

rwesten Thu, 03 Oct 2013 06:03:15 -0700

Author: rwesten
Date: Thu Oct  3 13:02:32 2013
New Revision: 1528838

URL: http://svn.apache.org/r1528838
Log:
STANBOL-1128: minor updates and corrections to the FST linking engine 
documentation


Modified:
    
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-fstfolder.png
    
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/lucenefstlinking.mdtext

Modified: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-fstfolder.png
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-fstfolder.png?rev=1528838&r1=1528837&r2=1528838&view=diff
==============================================================================
Binary files - no diff available.

Modified: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/lucenefstlinking.mdtext
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/lucenefstlinking.mdtext?rev=1528838&r1=1528837&r2=1528838&view=diff
==============================================================================
--- 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/lucenefstlinking.mdtext
 (original)
+++ 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/lucenefstlinking.mdtext
 Thu Oct  3 13:02:32 2013
@@ -49,12 +49,12 @@ The Field Name Encodings work well with 
 This is the full list of supported Field encodings:
 
 * SolrYard: This supports the encoding used by the Stanbol Entityhub SolrYard 
implementation to encode RDF data types and language literals. If you configure 
the FST Linking Engine for a Solr index built for the SolrYard you need to use 
this encoding
-* MinusPrefix: {lang}-{field} (e.g. "en-name")
-* UnderscorePrefix: {lang}_{field} (e.g. "en_name")
-* AtPrefix: {lang}@{field} (e.g. "en@name")
-* MinusSuffix: {field}-{lang} (e.g. "name-en")
-* UnderscoreSuffix: {field}-{lang} (e.g. "name_en")
-* AtSuffix: {field}-{lang} (e.g. "name@en")
+* MinusPrefix: `{lang}-{field}` (e.g. "en-name")
+* UnderscorePrefix: `{lang}_{field}` (e.g. "en_name")
+* AtPrefix: `{lang}@{field}` (e.g. "en@name")
+* MinusSuffix: `{field}-{lang}` (e.g. "name-en")
+* UnderscoreSuffix: `{field}_{lang}` (e.g. "name_en")
+* AtSuffix: `{field}@{lang}` (e.g. "name@en")
 * None: In this case no prefix/suffix rewriting of configured `field` and 
`store` values is done. This means that the FST Configuration MUST define the 
exact field names in the Solr index for every configured language.
 
 #### FST Tagging Configuration
@@ -79,7 +79,7 @@ The following parameters are supported b
 
 * __field__: The indexed field in the configured Solr index. In multilingual 
scenarios this might be the 'base name' of the field that is extended by a 
prefix or suffix to get the actual field name in the Solr index (see also the 
field encoding configuration)
 * __stored__ (default: _field_ value) : The field in the Solr index with the 
stored label information. This parameter is optional. If not present `stored` 
is assumed to be equal to `field`.
-* __fst__ (default based on _field_ value): Optionally allows to manually 
specify the base file name of the FST models. Those files are assumed within 
the data directory of the configured Solr index under `fst/{fst}.{lang}.fst`. 
By default the configured `field` name is used (with non alpha-numeric chars 
replaced by '_').If runtime creation is enabled those files will be created if 
not present.
+* __fst__ (default based on _field_ value): This parameter allows specifying 
the name of the FST file stored within the FST directory (as configured by the 
[FST storage location]). The default name is generated from the `field` value 
with non-alphanumeric chars replaced by '_'.
 * __generate__ (default: false): If enabled the Engine will generate missing 
FST models. If this is enabled the engine will also be able to update FST 
models after changes to the Solr Index. __NOTE__ that the creation of FST 
models is an expensive operation (both CPU and memory wise). The FST engine 
uses a pool of low priority threads to create FST models. The size of the pool 
can be configured by using the 
`enhancer.engines.linking.lucenefst.fstThreadPoolSize` parameter. Because of 
this the default is `false`.
 
 A more advanced Configuration might look like:
@@ -127,14 +127,16 @@ The configuration options does support p
 * `solr-server-name`: the name of the 
[ReferencedSolrServer](/docs/trunk/utils/commons-solr#referencedsolrserver) or 
[ManagedSolrServer](/docs/trunk/utils/commons-solr#managedsolrserver) holding 
the SolrCore (see also [Configuration of the Solr Index])
 * `solr-core-name` : the name of the SolrCore
 
-The default value of this property is `${solr-data-dir}/fst`. To manage FST 
models within the Stanbol folder you can us e.g. 
`${sling.home}/fst/${solr-server-name}/solr-core-name`.
+The default value of this property is '`${solr-data-dir}/fst`'. To manage FST 
models within the Stanbol folder you can use e.g. 
'`${sling.home}/fst/${solr-server-name}/solr-core-name`'.
 
 
 ### Entity Cache Configuration
 
 While FST tagging is fully done in-memory the FST linking engine needs to read 
information of matching Entities from the Solr index. This requires disc IO and 
is typically the part of the process that consumes the most time. The Entity 
Cache tries to prevent such disc level IO by caching SolrDocuments containing 
only fields required for the linking process (labels, types and (if available) 
entity rankings).  To further reduce memory requirements only labels in 
languages requested by processed ContentItems are stored in the cache. The 
Cache uses the LRU semantic and is based on the Solr cache implementation.
 
-The size of the cache can be configured by using the 
`enhancer.engines.linking.lucenefst.entityCacheSize` parameter. The default 
size is ~65k entities. Increasing the maximum size of the cache will improve 
performance. For small and medium sized vocabularies the cache can be 
configured 
+The size of the cache can be configured by using the 
`enhancer.engines.linking.lucenefst.entityCacheSize` parameter. The default 
size is ~65k entities. Increasing the maximum size of the cache will improve 
performance. 
+
+__TIP:__ For small and medium sized vocabularies the cache can be configured 
to be at least as large as the number of Entities in the Vocabulary. In this 
case the FST linking engine will operate fully in-memory. For such scenarios 
linking was up to 100 times faster than with the [Entityhub Linking Engine](entityhublinking)
 
 
 ### Text Processing Configuration
@@ -150,9 +152,9 @@ The Entity Linking Configuration of this
 * <s>__Label Field__ _(enhancer.engines.linking.labelField)_</s>: The label 
field is __IGNORED__ as the field holding the labels is anyway provided by the 
[FST Tagging Configuration]. That means that the field defined by the _stored_ 
parameter is used. If the _stored_ parameter is not present it falls back to the 
_field_ parameter.
 * <s>__Type Field__ _(enhancer.engines.linking.typeField)_</s>: This 
configuration gets __IGNORED__ in favor of the 
`enhancer.engines.linking.lucenefst.typeField`. See the [Additional Entity 
Information] section for details. 
 * <s>__Redirect Field__ _(enhancer.engines.linking.redirectField)_</s>: Not 
implemented. __NOTE__ This might not be possible to efficiently implement, as 
those redirects would already need to be considered when building the FST models.
-* <s>__Use EntityRankings (enhancer.engines.linking.useEntityRankings)_</s>: 
This configuration gets __IGNORED__. EntityRanking based sorting is enabled as 
soon as the _Entity Ranking Field_ is configured.
+* <s>__Use EntityRankings__ 
_(enhancer.engines.linking.useEntityRankings)_</s>: This configuration gets 
__IGNORED__. EntityRanking based sorting is enabled as soon as the _Entity 
Ranking Field_ is configured.
 * <s>__Lemma based Matching__ _(enhancer.engines.linking.lemmaMatching)_</s>: 
Not Yet implemented
-* <s>__Min Match Score__ _(enhancer.engines.linking.minMatchScore)_</s>: Not 
Yet Implemented. Currently all linked Entities are added regardless of their 
score. However the way the Tagging is done makes it very unlikely to have 
suggestions with `fise:confidence` values less as 0.5.
+* <s>__Min Match Score__ _(enhancer.engines.linking.minMatchScore)_</s>: Not 
Yet Implemented. The FST linking engine is based on the Lucene Analyzer chains 
configured for the _index_ and _store_ field of the FST configuration. Only if 
Tokens do match after the Analyzers were applied an Entity is suggested.
 
 In addition the following properties are __IGNORED__ as they are not relevant 
for the FST Linking Engine:
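
As a side note on the field name encodings listed in the first hunk of this diff: the naming rules can be illustrated with a small Python sketch. This is not Stanbol code. The function names `encode_field` and `default_fst_name` are hypothetical, the suffix patterns are taken from the quoted examples ("name_en", "name@en"), the SolrYard encoding is left out because its scheme is not spelled out here, and treating "non alpha-numeric" as "anything outside ASCII letters and digits" is an assumption.

```python
import re

# Patterns as listed in the documentation; keys are the encoding names.
# "None" means the configured field name is used unchanged.
ENCODINGS = {
    "MinusPrefix": "{lang}-{field}",
    "UnderscorePrefix": "{lang}_{field}",
    "AtPrefix": "{lang}@{field}",
    "MinusSuffix": "{field}-{lang}",
    "UnderscoreSuffix": "{field}_{lang}",
    "AtSuffix": "{field}@{lang}",
    "None": "{field}",
}


def encode_field(encoding, field, lang):
    """Return the Solr field name for a base field and a language (hypothetical helper)."""
    return ENCODINGS[encoding].format(lang=lang, field=field)


def default_fst_name(field):
    """Default FST base name: the field with non-alphanumeric chars replaced by '_'.

    Assumption: only ASCII letters and digits count as alphanumeric.
    """
    return re.sub(r"[^A-Za-z0-9]", "_", field)
```

For example, `encode_field("AtSuffix", "name", "en")` yields `"name@en"`, and `default_fst_name("rdfs:label")` yields `"rdfs_label"`.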
