Author: rwesten
Date: Thu Oct 3 13:02:32 2013
New Revision: 1528838
URL: http://svn.apache.org/r1528838
Log:
STANBOL-1128: minor updates and corrections to the FST linking engine
documentation
Modified:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-fstfolder.png
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/lucenefstlinking.mdtext
Modified:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-fstfolder.png
URL:
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/fstengine-config-fstfolder.png?rev=1528838&r1=1528837&r2=1528838&view=diff
==============================================================================
Binary files - no diff available.
Modified:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/lucenefstlinking.mdtext
URL:
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/lucenefstlinking.mdtext?rev=1528838&r1=1528837&r2=1528838&view=diff
==============================================================================
---
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/lucenefstlinking.mdtext
(original)
+++
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/lucenefstlinking.mdtext
Thu Oct 3 13:02:32 2013
@@ -49,12 +49,12 @@ The Field Name Encodings work well with
This is the full list of supported Field encodings:
* SolrYard: This supports the encoding used by the Stanbol Entityhub SolrYard
implementation to encode RDF data types and language literals. If you configure
the FST Linking Engine for a Solr index built for the SolrYard you need to use
this encoding
-* MinusPrefix: {lang}-{field} (e.g. "en-name")
-* UnderscorePrefix: {lang}_{field} (e.g. "en_name")
-* AtPrefix: {lang}@{field} (e.g. "en@name")
-* MinusSuffix: {field}-{lang} (e.g. "name-en")
-* UnderscoreSuffix: {field}-{lang} (e.g. "name_en")
-* AtSuffix: {field}-{lang} (e.g. "name@en")
+* MinusPrefix: `{lang}-{field}` (e.g. "en-name")
+* UnderscorePrefix: `{lang}_{field}` (e.g. "en_name")
+* AtPrefix: `{lang}@{field}` (e.g. "en@name")
+* MinusSuffix: `{field}-{lang}` (e.g. "name-en")
+* UnderscoreSuffix: `{field}_{lang}` (e.g. "name_en")
+* AtSuffix: `{field}@{lang}` (e.g. "name@en")
* None: In this case no prefix/suffix rewriting of configured `field` and
`store` values is done. This means that the FST Configuration MUST define the
exact field names in the Solr index for every configured language.
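The prefix/suffix encodings listed above can be sketched as simple string templates. This is only an illustration: `ENCODINGS` and `encode_field` are hypothetical names, not part of the Stanbol API, and the patterns are chosen so that they reproduce the quoted examples. The SolrYard encoding is left out because its format is not described here.

```python
# Hypothetical illustration of the field name encodings; not Stanbol code.
ENCODINGS = {
    "MinusPrefix": "{lang}-{field}",
    "UnderscorePrefix": "{lang}_{field}",
    "AtPrefix": "{lang}@{field}",
    "MinusSuffix": "{field}-{lang}",
    "UnderscoreSuffix": "{field}_{lang}",
    "AtSuffix": "{field}@{lang}",
    # "None": the configured field name is used unchanged
    "None": "{field}",
}


def encode_field(encoding, field, lang):
    """Return the Solr field name for a base field and a language."""
    return ENCODINGS[encoding].format(field=field, lang=lang)


print(encode_field("MinusPrefix", "name", "en"))
print(encode_field("AtSuffix", "name", "en"))
```

With the `None` encoding the language argument is ignored, which is why the FST configuration has to name the exact Solr field for every language in that case.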
#### FST Tagging Configuration
@@ -79,7 +79,7 @@ The following parameters are supported b
* __field__: The indexed field in the configured Solr index. In multilingual
scenarios this might be the 'base name' of the field that is extended by a
prefix or suffix to get the actual field name in the Solr index (see also the
field encoding configuration)
* __stored__ (default: _field_ value) : The field in the Solr index with the
stored label information. This parameter is optional. If not present `stored`
is assumed to be equal to `field`.
-* __fst__ (default based on _field_ value): Optionally allows to manually
specify the base file name of the FST models. Those files are assumed within
the data directory of the configured Solr index under `fst/{fst}.{lang}.fst`.
By default the configured `field` name is used (with non alpha-numeric chars
replaced by '_').If runtime creation is enabled those files will be created if
not present.
+* __fst__ (default based on _field_ value): This parameter allows specifying
the name of the FST file stored within the FST directory (as configured by the
[FST storage location]). The default name is generated by using the `field` with
non alpha-numeric chars replaced by '_'.
* __generate__ (default: false): If enabled the Engine will generate missing
FST models. If this is enabled the engine will also be able to update FST
models after changes to the Solr Index. __NOTE__ that the creation of FST
models is an expensive operation (both CPU and memory wise). The FST engine
uses a pool of low priority threads to create FST models. The size of the pool
can be configured by using the
`enhancer.engines.linking.lucenefst.fstThreadPoolSize` parameter. Because of
this the default is `false`.
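The default for the `fst` parameter described in the list above (the `field` value with non alpha-numeric chars replaced by '_') could look roughly like the following sketch. `default_fst_name` is a hypothetical helper, and treating only ASCII letters and digits as alpha-numeric is an assumption; the engine may use a different character class.

```python
import re


def default_fst_name(field):
    """Derive a default FST base name from a configured field name.

    Assumption: only ASCII letters and digits count as alpha-numeric.
    """
    return re.sub(r"[^A-Za-z0-9]", "_", field)


print(default_fst_name("rdfs:label"))
```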
A more advanced Configuration might look like:
@@ -127,14 +127,16 @@ The configuration options does support p
* `solr-server-name`: the name of the
[ReferencedSolrServer](/docs/trunk/utils/commons-solr#referencedsolrserver) or
[ManagedSolrServer](/docs/trunk/utils/commons-solr#managedsolrserver) holding
the SolrCore (see also [Configuration of the Solr Index])
* `solr-core-name` : the name of the SolrCore
-The default value of this property is `${solr-data-dir}/fst`. To manage FST
models within the Stanbol folder you can us e.g.
`${sling.home}/fst/${solr-server-name}/solr-core-name`.
+The default value of this property is '`${solr-data-dir}/fst`'. To manage FST
models within the Stanbol folder you can use e.g.
'`${sling.home}/fst/${solr-server-name}/solr-core-name`'.
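How `${...}` placeholders in such a value might be resolved can be sketched as follows. This is not the engine's implementation: `resolve_placeholders` and the directory used in the example are hypothetical, and leaving unknown placeholders untouched is an assumption made for this sketch.

```python
import re

# Matches ${name}; names may contain dots and hyphens (e.g. solr-data-dir).
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve_placeholders(template, values):
    """Replace every ${name} in template with values[name].

    Unknown names are left as-is (an assumption for this sketch).
    """
    return PLACEHOLDER.sub(
        lambda match: values.get(match.group(1), match.group(0)), template
    )


# Hypothetical data directory, for illustration only.
values = {"solr-data-dir": "/data/solr/mycore"}
print(resolve_placeholders("${solr-data-dir}/fst", values))
```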
### Entity Cache Configuration
While FST tagging is fully done in-memory the FST linking engine needs to read
information of matching Entities from the Solr index. This requires disc IO and
is typically the part of the process that consumes the most time. The Entity
Cache tries to prevent such disc level IO by caching SolrDocuments containing
only fields required for the linking process (labels, types and (if available)
entity rankings). To further reduce memory requirements only labels in
languages requested by processed ContentItems are stored in the cache. The
Cache uses the LRU semantic and is based on the Solr cache implementation.
-The size of the cache can be configured by using the
`enhancer.engines.linking.lucenefst.entityCacheSize` parameter. The default
size is ~65k entities. Increasing the maximum size of the cache will improve
performance. For small and medium sized vocabularies the cache can be
configured
+The size of the cache can be configured by using the
`enhancer.engines.linking.lucenefst.entityCacheSize` parameter. The default
size is ~65k entities. Increasing the maximum size of the cache will improve
performance.
+
+__TIP:__ For small and medium sized vocabularies the cache can be configured
to be at least as large as the number of Entities in the Vocabulary. In this
case the FST linking engine will operate fully in-memory. In such scenarios
linking was up to 100 times faster than with the [Entityhub Linking Engine](entityhublinking)
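The LRU semantic mentioned above can be illustrated with a generic sketch. The engine itself builds on the Solr cache implementation; the `LruCache` class below is a hypothetical stand-in that only shows the eviction behaviour, and the tiny capacity is chosen for the example.

```python
from collections import OrderedDict


class LruCache:
    """Minimal LRU cache: evicts the least recently used entry when full."""

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        # A hit marks the entry as most recently used.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            # Drop the least recently used entry.
            self._entries.popitem(last=False)


cache = LruCache(max_size=2)
cache.put("entity:a", {"label": "A"})
cache.put("entity:b", {"label": "B"})
cache.get("entity:a")                  # "entity:a" is now most recently used
cache.put("entity:c", {"label": "C"})  # evicts "entity:b"
```

If the maximum size is at least the number of entities, nothing is ever evicted, which matches the fully in-memory case described in the tip.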
### Text Processing Configuration
@@ -150,9 +152,9 @@ The Entity Linking Configuration of this
* <s>__Label Field__ _(enhancer.engines.linking.labelField)_</s>: The label
field is __IGNORED__ as the field holding the labels is anyway provided by the
[FST Tagging Configuration]. That means that the field defined by the _stored_
parameter is used. If the _stored_ parameter is not present it falls back to the
_field_ parameter.
* <s>__Type Field__ _(enhancer.engines.linking.typeField)_</s>: This
configuration gets __IGNORED__ in favor of the
`enhancer.engines.linking.lucenefst.typeField`. See the [Additional Entity
Information] section for details.
* <s>__Redirect Field__ _(enhancer.engines.linking.redirectField)_</s>: Not
implemented. __NOTE__ This might not be possible to implement efficiently, as
those redirects would already need to be considered when building the FST models.
-* <s>__Use EntityRankings (enhancer.engines.linking.useEntityRankings)_</s>:
This configuration gets __IGNORED__. EntityRanking based sorting is enabled as
soon as the _Entity Ranking Field_ is configured.
+* <s>__Use EntityRankings__
_(enhancer.engines.linking.useEntityRankings)_</s>: This configuration gets
__IGNORED__. EntityRanking based sorting is enabled as soon as the _Entity
Ranking Field_ is configured.
* <s>__Lemma based Matching__ _(enhancer.engines.linking.lemmaMatching)_</s>:
Not Yet implemented
-* <s>__Min Match Score__ _(enhancer.engines.linking.minMatchScore)_</s>: Not
Yet Implemented. Currently all linked Entities are added regardless of their
score. However the way the Tagging is done makes it very unlikely to have
suggestions with `fise:confidence` values less as 0.5.
+* <s>__Min Match Score__ _(enhancer.engines.linking.minMatchScore)_</s>: Not
Yet Implemented. The FST linking engine is based on the Lucene Analyzer chains
configured for the _index_ and _store_ field of the FST configuration. An Entity
is only suggested if the Tokens match after the Analyzers were applied.
In addition the following properties are __IGNORED__ as they are not relevant
for the FST Linking Engine: