Rupert Westenthaler created STANBOL-1153:
--------------------------------------------
Summary: Improve Solr schema used by the Entityhub SolrYard
Key: STANBOL-1153
URL: https://issues.apache.org/jira/browse/STANBOL-1153
Project: Stanbol
Issue Type: Improvement
Components: Entityhub
Reporter: Rupert Westenthaler
Assignee: Rupert Westenthaler
While working on STANBOL-1128 and Issue 10 of SolrTextTagger [1] I noticed
that the current default Solr schema used by the Entityhub SolrYard could be
improved in several ways:
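* The field types of some languages use the solr.StandardTokenizerFactory
together with the solr.WordDelimiterFilterFactory. The WordDelimiterFilter
should always be used in combination with the WhitespaceTokenizer, because the
StandardTokenizer already splits on (and drops) most of the delimiters the
filter is supposed to handle.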
* The solr.WordDelimiterFilterFactory configuration is not optimal for
EntityLinking. It should be changed to:
  * splitOnCaseChange="0": For EntityLinking, "PowerShot" should not be
    split into "Power" and "Shot".
  * splitOnNumerics="0": The same is true for "j2se". We do not want to
    suggest it for "j 2 se".
  * stemEnglishPossessive="1": Removing a trailing 's from words is OK, even
    for languages other than English.
  * generateWordParts="1": Splitting "Wi-Fi" into "Wi" and "Fi" should improve
    EntityLinking results. Maybe not for "Wi-Fi" itself, but for
    "Mercedes-Entwicklungsleiter". Note that because splitOnCaseChange="0",
    words such as "PowerShot" will still not be split.
  * generateNumberParts="1": Splitting "500-42" is OK. Users should rather
    decide whether they want to link number tokens of the text.
  * catenateWords="1": Concatenation of words can only improve linking
    results, so all catenate* properties should be enabled for indexing.
    Disabled for query.
  * catenateNumbers="1": Disabled for query.
  * catenateAll="1": Disabled for query.
  * preserveOriginal="1": Activated for indexing (e.g. to keep punctuation
    marks in labels) but deactivated for query. Otherwise Entities at the end
    of sentences could be missed because the punctuation is included in the
    token.
* Applying the solr.ElisionFilterFactory after the WordDelimiterFilter. This
might slow down phrase queries, but has the advantage that the fields are
compatible with FST linking (SolrTextTagger).
* NOT using solr.EnglishPossessiveFilterFactory and
solr.HyphenatedWordsFilterFactory, as those do not provide additional
functionality if the WordDelimiterFilter is present.
* NOT enabling enablePositionIncrements for the StopFilter, as position
increments > 1 are not compatible with the SolrTextTagger library used by the
FST linking engine.
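
To make the list above easier to follow, here is a rough sketch of how such a field type might look in schema.xml. It is not the final schema: the fieldType name "text_linking" is made up for this example, the ElisionFilter is shown without an articles file (so its built-in defaults would apply), and the StopFilter is left out entirely because its configuration is language specific.

```xml
<!-- Sketch only. The name "text_linking" is hypothetical and not part of the current schema. -->
<fieldType name="text_linking" class="solr.TextField">
  <analyzer type="index">
    <!-- WhitespaceTokenizer leaves "Wi-Fi" or "500-42" as single tokens,
         so the WordDelimiterFilter gets to decide how to split them -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Index time: generate parts, catenate everything, keep the original token -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            stemEnglishPossessive="1"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            preserveOriginal="1"/>
    <!-- Elision handling comes after the WordDelimiterFilter -->
    <filter class="solr.ElisionFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Query time: same splitting rules, but no catenation and no original token -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            stemEnglishPossessive="1"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            preserveOriginal="0"/>
    <filter class="solr.ElisionFilterFactory"/>
  </analyzer>
</fieldType>
```

The only difference between the two analyzers is the catenate* and preserveOriginal settings, which matches the "disabled for query" notes in the list.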
[1] https://github.com/OpenSextant/SolrTextTagger/issues/10