[jira] [Created] (STANBOL-1153) Improve Solr schema used by the Entityhub SolrYard

Rupert Westenthaler (JIRA) Tue, 10 Sep 2013 05:36:45 -0700

Rupert Westenthaler created STANBOL-1153:
--------------------------------------------


             Summary: Improve Solr schema used by the Entityhub SolrYard
                 Key: STANBOL-1153
                 URL: https://issues.apache.org/jira/browse/STANBOL-1153
             Project: Stanbol
          Issue Type: Improvement
          Components: Entityhub
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler


While working on STANBOL-1128 and Issue10 of SolrTextTagger [1] I noticed 
that the current default Solr schema used by the Entityhub SolrYard could be 
improved in several ways:

* The field types of some languages use the solr.StandardTokenizerFactory 
together with the solr.WordDelimiterFilterFactory. The WordDelimiterFilter 
should always be used in combination with the WhitespaceTokenizer.

* The solr.WordDelimiterFilterFactory configuration is not optimal for 
EntityLinking. It should be changed to:
    * splitOnCaseChange="0": For EntityLinking, "PowerShot" should not be 
split into "Power", "Shot".
    * splitOnNumerics="0": The same is true for "j2se". We do not want to 
suggest it for "j 2 se".
    * stemEnglishPossessive="1": Removing a trailing 's from words is OK, even 
for languages other than English.
    * generateWordParts="1": Splitting "Wi-Fi" into "Wi Fi" should improve 
EntityLinking results. Maybe not for "Wi-Fi", but for 
"Mercedes-Entwicklungsleiter". Note that because splitOnCaseChange="0", words 
such as "PowerShot" will still not be split.
    * generateNumberParts="1": Splitting "500-42" is OK. Whether number tokens 
of the text get linked is better left for users to decide.
    * catenateWords="1": Concatenation of words can only improve linking 
results, so all catenate* properties should be enabled for indexing. Disabled 
for query.
    * catenateNumbers="1": Enabled for indexing. Disabled for query.
    * catenateAll="1": Enabled for indexing. Disabled for query.
    * preserveOriginal="1": Activated for indexing (e.g. to keep punctuation 
marks in labels) but deactivated for query. Otherwise Entities at the end of 
sentences could be ignored because of punctuation included in the token.

* Apply the solr.ElisionFilterFactory after the WordDelimiterFilter. This might 
cause slower phrase queries, but has the advantage that fields are compatible 
with FST linking (SolrTextTagger).
* Do NOT use solr.EnglishPossessiveFilterFactory and 
solr.HyphenatedWordsFilterFactory, as those do not provide additional 
functionality if the WordDelimiterFilter is present.
* Do NOT enable enablePositionIncrements for the stop word filter 
(solr.StopFilterFactory), as posInc > 1 is not compatible with the 
SolrTextTagger library used by the FST linking engine.
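
To illustrate the list above, here is a rough sketch of how such a field type 
might look in schema.xml. This is not the actual SolrYard schema: the field 
type name "text_entitylinking" is hypothetical, the LowerCaseFilterFactory is 
an assumption not mentioned above, and stop word handling is left out. Only 
the tokenizer, the WordDelimiterFilter attribute values and the position of 
the ElisionFilter follow the points listed in this issue.

```xml
<!-- Sketch only: "text_entitylinking" is a hypothetical name -->
<fieldType name="text_entitylinking" class="solr.TextField">
  <analyzer type="index">
    <!-- WordDelimiterFilter is combined with the WhitespaceTokenizer -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- index time: catenate* and preserveOriginal enabled -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0" splitOnNumerics="0"
            stemEnglishPossessive="1"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            preserveOriginal="1"/>
    <!-- ElisionFilter placed after the WordDelimiterFilter -->
    <filter class="solr.ElisionFilterFactory"/>
    <!-- assumption: lower casing is not part of the list above -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- query time: catenate* and preserveOriginal disabled -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0" splitOnNumerics="0"
            stemEnglishPossessive="1"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            preserveOriginal="0"/>
    <filter class="solr.ElisionFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The index and query analyzers differ only in the catenate* and 
preserveOriginal values, so that a query token such as "PowerShot." does not 
keep its trailing punctuation while indexed labels retain their original form.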

[1] https://github.com/OpenSextant/SolrTextTagger/issues/10

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
