[ 
https://issues.apache.org/jira/browse/STANBOL-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rupert Westenthaler resolved STANBOL-1153.
------------------------------------------

    Resolution: Fixed

Fixed with http://svn.apache.org/r1521875

> Improve Solr schema used by the Entityhub SolrYard
> --------------------------------------------------
>
>                 Key: STANBOL-1153
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1153
>             Project: Stanbol
>          Issue Type: Improvement
>          Components: Entityhub
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>              Labels: SolrYard
>
> While working on STANBOL-1128 and Issue10 of SolrTextTagger [1] I recognized 
> that the current default Solr schema used by the Entityhub SolrYard could be 
> improved in several ways:
> * Some languages do use the solr.StandardTokenizerFactory together with the 
> solr.WordDelimiterFilterFactory. The WordDelimiterFilter should always be 
> used in combination with the WhitespaceTokenizer
> * The solr.WordDelimiterFilterFactory configuration is not optimal for 
> EntityLinking. It should be changed to:
>     * splitOnCaseChange="0": For EntityLinking "PowerShot" should not be 
> split into "Power", "Shot"
>     * splitOnNumerics="0": Same is true for "j2se". We do not want to suggest 
> this for "j 2 se"
>     * stemEnglishPossessive="1": removing trailing 's from words is OK, even 
> for languages other than English
>     * generateWordParts="1": Splitting "Wi-Fi" to "Wi Fi" should improve 
> EntityLinking results. Maybe not for "Wi-Fi", but for 
> "Mercedes-Entwicklungsleiter". Note that because splitOnCaseChange="0", words 
> such as "PowerShot" will still not be split.
>     * generateNumberParts="1": Splitting "500-42" is OK. Users should rather 
> decide if they would like to link number tokens of the text.
>     * catenateWords="1": Concatenation of words can only improve linking 
> results, so all of the catenate* properties should be enabled. Disabled for 
> query
>     * catenateNumbers="1". Disabled for query
>     * catenateAll="1". Disabled for query
>     * preserveOriginal="1": Activated for indexing (e.g. to keep punctuation 
> marks in labels) but deactivated for query! Otherwise Entities at the end of 
> sentences could be ignored because of punctuation included in the token.
> * solr.ElisionFilterFactory after WordDelimiterFilter. This might cause 
> slower Phrase queries, but has the advantage that fields are compatible with 
> FST linking (SolrTextTagger).
> * NOT using solr.EnglishPossessiveFilterFactory and 
> solr.HyphenatedWordsFilterFactory as those do not provide additional 
> functionality if WordDelimiterFilter is present.
> * NOT enabling enablePositionIncrements for StopWordFilter, as posInc > 1 is 
> not compatible with the SolrTextTagger library used by the FST linking engine
> * enable Norms for all fields (including non String and Text types): As the 
> Entityhub SolrYard supports index-time boosts, norms can be used to sort 
> results based on the popularity of Entities.
> [1] https://github.com/OpenSextant/SolrTextTagger/issues/10
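
As a rough illustration of the list in the issue description, the settings
might translate into a Solr field type along the following lines. This is only
a sketch and not the schema committed in r1521875: the field type name
"text_entitylinking" is made up for this example, the LowerCaseFilterFactory
is an added assumption, and the ElisionFilterFactory is shown without an
articles file.

```xml
<!-- Hypothetical field type; the name is an assumption for illustration. -->
<!-- omitNorms="false" keeps norms so index-time boosts can affect ranking. -->
<fieldType name="text_entitylinking" class="solr.TextField"
           positionIncrementGap="100" omitNorms="false">
  <analyzer type="index">
    <!-- WordDelimiterFilter is paired with the WhitespaceTokenizer -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0" splitOnNumerics="0"
            stemEnglishPossessive="1"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            preserveOriginal="1"/>
    <!-- Elision handling placed after the WordDelimiterFilter -->
    <filter class="solr.ElisionFilterFactory"/>
    <!-- Assumption: lower casing is not part of the list in the issue -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- catenate* and preserveOriginal are disabled at query time -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0" splitOnNumerics="0"
            stemEnglishPossessive="1"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            preserveOriginal="0"/>
    <filter class="solr.ElisionFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a configuration like this, the index side would keep a label such as
"Wi-Fi" in its original form and also add the parts and the concatenated form,
while the query side would only produce the parts.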



--
This message was sent by Atlassian JIRA
(v6.1#6144)
