Re: [jira] Closed: (STANBOL-89) SolrYard uses string field for natural text queries

Fabian Christ Thu, 17 Feb 2011 06:07:35 -0800

Hi,

this issue is "closed" with resolution "Fixed".


There was a discussion a few weeks ago on the list about when to close
issues. As I understood it, fixed issues should be set to "Resolved" and
will only be closed when we do a release.

 - Fabian

2011/2/16 Rupert Westenthaler (JIRA) <[email protected]>:
>
>     [ https://issues.apache.org/jira/browse/STANBOL-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Rupert Westenthaler closed STANBOL-89.
> --------------------------------------
>
>    Resolution: Fixed
>
> Fixed with Revision 1071231
>
> This change does invalidate old indexes, because text searches within string
> fields had not really worked before.
> However, to benefit from this change one would need to update the indices.
>
>> SolrYard uses string field for natural text queries
>> ---------------------------------------------------
>>
>>                 Key: STANBOL-89
>>                 URL: https://issues.apache.org/jira/browse/STANBOL-89
>>             Project: Stanbol
>>          Issue Type: Bug
>>          Components: Entity Hub
>>            Reporter: Rupert Westenthaler
>>            Assignee: Rupert Westenthaler
>>            Priority: Minor
>>
>> This describes a change to the way the SolrYard does index values with the 
>> data type xsd:string in order to improve the support for natural language 
>> text searches for such values. This change will remove a wrong assumption 
>> present in the current implementation. Details below!
>> Background:
>> The Entityhub distinguishes "natural language text" from normal values such 
>> as integer, floats, dates and string values. This is mainly because one 
>> might want to process natural language differently than normal string 
>> values. For example, when processing natural language text one might want
>> to use things like white space separators, stop word filters and/or
>> stemming, but for ISBN numbers, article numbers, or postal codes such
>> algorithms will lead to unwanted effects.
>> This distinction is not specific to the Entityhub; it is also present
>> within RDF. RDF defines "PlainLiterals" (with an optional xml:lang
>> attribute) used to represent natural language text and "TypedLiterals" (with
>> an optional xsd data type) to represent other values (including xsd:string).
>> This is also represented in the RDF APIs, including Clerezza's RDF model.
>> Solr also provides a lot of functionality to improve the indexing and 
>> searching for natural language texts. Therefore the correct declaration of 
>> natural language texts and string values is of importance for getting the 
>> expected search results.
>> For natural language texts the Solr schema.xml used by the SolrYard defines 
>> a fieldType that uses the WhitespaceTokenizer, StopFilterFactory, 
>> WordDelimiterFilter and LowerCaseFilter. For English texts also the 
>> SnowballPorterFilter (stemming) is used.
>> In contrast, string fields do not use any Tokenizer.
>> The Problem:
>> A lot of developers of applications that produce RDF data do not correctly 
>> use the RDF APIs. It is often the case that TypedLiterals with the data type 
>> xsd:string are used to create literals representing natural language texts. 
>> This is often because typically RDF APIs provide some kind of LiteralFactory 
>> to create RDF Literals for Java Objects. So passing a Java String instance
>> representing a natural language text will create a TypedLiteral with the 
>> data type xsd:string. Even the Stanbol Enhancer is no exception to that 
>> because it also creates TypedLiterals holding natural language texts! 
>> Developers usually only use PlainLiterals if there is a requirement to 
>> specify the language.
>> The conclusion is that components MUST NOT assume that string values do not
>> represent natural language texts. However, they also cannot assume that all
>> string values are in fact natural language texts.
>> The best solution is to let users define how to interpret the values when
>> they interact with the data (at query time).
>> Old Implementation:
>> Previous to this change the SolrYard indexed "natural language text" and
>> "string" values differently.
>> String values for a field were stored with the prefix "str" without any
>> processing.
>> Natural language texts were stored with the prefix "@{lang}" (e.g. "@en"
>> for English texts, "@" for texts without a language) and processed by
>> the tokenizer and filters described above. In addition, texts were also
>> stored within a field with the prefix "_!@" that combined all natural text
>> values of all languages.
>> To include string values in search results, queries for natural language
>> texts were created to search also within the "str" field. Here is an
>> example of a query for "Rupert" within the field "rdfs:label":
>>    "(_\!@/rdfs\:label/:Rupert OR str/rdfs\:label/:Rupert)"
>> However, this had one important shortcoming. The second term of the query
>> searched within a field that is not suited for natural language text
>> searches. To describe that in more detail, let's assume the value "Rupert
>> Westenthaler" is defined in the following two ways:
>> (1) defined as "rdfs:label" -> "Rupert Westenthaler" (PlainLiteral) would
>> end up as two tokens "rupert" and "westenthaler" within the "@/rdfs:label/"
>> and the "_!@/rdfs:label/" fields.
>> (2) defined as "rdfs:label" -> "Rupert Westenthaler"^^xsd:string
>> (TypedLiteral) would end up as one token "Rupert Westenthaler" within the
>> "str/rdfs:label/" field.
>> In case (1) the above query would select the document; in case (2) it
>> would not. This is because the query expects the values it searches to be
>> indexed as natural language text, but the "str/rdfs:label/" field does
>> not fulfill this requirement.
>> Solution:
>> The solution is to change the indexing to index string values also within
>> the "_!@"-field. This means that searches within that field assume that all
>> string values do actually represent natural language texts. Searches for
>> string values need to use the "str"-field. This ensures that string value
>> searches (e.g. for an ISBN number) will still work as intended, while
>> searches for natural language texts also have access to string values.
>> As a positive side effect, natural language searches will no longer need to
>> search in two different fields (meaning the OR clause shown in the example
>> above is no longer needed).
>> Additional Note:
>> It would also be possible to index natural language text values without a
>> defined language within the string field. This would remove the assumption
>> that each natural language text value does in fact represent natural text
>> and not a string. However, until someone can point to real world cases where
>> datasets wrongly use PlainLiterals instead of TypedLiterals with the data
>> type xsd:string, there is no practical advantage to that.
>
> --
> This message is automatically generated by JIRA.
> -
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>
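
To illustrate the indexing change described in the quoted issue, here is a minimal Python sketch. It is not the SolrYard code: the analyzer only splits on whitespace and lower-cases (no stop words, word delimiters or stemming), and the names analyze_text, index_literal, text_query_fields and matches are hypothetical. Only the field prefixes ("str", "@", "_!@") are taken from the issue text.

```python
# Simplified model of the field layout described in STANBOL-89.
# Assumption: a toy analyzer stands in for the Solr fieldType used for
# natural language text; all function names here are hypothetical.

def analyze_text(value):
    """Split on whitespace and lower-case, like a very reduced text analyzer."""
    return [token.lower() for token in value.split()]


def index_literal(field, value, lang=None, is_string=False, new_behaviour=True):
    """Return a mapping {index_field_name: [tokens]} for a single literal."""
    doc = {}
    if is_string:
        # String values are kept as one unprocessed token in the "str" field.
        doc["str/" + field + "/"] = [value]
        if new_behaviour:
            # The change: string values are also indexed in the merged text field.
            doc["_!@/" + field + "/"] = analyze_text(value)
    else:
        tokens = analyze_text(value)
        # Language specific field ("@" alone when no language is given).
        doc["@" + (lang or "") + "/" + field + "/"] = tokens
        # Merged field combining the text values of all languages.
        doc["_!@/" + field + "/"] = tokens
    return doc


def text_query_fields(field, new_behaviour=True):
    """Fields a natural language text query has to search."""
    if new_behaviour:
        return ["_!@/" + field + "/"]
    # Old behaviour: OR over the merged text field and the string field.
    return ["_!@/" + field + "/", "str/" + field + "/"]


def matches(doc, field, term, new_behaviour=True):
    """True if the term hits one of the fields searched by a text query."""
    for index_field in text_query_fields(field, new_behaviour):
        tokens = doc.get(index_field, [])
        if index_field.startswith("str/"):
            # Untokenized field: the term must equal the whole stored value.
            if term in tokens:
                return True
        elif term.lower() in tokens:
            return True
    return False


plain = index_literal("rdfs:label", "Rupert Westenthaler")
typed_old = index_literal("rdfs:label", "Rupert Westenthaler",
                          is_string=True, new_behaviour=False)
typed_new = index_literal("rdfs:label", "Rupert Westenthaler",
                          is_string=True, new_behaviour=True)

print(matches(plain, "rdfs:label", "Rupert", new_behaviour=False))
print(matches(typed_old, "rdfs:label", "Rupert", new_behaviour=False))
print(matches(typed_new, "rdfs:label", "Rupert", new_behaviour=True))
```

In this model the typed literal is only found by a text query once its value is also tokenized into the "_!@" field, while the "str" field still holds the unmodified value for exact lookups such as ISBN numbers.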



-- 
Fabian
