Hi

> There was a discussion a few weeks ago on the list about when closing
> issues. As I understood, fixed issues should be set to "Resolved" and
> issues will be closed when we do a release.
I remember this discussion but honestly did not think about it when
closing this issue.

However, after checking I noticed that "Resolved" seems to no longer be
an option when closing an issue.

The current options are
 - fixed
 - won't fix
 - duplicate
 - invalid
 - incomplete
 - cannot reproduce
 - later
 - not a problem

So I suggest using "fixed" in the future.

best
Rupert

On Thu, Feb 17, 2011 at 3:07 PM, Fabian Christ
<[email protected]> wrote:
> Hi,
>
> this issue is "closed" with resolution "Fixed".
>
> There was a discussion a few weeks ago on the list about when closing
> issues. As I understood, fixed issues should be set to "Resolved" and
> issues will be closed when we do a release.
>
>  - Fabian
>
> 2011/2/16 Rupert Westenthaler (JIRA) <[email protected]>:
>>
>>     [ 
>> https://issues.apache.org/jira/browse/STANBOL-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>>  ]
>>
>> Rupert Westenthaler closed STANBOL-89.
>> --------------------------------------
>>
>>    Resolution: Fixed
>>
>> Fixed with Revision 1071231
>>
>> This change invalidates old indexes, because text searches within string 
>> fields had not really worked before.
>> However, to benefit from these changes one needs to update the indices.
>>
>>> SolrYard uses string field for natural text queries
>>> ---------------------------------------------------
>>>
>>>                 Key: STANBOL-89
>>>                 URL: https://issues.apache.org/jira/browse/STANBOL-89
>>>             Project: Stanbol
>>>          Issue Type: Bug
>>>          Components: Entity Hub
>>>            Reporter: Rupert Westenthaler
>>>            Assignee: Rupert Westenthaler
>>>            Priority: Minor
>>>
>>> This describes a change to the way the SolrYard indexes values with the 
>>> data type xsd:string in order to improve the support for natural language 
>>> text searches for such values. This change will remove a wrong assumption 
>>> present in the current implementation. Details below!
>>> Background:
>>> The Entityhub distinguishes "natural language text" from normal values such 
>>> as integers, floats, dates and string values. This is mainly because one 
>>> might want to process natural language differently than normal string 
>>> values, e.g. when processing natural language text one might want to use 
>>> things like whitespace separators, stop word filters and/or stemming, but 
>>> for ISBN numbers, article numbers or postal codes using such algorithms 
>>> would lead to unwanted effects.
>>> This distinction is nothing special to the Entityhub, but is also present 
>>> within RDF. RDF defines "PlainLiterals" (with an optional xml:lang 
>>> attribute) used to represent natural language text and "TypedLiterals" 
>>> (with an optional xsd data type) to represent other values (including 
>>> xsd:string). This is also represented in the RDF APIs, incl. Clerezza's RDF 
>>> model.
>>> Solr also provides a lot of functionality to improve the indexing and 
>>> searching for natural language texts. Therefore the correct declaration of 
>>> natural language texts and string values is of importance for getting the 
>>> expected search results.
>>> For natural language texts the Solr schema.xml used by the SolrYard defines 
>>> a fieldType that uses the WhitespaceTokenizer, StopFilterFactory, 
>>> WordDelimiterFilter and LowerCaseFilter. For English texts the 
>>> SnowballPorterFilter (stemming) is also used.
>>> In contrast, string fields do not use any Tokenizer.
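As a rough illustration of the analyzer chain described above (this is a sketch, not the actual schema.xml shipped with the SolrYard; the fieldType names are assumed), such declarations could look like:

```xml
<!-- natural language text: tokenized and filtered as described above -->
<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
<!-- string fields use solr.StrField, which applies no tokenizer at all -->
<fieldType name="string" class="solr.StrField"/>
```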
>>> The Problem:
>>> A lot of developers of applications that produce RDF data do not correctly 
>>> use the RDF APIs. It is often the case that TypedLiterals with the data 
>>> type xsd:string are used to create literals representing natural language 
>>> texts. This is often because RDF APIs typically provide some kind of 
>>> LiteralFactory to create RDF Literals for Java objects, so parsing a Java 
>>> String instance representing a natural language text will create a 
>>> TypedLiteral with the data type xsd:string. Even the Stanbol Enhancer is no 
>>> exception to that, because it also creates TypedLiterals holding natural 
>>> language texts! Developers usually only use PlainLiterals if there is a 
>>> requirement to specify the language.
>>> The conclusion is that components MUST NOT assume that string values do not 
>>> represent natural language texts. However, they can also not assume that all 
>>> string values are in fact natural language texts.
>>> The best solution is to let users define how to interpret the values when 
>>> they interact with the data (at query time).
>>> Old Implementation:
>>> Prior to this change the SolrYard indexed "natural language text" and 
>>> "string" values differently.
>>> String values for a field were stored with the prefix "str" without any 
>>> processing.
>>> Natural language texts were stored with the prefix "@{lang}" (e.g. "@en" 
>>> for English texts, "@" for texts without a language) and processed by 
>>> several tokenizers as described above. In addition, texts were also stored 
>>> within a field with the prefix "_!@" that combined all natural text values 
>>> of all languages.
>>> To include string values in search results, queries for natural language 
>>> texts were created to also search within the "str" field. Here is an 
>>> example of a query for "Rupert" within the field "rdfs:label":
>>>    "(_\!@/rdfs\:label/:Rupert OR str/rdfs\:label/:Rupert)"
>>> However, this had one important shortcoming. The second term of the query 
>>> searched within a field that is not suited for natural language text 
>>> searches. To describe that in more detail, let's assume the value "Rupert 
>>> Westenthaler" defined in the following two ways:
>>> (1) defined as "rdfs:label" -> "Rupert Westenthaler" (PlainLiteral) would 
>>> end up as two tokens "rupert" and "westenthaler" within the "@/rdfs:label/" 
>>> and the "_!@/rdfs:label/" fields.
>>> (2) defined as "rdfs:label" -> "Rupert Westenthaler"^^xsd:string 
>>> (TypedLiteral) would end up as one token "Rupert Westenthaler" within the 
>>> "str/rdfs:label/" field.
>>> With (1) the above query would select the document; with (2) it would not. 
>>> This is because the query assumes it searches natural language values that 
>>> are indexed in that way, but the "str/rdfs:label/" field does not fulfill 
>>> these requirements.
>>> Solution:
>>> The solution is to change the indexing to index string values also within 
>>> the "_!@" field. This means that searches within that field assume that 
>>> all string values do actually represent natural language texts. Searches 
>>> for string values need to use the "str" field. This way string value 
>>> searches (e.g. for an ISBN number) will still work as intended, while 
>>> searches for natural language texts also have access to string values.
>>> As a positive side effect, natural language searches will no longer need to 
>>> search in two different fields (meaning the OR clause shown above in 
>>> the example is no longer needed).
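Continuing the simplified sketch, the changed indexing can be imitated by also feeding xsd:string values through the natural-language analyzer into the combined field, so a single-field query suffices (document names here are hypothetical):

```python
def text_analyzer(value):
    # simplified natural-language analysis: whitespace tokens, lower-cased
    return [token.lower() for token in value.split()]

# After the change, both literal kinds are analyzed into the combined
# "_!@" field; before, the xsd:string value only went to the raw "str" field.
documents = {
    "doc-plain":  "Rupert Westenthaler",   # was a PlainLiteral
    "doc-string": "Rupert Westenthaler",   # was "..."^^xsd:string
}
combined_field = {doc: text_analyzer(value) for doc, value in documents.items()}

# a single-field query now finds both documents, no OR clause needed
hits = sorted(doc for doc, tokens in combined_field.items() if "rupert" in tokens)
print(hits)  # ['doc-plain', 'doc-string']
```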
>>> Additional Note:
>>> It would also be possible to index natural language text values without a 
>>> defined language within the string field. This would remove the assumption 
>>> that each natural language text value does in fact represent natural text 
>>> and not a string. However, until someone can point to real-world cases where 
>>> datasets wrongly use PlainLiterals instead of TypedLiterals with the 
>>> data type xsd:string there is no practical advantage to that.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>>
>>
>
>
>
> --
> Fabian
>



-- 
| Rupert Westenthaler                            [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
