Re: [jira] Closed: (STANBOL-89) SolrYard uses string field for natural text queries

Enrico Daga Thu, 17 Feb 2011 07:32:07 -0800

Hi

On 17 February 2011 16:28, Rupert Westenthaler <[email protected]> wrote:
> Hi
>
>> There was a discussion a few weeks ago on the list about when to close
>> issues. As I understood it, fixed issues should be set to "Resolved" and
>> closed only when we do a release.
> I remember this discussion, but honestly I did not think about it when
> closing this issue.
>
> However, after checking I noticed that "resolved" no longer seems to be
> an option when closing an issue.
>
> The current options are
>  - fixed
>  - won't fix
>  - duplicate
>  - invalid
>  - incomplete
>  - cannot reproduce
>  - later
>  - not a problem
>
> So I suggest using "fixed" in future.
+1
>
> best
> Rupert
>
> On Thu, Feb 17, 2011 at 3:07 PM, Fabian Christ
> <[email protected]> wrote:
>> Hi,
>>
>> this issue is "closed" with resolution "Fixed".
>>
>> There was a discussion a few weeks ago on the list about when to close
>> issues. As I understood it, fixed issues should be set to "Resolved" and
>> closed only when we do a release.
>>
>>  - Fabian
>>
>> 2011/2/16 Rupert Westenthaler (JIRA) <[email protected]>:
>>>
>>>     [ https://issues.apache.org/jira/browse/STANBOL-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>>
>>> Rupert Westenthaler closed STANBOL-89.
>>> --------------------------------------
>>>
>>>    Resolution: Fixed
>>>
>>> Fixed with Revision 1071231
>>>
>>> This change does not invalidate old indexes, because text searches within
>>> string fields had not really worked before.
>>> However, to benefit from this change one would need to update the indices.
>>>
>>>> SolrYard uses string field for natural text queries
>>>> ---------------------------------------------------
>>>>
>>>>                 Key: STANBOL-89
>>>>                 URL: https://issues.apache.org/jira/browse/STANBOL-89
>>>>             Project: Stanbol
>>>>          Issue Type: Bug
>>>>          Components: Entity Hub
>>>>            Reporter: Rupert Westenthaler
>>>>            Assignee: Rupert Westenthaler
>>>>            Priority: Minor
>>>>
>>>> This describes a change to the way the SolrYard indexes values with the
>>>> data type xsd:string in order to improve the support for natural language
>>>> text searches for such values. This change removes a wrong assumption
>>>> present in the current implementation. Details below.
>>>> Background:
>>>> The Entityhub distinguishes "natural language text" from normal values
>>>> such as integers, floats, dates and strings. This is mainly because
>>>> one might want to process natural language differently than normal string
>>>> values. For example, when processing natural language text one might want
>>>> to use things like white space separators, stop word filters and/or
>>>> stemming, but for ISBN numbers, article numbers or postal codes such
>>>> algorithms lead to unwanted effects.
>>>> This distinction is not specific to the Entityhub; it is also present
>>>> within RDF. RDF defines "PlainLiterals" (with an optional xml:lang
>>>> attribute) used to represent natural language text and "TypedLiterals"
>>>> (with an optional xsd data type) to represent other values (including
>>>> xsd:string). This is also reflected in the RDF APIs, including
>>>> Clerezza's RDF model.
>>>> Solr also provides a lot of functionality to improve the indexing and 
>>>> searching for natural language texts. Therefore the correct declaration of 
>>>> natural language texts and string values is of importance for getting the 
>>>> expected search results.
>>>> For natural language texts the Solr schema.xml used by the SolrYard 
>>>> defines a fieldType that uses the WhitespaceTokenizer, StopFilterFactory, 
>>>> WordDelimiterFilter and LowerCaseFilter. For English texts also the 
>>>> SnowballPorterFilter (stemming) is used.
>>>> In contrast, string fields do not use any Tokenizer.
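
To make the difference between the two field types concrete, here is a rough sketch in plain Python. It only mimics the idea (whitespace tokenization, stop word removal and lower casing versus no processing at all). It is not Solr code, it leaves out the word delimiter and stemming steps, and the stop word list and function names are made up for illustration.

```python
# Illustrative sketch only: mimics the two analysis styles in plain Python.
# Not Solr code; STOP_WORDS and both function names are hypothetical.
STOP_WORDS = {"a", "an", "the", "of", "and"}  # tiny made-up list


def analyze_text(value):
    """Text field: split on whitespace, lower case, drop stop words."""
    tokens = [t.lower() for t in value.split()]
    return [t for t in tokens if t not in STOP_WORDS]


def analyze_string(value):
    """String field: the whole value is kept as one unprocessed token."""
    return [value]


if __name__ == "__main__":
    print(analyze_text("Rupert Westenthaler"))    # two lower-cased tokens
    print(analyze_string("Rupert Westenthaler"))  # one token, unchanged
```

With this simplification a postal code or ISBN survives untouched in the string field, while a label is broken into searchable words in the text field.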
>>>> The Problem:
>>>> A lot of developers of applications that produce RDF data do not use the
>>>> RDF APIs correctly. It is often the case that TypedLiterals with the data
>>>> type xsd:string are used to create literals representing natural language
>>>> texts. This is often because RDF APIs typically provide some kind of
>>>> LiteralFactory to create RDF Literals for Java Objects. So passing a Java
>>>> String instance representing a natural language text will create a
>>>> TypedLiteral with the data type xsd:string. Even the Stanbol Enhancer is
>>>> no exception, because it also creates TypedLiterals holding natural
>>>> language texts! Developers usually only use PlainLiterals if there is a
>>>> requirement to specify the language.
>>>> The conclusion is that components MUST NOT assume that string values do
>>>> not represent natural language texts. However, they also cannot assume
>>>> that all string values are in fact natural language texts.
>>>> The best solution is to let users define how to interpret the values
>>>> when they interact with the data (at query time).
>>>> Old Implementation:
>>>> Prior to this change the SolrYard indexed "natural language text" and
>>>> "string" values differently.
>>>> String values for a field were stored with the prefix "str" without any
>>>> processing.
>>>> Natural language texts were stored with the prefix "@{lang}" (e.g. "@en"
>>>> for English texts, "@" for texts without a language) and processed by
>>>> the tokenizer and filters described above. In addition, texts were also
>>>> stored within a field with the prefix "_!@" that combined all natural
>>>> text values of all languages.
>>>> To include string values in search results for natural language text
>>>> queries, such queries were created to also search within the "str"
>>>> field. Here is an example of a query for "Rupert" within the field
>>>> "rdfs:label":
>>>>    "(_\!@/rdfs\:label/:Rupert OR str/rdfs\:label/:Rupert)"
>>>> However, this had one important shortcoming: the second term of the query
>>>> searched within a field that is not suited for natural language text
>>>> searches. To describe that in more detail, let's assume the value "Rupert
>>>> Westenthaler" is defined in the following two ways:
>>>> (1) defined as "rdfs:label" -> "Rupert Westenthaler" (PlainLiteral) would
>>>> end up as two tokens "rupert" and "westenthaler" within the "@/rdfs:label/"
>>>> and the "_!@/rdfs:label/" fields.
>>>> (2) defined as "rdfs:label" -> "Rupert Westenthaler"^^xsd:string
>>>> (TypedLiteral) would end up as one token "Rupert Westenthaler" within the
>>>> "str/rdfs:label/" field.
>>>> In case (1) the above query would select the document; in case (2) it
>>>> would not. This is because the query assumes it is searching natural
>>>> language values that are indexed accordingly, but the "str/rdfs:label/"
>>>> field does not fulfill this requirement.
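
The mismatch between the two cases can be replayed with a small Python sketch. This is only a simplified model of the old behaviour as described in the issue, not SolrYard code: the analysis is reduced to lower casing plus whitespace splitting, and the function names are hypothetical.

```python
# Simplified model of the OLD indexing scheme; names are hypothetical.
def index_old(value, is_plain_literal, lang=""):
    """Return {field name: tokens} for one rdfs:label value."""
    if is_plain_literal:
        tokens = value.lower().split()  # analyzed as natural language text
        return {
            "@%s/rdfs:label/" % lang: tokens,  # language specific field
            "_!@/rdfs:label/": tokens,         # merged field for all languages
        }
    return {"str/rdfs:label/": [value]}  # single unprocessed token


def matches_old(doc, term):
    """Emulate (_!@/rdfs:label/:term OR str/rdfs:label/:term)."""
    in_text = term.lower() in doc.get("_!@/rdfs:label/", [])
    in_str = term in doc.get("str/rdfs:label/", [])  # exact match only
    return in_text or in_str


if __name__ == "__main__":
    plain = index_old("Rupert Westenthaler", is_plain_literal=True)
    typed = index_old("Rupert Westenthaler", is_plain_literal=False)
    print(matches_old(plain, "Rupert"))  # case (1)
    print(matches_old(typed, "Rupert"))  # case (2)
```

In this model the "str" clause can only ever match the complete value, so a single word from a typed literal is never found.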
>>>> Solution:
>>>> The solution is to change the indexing so that string values are also
>>>> indexed within the "_!@"-field. This means that searches within that field
>>>> assume that all string values do actually represent natural language
>>>> texts. Searches for string values need to use the "str"-field. This way
>>>> string value searches (e.g. for an ISBN number) still work as intended,
>>>> while searches for natural language texts also have access to string
>>>> values.
>>>> As a positive side effect, natural language searches no longer need
>>>> to search in two different fields (meaning the OR clause shown in the
>>>> example above is no longer needed).
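
A rough Python sketch of the new scheme, under the same simplifying assumptions (analysis reduced to lower casing plus whitespace splitting, hypothetical function names, not the actual SolrYard implementation): string values go into both the "str" field and the merged "_!@" field, so a text query needs only one field.

```python
# Simplified model of the NEW indexing scheme; names are hypothetical.
def index_new(value, is_plain_literal, lang=""):
    """Return {field name: tokens} for one rdfs:label value."""
    tokens = value.lower().split()  # stand-in for the text analysis
    fields = {"_!@/rdfs:label/": tokens}  # now filled for strings as well
    if is_plain_literal:
        fields["@%s/rdfs:label/" % lang] = tokens
    else:
        fields["str/rdfs:label/"] = [value]  # exact value, unprocessed
    return fields


def matches_text(doc, term):
    """Natural language query: only the merged field, no OR clause."""
    return term.lower() in doc["_!@/rdfs:label/"]


def matches_string(doc, value):
    """String query: exact match against the "str" field."""
    return value in doc.get("str/rdfs:label/", [])


if __name__ == "__main__":
    typed = index_new("Rupert Westenthaler", is_plain_literal=False)
    print(matches_text(typed, "Rupert"))
    print(matches_string(typed, "Rupert Westenthaler"))
```

Exact lookups (an ISBN, say) keep working through the "str" field, while the text query also finds words inside typed literals.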
>>>> Additional Note:
>>>> It would also be possible to index natural language text values without a
>>>> defined language within the string field. This would remove the assumption
>>>> that each natural language text value does in fact represent natural text
>>>> and not a string. However, until someone can point to real-world cases
>>>> where datasets wrongly use PlainLiterals instead of TypedLiterals with
>>>> the data type xsd:string, there is no practical advantage to that.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> -
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>
>>>
>>>
>>
>>
>>
>> --
>> Fabian
>>
>
>
>
> --
> | Rupert Westenthaler                            [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>




-- 
Enrico Daga

--
http://www.enridaga.net
skype: enri-pan
