Hi

On 17 February 2011 16:28, Rupert Westenthaler
<[email protected]> wrote:
> Hi
>
>> There was a discussion a few weeks ago on the list about when closing
>> issues. As I understood, fixed issues should be set to "Resolved" and
>> issues will be closed when we do a release.
> I can remember this discussion but honestly had not thought about it when
> closing this issue.
>
> However after checking I noticed that "resolved" seems to be no
> longer an option when closing an issue.
>
> The current options are
> - fixed
> - won't fix
> - duplicate
> - invalid
> - incomplete
> - cannot reproduce
> - later
> - not a problem
>
> So I suggest to use "fixed" in future

+1

> best
> Rupert
>
> On Thu, Feb 17, 2011 at 3:07 PM, Fabian Christ
> <[email protected]> wrote:
>> Hi,
>>
>> this issue is "closed" with resolution "Fixed".
>>
>> There was a discussion a few weeks ago on the list about when closing
>> issues. As I understood, fixed issues should be set to "Resolved" and
>> issues will be closed when we do a release.
>>
>> - Fabian
>>
>> 2011/2/16 Rupert Westenthaler (JIRA) <[email protected]>:
>>>
>>> [
>>> https://issues.apache.org/jira/browse/STANBOL-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>>> ]
>>>
>>> Rupert Westenthaler closed STANBOL-89.
>>> --------------------------------------
>>>
>>> Resolution: Fixed
>>>
>>> Fixed with Revision 1071231
>>>
>>> This change does invalidate old indexes, because text searches within
>>> string fields had not really worked before.
>>> However to benefit from these changes one would need to update the indices.
>>>
>>>> SolrYard uses string field for natural text queries
>>>> ---------------------------------------------------
>>>>
>>>> Key: STANBOL-89
>>>> URL: https://issues.apache.org/jira/browse/STANBOL-89
>>>> Project: Stanbol
>>>> Issue Type: Bug
>>>> Components: Entity Hub
>>>> Reporter: Rupert Westenthaler
>>>> Assignee: Rupert Westenthaler
>>>> Priority: Minor
>>>>
>>>> This describes a change to the way the SolrYard indexes values with the
>>>> data type xsd:string in order to improve the support for natural language
>>>> text searches for such values. This change will remove a wrong assumption
>>>> present in the current implementation. Details below!
>>>> Background:
>>>> The Entityhub distinguishes "natural language text" from normal values
>>>> such as integers, floats, dates and string values. This is mainly because
>>>> one might want to process natural language differently than normal string
>>>> values. E.g. when processing natural language text one might want to use
>>>> things like white space separators, stop word filters and/or stemming, but
>>>> for ISBN numbers, article numbers, postal codes using such algorithms will
>>>> lead to unwanted effects.
>>>> This distinction is not specific to the Entityhub, but is also present
>>>> within RDF. RDF defines "PlainLiterals" (with an optional xml:lang
>>>> attribute) used to represent natural language text and "TypedLiterals"
>>>> (with an optional xsd data type) to represent other values (including
>>>> xsd:string). This is also represented in the RDF APIs incl. Clerezza's RDF
>>>> model.
>>>> Solr also provides a lot of functionality to improve the indexing and
>>>> searching for natural language texts. Therefore the correct declaration of
>>>> natural language texts and string values is of importance for getting the
>>>> expected search results.
>>>> For natural language texts the Solr schema.xml used by the SolrYard
>>>> defines a fieldType that uses the WhitespaceTokenizer, StopFilterFactory,
>>>> WordDelimiterFilter and LowerCaseFilter. For English texts also the
>>>> SnowballPorterFilter (stemming) is used.
>>>> In contrast to that, string fields do not use any Tokenizer.
>>>> The Problem:
>>>> A lot of developers of applications that produce RDF data do not correctly
>>>> use the RDF APIs. It is often the case that TypedLiterals with the data
>>>> type xsd:string are used to create literals representing natural language
>>>> texts. This is often because typically RDF APIs provide some kind of
>>>> LiteralFactory to create RDF Literals for Java Objects. So passing a Java
>>>> String instance representing a natural language text will create a
>>>> TypedLiteral with the data type xsd:string. Even the Stanbol Enhancer is
>>>> no exception to that because it also creates TypedLiterals holding natural
>>>> language texts! Developers usually only use PlainLiterals if there is a
>>>> requirement to specify the language.
>>>> The conclusion is that components MUST NOT assume that string values do
>>>> not represent natural language texts. However they also cannot assume
>>>> that all string values are in fact natural language texts.
>>>> The best solution to that is to let the users define how to interpret the
>>>> values when they interact with the data (at query time).
>>>> Old Implementation:
>>>> Prior to this change the SolrYard indexed "natural language text"s and
>>>> "string" values differently.
>>>> String values for a field were stored with the prefix "str" without any
>>>> processing.
>>>> Natural language texts were stored with the prefix "@{lang}" (e.g. "@en"
>>>> for English texts, "@" for texts without a language) and processed by
>>>> several tokenizers as described above.
>>>> In addition texts were also stored
>>>> within a field with the prefix "_!@" that combined all natural text values
>>>> of all languages.
>>>> To include string values in search results, queries for natural language
>>>> texts were created to search also within the "str" field. Here is an
>>>> example of a query for "Rupert" within the field "rdfs:label":
>>>> "(_\!@/rdfs\:label/:Rupert OR str/rdfs\:label/:Rupert)"
>>>> However this had one important shortcoming. The second term of the query
>>>> searched within a field that is not suited for natural language text
>>>> searches. To describe that in more detail let's assume the value "Rupert
>>>> Westenthaler" defined in the following two ways:
>>>> (1) defined as "rdfs:label" -> "Rupert Westenthaler" (PlainLiteral) would
>>>> end up as two tokens "rupert" and "westenthaler" within the "@/rdfs:label/"
>>>> and the "_!@/rdfs:label/" fields.
>>>> (2) defined as "rdfs:label" -> "Rupert Westenthaler"^^xsd:string
>>>> (TypedLiteral) would end up as one token "Rupert Westenthaler" within the
>>>> "str/rdfs:label/" field.
>>>> With (1) the above query would select the document; in the second case it
>>>> would not. This is because the query assumes to search for natural
>>>> language values that are indexed in that way, but the "str/rdfs:label/"
>>>> field does not fulfill this requirement.
>>>> Solution:
>>>> The solution is to change the indexing to index string values also within
>>>> the "_!@"-field. This means that searches within that field assume that
>>>> all string values do actually represent natural language texts. Searches
>>>> for string values need to use the "str"-field. This ensures that string
>>>> value searches (e.g. for an ISBN number) will still work as intended while
>>>> searches for natural language texts also have access to string values.
>>>> As a positive side effect natural language searches will no longer need
>>>> to search in two different fields (meaning the OR clause as shown
>>>> above in the example is no longer needed).
>>>> Additional Note:
>>>> It would also be possible to index natural language text values without
>>>> defined language within the string field. This would remove the assumption
>>>> that each natural language text value does in fact represent natural text
>>>> and not a string. However until someone can point to real world cases
>>>> where datasets do wrongly use PlainLiterals instead of TypedLiterals with
>>>> the data type xsd:string there is no practical advantage to that.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> -
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>
>>
>> --
>> Fabian
>>
>
> --
> | Rupert Westenthaler [email protected]
> | Bodenlehenstraße 11 ++43-699-11108907
> | A-5500 Bischofshofen
>
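
For anyone who wants to see the effect of the STANBOL-89 change without a Solr instance, here is a minimal Python sketch. It is not SolrYard code: the function names, the dict-based "index" and the analyzer (reduced to whitespace splitting plus lowercasing, no stop words or stemming) are assumptions made for illustration. Only the field prefixes "str", "@{lang}" and "_!@" are taken from the issue text.

```python
from collections import defaultdict

MERGED = "_!@"  # prefix of the field combining text values of all languages
STRING = "str"  # prefix of the unprocessed string field


def analyze(text):
    """Toy stand-in for the text fieldType: whitespace split plus lowercasing."""
    return [token.lower() for token in text.split()]


def index_literal(index, field, value, lang=None, typed_string=False,
                  strings_in_merged=False):
    """Add one literal to a toy index mapping field names to lists of terms."""
    if typed_string:
        # xsd:string values are stored as one unprocessed term.
        index[f"{STRING}/{field}/"].append(value)
        if strings_in_merged:
            # The change: string values are also analyzed into the merged field.
            index[f"{MERGED}/{field}/"].extend(analyze(value))
    else:
        tokens = analyze(value)
        index[f"@{lang or ''}/{field}/"].extend(tokens)
        index[f"{MERGED}/{field}/"].extend(tokens)


def text_search(index, field, word, also_string_field):
    """Return True if a natural language query for `word` matches."""
    # The query term is lowercased, mirroring the toy analyzer above.
    hit = word.lower() in index[f"{MERGED}/{field}/"]
    if also_string_field:
        # Old query shape: "... OR str/<field>/:<word>", matched against whole values.
        hit = hit or word in index[f"{STRING}/{field}/"]
    return hit


# Old behaviour: the TypedLiteral only lands in the "str" field.
old = defaultdict(list)
index_literal(old, "rdfs:label", "Rupert Westenthaler", typed_string=True)
print(text_search(old, "rdfs:label", "Rupert", also_string_field=True))

# New behaviour: the same literal is also analyzed into the "_!@" field.
new = defaultdict(list)
index_literal(new, "rdfs:label", "Rupert Westenthaler", typed_string=True,
              strings_in_merged=True)
print(text_search(new, "rdfs:label", "Rupert", also_string_field=False))
```

In this sketch the first search misses the document even with the OR clause on the "str" field, while the second finds it with a single-field query. Exact string lookups (e.g. for an ISBN number) would still go against the "str" field, which keeps the whole value as one term.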
--
Enrico Daga

--
http://www.enridaga.net
skype: enri-pan
