Hi

> There was a discussion a few weeks ago on the list about when closing
> issues. As I understood, fixed issues should be set to "Resolved" and
> issues will be closed when we do a release.

I can remember this discussion but honestly did not think about it when closing this issue.
However, after checking I noticed that "Resolved" seems to be no longer an option when closing an issue. The current options are:
- fixed
- won't fix
- duplicate
- invalid
- incomplete
- cannot reproduce
- later
- not a problem

So I suggest using "fixed" in the future.

best
Rupert

On Thu, Feb 17, 2011 at 3:07 PM, Fabian Christ <[email protected]> wrote:
> Hi,
>
> this issue is "closed" with resolution "Fixed".
>
> There was a discussion a few weeks ago on the list about when closing
> issues. As I understood, fixed issues should be set to "Resolved" and
> issues will be closed when we do a release.
>
> - Fabian
>
> 2011/2/16 Rupert Westenthaler (JIRA) <[email protected]>:
>>
>> [
>> https://issues.apache.org/jira/browse/STANBOL-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> ]
>>
>> Rupert Westenthaler closed STANBOL-89.
>> --------------------------------------
>>
>> Resolution: Fixed
>>
>> Fixed with Revision 1071231
>>
>> This change does invalidate old indexes, because text searches within string
>> fields had not really worked before.
>> However to benefit from these changes one would need to update the indices.
>>
>>> SolrYard uses string field for natural text queries
>>> ---------------------------------------------------
>>>
>>> Key: STANBOL-89
>>> URL: https://issues.apache.org/jira/browse/STANBOL-89
>>> Project: Stanbol
>>> Issue Type: Bug
>>> Components: Entity Hub
>>> Reporter: Rupert Westenthaler
>>> Assignee: Rupert Westenthaler
>>> Priority: Minor
>>>
>>> This describes a change to the way the SolrYard does index values with the
>>> data type xsd:string in order to improve the support for natural language
>>> text searches for such values. This change will remove a wrong assumption
>>> present in the current implementation. Details below!
>>> Background:
>>> The Entityhub distinguishes "natural language text" from normal values such
>>> as integer, floats, dates and string values.
>>> This is mainly because one might want to process natural language
>>> differently than normal string values. E.g. when processing natural
>>> language text one might want to use things like white space separators,
>>> stop word filters and/or stemming, but for ISBN numbers, article numbers
>>> or postal codes using such algorithms will lead to unwanted effects.
>>> This distinction is nothing special to the Entityhub, but also present
>>> within RDF. RDF defines "PlainLiterals" (with an optional xml:lang
>>> attribute) used to represent natural language text and "TypedLiterals"
>>> (with an optional xsd data type) to represent other values (including
>>> xsd:string). This is also represented in the RDF APIs incl. Clerezza's RDF
>>> model.
>>> Solr also provides a lot of functionality to improve the indexing and
>>> searching for natural language texts. Therefore the correct declaration of
>>> natural language texts and string values is of importance for getting the
>>> expected search results.
>>> For natural language texts the Solr schema.xml used by the SolrYard defines
>>> a fieldType that uses the WhitespaceTokenizer, StopFilterFactory,
>>> WordDelimiterFilter and LowerCaseFilter. For English texts also the
>>> SnowballPorterFilter (stemming) is used.
>>> In contrast to that, string fields do not use any Tokenizer.
>>> The Problem:
>>> A lot of developers of applications that produce RDF data do not correctly
>>> use the RDF APIs. It is often the case that TypedLiterals with the data
>>> type xsd:string are used to create literals representing natural language
>>> texts. This is often because typically RDF APIs provide some kind of
>>> LiteralFactory to create RDF Literals for Java Objects. So passing a Java
>>> String instance representing a natural language text will create a
>>> TypedLiteral with the data type xsd:string. Even the Stanbol Enhancer is no
>>> exception to that because it also creates TypedLiterals holding natural
>>> language texts!
>>> Developers usually only use PlainLiterals if there is a
>>> requirement to specify the language.
>>> The Conclusion is that components MUST NOT assume that string values do not
>>> represent natural language texts. However they also cannot assume that all
>>> string values are in fact natural language texts.
>>> The best solution to that is to let the user define how to interpret the
>>> values when they interact with the data (at query time).
>>> Old Implementation:
>>> Previous to this change the SolrYard indexed "natural language text"s and
>>> "string" values differently.
>>> String values for a field were stored with the prefix "str" without any
>>> processing.
>>> Natural language texts were stored with the prefix "@{lang}" (e.g. "@en"
>>> for English texts, "@" for texts without a language) and processed by
>>> several tokenizers as described above. In addition texts were also stored
>>> within a field with the prefix "_!@" that combined all natural text values
>>> of all languages.
>>> To include string values in search results for natural language text
>>> queries, queries for natural language texts were created to search also
>>> within the "str" field. Here an example for a Query for "Rupert" within the
>>> field "rdfs:label":
>>> "(_\!@/rdfs\:label/:Rupert OR str/rdfs\:label/:Rupert)"
>>> However this had one important shortcoming. The second term of the query
>>> searched within a field that is not suited for natural language text
>>> searches. To describe that in more detail let's assume the value "Rupert
>>> Westenthaler" defined in the following two ways:
>>> (1) defined as "rdfs:label" -> "Rupert Westenthaler" (PlainLiteral) would
>>> end up as two tokens "rupert" and "westenthaler" within the "@/rdfs:label/"
>>> and the "_!@/rdfs:label/" fields.
>>> (2) defined as "rdfs:label" -> "Rupert Westenthaler"^^xsd:string
>>> (TypedLiteral) would end up as one token "Rupert Westenthaler" within the
>>> "str/rdfs:label/" field.
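As a rough illustration of cases (1) and (2), the following sketch mimics the two kinds of fields in plain Python. It does not use Solr; `analyze_text`, `analyze_string` and `matches` are hypothetical helpers, and the text analysis is reduced to whitespace splitting plus lowercasing, which is only an approximation of the filter chain named in the issue.

```python
def analyze_text(value):
    """Assumed stand-in for the natural language field type:
    split on whitespace and lowercase every token."""
    return [token.lower() for token in value.split()]


def analyze_string(value):
    """Assumed stand-in for the untokenized string field:
    the whole value is kept as a single term."""
    return [value]


def matches(indexed_terms, query, analyzer):
    """A query matches if every analyzed query term is an indexed term."""
    return all(term in indexed_terms for term in analyzer(query))


label = "Rupert Westenthaler"

# Case (1): PlainLiteral, indexed in the tokenized text field.
plain_literal_terms = analyze_text(label)
# Case (2): xsd:string TypedLiteral, indexed in the "str" field.
typed_literal_terms = analyze_string(label)

found_in_text_field = matches(plain_literal_terms, "Rupert", analyze_text)
found_in_str_field = matches(typed_literal_terms, "Rupert", analyze_string)
found_exact_in_str_field = matches(
    typed_literal_terms, "Rupert Westenthaler", analyze_string
)

print(found_in_text_field, found_in_str_field, found_exact_in_str_field)
```

In this sketch the single word "Rupert" is found only in the tokenized field, while the string field answers only to the complete value, which is the shortcoming of the second query term described in the issue.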
>>> With (1) the above query would select the document; in the second case it
>>> would not. This is because the query assumes to search for natural language
>>> values that are indexed in that way, but the "str/rdfs:label/" field does
>>> not fulfill this requirement.
>>> Solution:
>>> The solution is to change the indexing to index string values also within
>>> the "_!@"-field. This means that searches within that field assume that
>>> all string values do actually represent natural language texts. Searches
>>> for string values need to use the "str"-field. This ensures that string
>>> value searches (e.g. for an ISBN number) will still work as intended while
>>> searches for natural language texts also have access to string values.
>>> As a positive side effect natural language searches will no longer need to
>>> search in two different fields (meaning the OR clause as shown above in
>>> the example is no longer needed).
>>> Additional Note:
>>> It would also be possible to index natural language text values without
>>> defined language within the string field. This would remove the assumption
>>> that each natural language text value does in fact represent natural text
>>> and not a string. However until someone can point to real world cases where
>>> datasets do wrongly use PlainLiterals instead of TypedLiterals with the
>>> data type xsd:string there is no practical advantage to that.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>>
>>
>
>
>
> --
> Fabian
>

--
| Rupert Westenthaler [email protected]
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen
