Hi, this issue is "closed" with resolution "Fixed".
There was a discussion on the list a few weeks ago about when to close issues. As I understood it, fixed issues should be set to "Resolved", and issues are closed when we do a release.

- Fabian

2011/2/16 Rupert Westenthaler (JIRA) <[email protected]>:
>
> [
> https://issues.apache.org/jira/browse/STANBOL-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Rupert Westenthaler closed STANBOL-89.
> --------------------------------------
>
> Resolution: Fixed
>
> Fixed with Revision 1071231
>
> This change does invalidate old indexes, because text searches within string
> fields had not really worked before.
> However, to benefit from these changes one would need to update the indices.
>
>> SolrYard uses string field for natural text queries
>> ---------------------------------------------------
>>
>> Key: STANBOL-89
>> URL: https://issues.apache.org/jira/browse/STANBOL-89
>> Project: Stanbol
>> Issue Type: Bug
>> Components: Entity Hub
>> Reporter: Rupert Westenthaler
>> Assignee: Rupert Westenthaler
>> Priority: Minor
>>
>> This describes a change to the way the SolrYard indexes values with the
>> data type xsd:string in order to improve the support for natural language
>> text searches for such values. This change will remove a wrong assumption
>> present in the current implementation. Details below!
>> Background:
>> The Entityhub distinguishes "natural language text" from normal values such
>> as integers, floats, dates and string values. This is mainly because one
>> might want to process natural language differently than normal string
>> values, e.g. when processing natural language text one might want to use
>> things like white space separators, stop word filters and/or stemming, but
>> for ISBN numbers, article numbers or postal codes such algorithms will
>> lead to unwanted effects.
>> This distinction is nothing special to the Entityhub, but also present
>> within RDF.
>> RDF defines "PlainLiterals" (with an optional xml:lang
>> attribute) used to represent natural language text and "TypedLiterals" (with
>> an optional xsd data type) to represent other values (including xsd:string).
>> This is also represented in the RDF APIs incl. Clerezza's RDF model.
>> Solr also provides a lot of functionality to improve the indexing and
>> searching for natural language texts. Therefore the correct declaration of
>> natural language texts and string values is of importance for getting the
>> expected search results.
>> For natural language texts the Solr schema.xml used by the SolrYard defines
>> a fieldType that uses the WhitespaceTokenizer, StopFilterFactory,
>> WordDelimiterFilter and LowerCaseFilter. For English texts the
>> SnowballPorterFilter (stemming) is used as well.
>> In contrast to that, string fields do not use any tokenizer.
>> The Problem:
>> A lot of developers of applications that produce RDF data do not correctly
>> use the RDF APIs. It is often the case that TypedLiterals with the data type
>> xsd:string are used to create literals representing natural language texts.
>> This is often because RDF APIs typically provide some kind of LiteralFactory
>> to create RDF Literals for Java Objects. So passing a Java String instance
>> representing a natural language text will create a TypedLiteral with the
>> data type xsd:string. Even the Stanbol Enhancer is no exception to that,
>> because it also creates TypedLiterals holding natural language texts!
>> Developers usually only use PlainLiterals if there is a requirement to
>> specify the language.
>> The conclusion is that components MUST NOT assume that string values do not
>> represent natural language texts. However, they also cannot assume that all
>> string values are in fact natural language texts.
>> The best solution to that is to let the users define how to interpret the
>> values when they interact with the data (at query time).
>> Old Implementation:
>> Previous to this change the SolrYard indexed "natural language text" and
>> "string" values differently.
>> String values for a field were stored with the prefix "str" without any
>> processing.
>> Natural language texts were stored with the prefix "@{lang}" (e.g. "@en"
>> for English texts, "@" for texts without a language) and processed by
>> the tokenizer and filters described above. In addition, texts were also
>> stored within a field with the prefix "_!@" that combined all natural text
>> values of all languages.
>> To include string values in the search results, queries for natural
>> language texts were created to also search within the "str" field. Here is
>> an example of a query for "Rupert" within the field "rdfs:label":
>> "(_\!@/rdfs\:label/:Rupert OR str/rdfs\:label/:Rupert)"
>> However, this had one important shortcoming. The second term of the query
>> searched within a field that is not suited for natural language text
>> searches. To describe that in more detail, let's assume the value "Rupert
>> Westenthaler" is defined in the following two ways:
>> (1) defined as "rdfs:label" -> "Rupert Westenthaler" (PlainLiteral) would
>> end up as two tokens "rupert" and "westenthaler" within the "@/rdfs:label/"
>> and the "_!@/rdfs:label/" fields.
>> (2) defined as "rdfs:label" -> "Rupert Westenthaler"^^xsd:string
>> (TypedLiteral) would end up as one token "Rupert Westenthaler" within the
>> "str/rdfs:label/" field.
>> In case (1) the above query would select the document; in case (2) it
>> would not. This is because the query assumes that it searches natural
>> language values that are indexed as such, but the "str/rdfs:label/" field
>> does not fulfill this requirement.
>> Solution:
>> The solution is to change the indexing to index string values also within
>> the "_!@"-field.
>> This means that searches within that field assume that all
>> string values do actually represent natural language texts. Searches for
>> string values need to use the "str"-field. This assumes that string value
>> searches (e.g. for an ISBN number) will still work as intended while
>> searches for natural language texts also have access to string values.
>> As a positive side effect, natural language searches will no longer need to
>> search in two different fields (meaning that the OR clause shown in the
>> example above is no longer needed).
>> Additional Note:
>> It would also be possible to index natural language text values without a
>> defined language within the string field. This would remove the assumption
>> that each natural language text value does in fact represent natural text
>> and not a string. However, until someone can point to real world cases where
>> datasets wrongly use PlainLiterals instead of TypedLiterals with the data
>> type xsd:string, there is no practical advantage to that.
>
> --
> This message is automatically generated by JIRA.
> -
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>

-- Fabian
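
PS: To make the difference between the old and the new indexing described in the issue easier to follow, here is a rough sketch in Python. It is only a toy model and not the SolrYard code: the function names (analyze_text, index_value, text_query_matches) and the new_scheme flag are made up for illustration, and the "analysis" is just a whitespace split plus lowercasing instead of the real Solr filter chain. Only the field prefixes ("str", "@{lang}", "_!@") are taken from the issue text.

```python
from collections import defaultdict


def analyze_text(value):
    """Toy stand-in for the Solr text analysis: whitespace split + lowercase."""
    return [token.lower() for token in value.split()]


def index_value(doc, prop, value, lang=None, is_text=True, new_scheme=False):
    """Add one literal to a toy document (field name -> list of tokens).

    is_text=True models a PlainLiteral, is_text=False an xsd:string value.
    """
    merged_field = "_!@/%s/" % prop
    if is_text:
        tokens = analyze_text(value)
        doc["@%s/%s/" % (lang or "", prop)].extend(tokens)
        doc[merged_field].extend(tokens)
    else:
        # String values are kept as one unprocessed token.
        doc["str/%s/" % prop].append(value)
        if new_scheme:
            # New behaviour: string values are also indexed as text.
            doc[merged_field].extend(analyze_text(value))


def text_query_matches(doc, prop, term, new_scheme=False):
    """Model a natural language query for a single term."""
    hit = term.lower() in doc.get("_!@/%s/" % prop, [])
    if not new_scheme:
        # Old behaviour: OR clause against the untokenized "str" field.
        hit = hit or term in doc.get("str/%s/" % prop, [])
    return hit


old_doc = defaultdict(list)
index_value(old_doc, "rdfs:label", "Rupert Westenthaler", is_text=False)

new_doc = defaultdict(list)
index_value(new_doc, "rdfs:label", "Rupert Westenthaler", is_text=False,
            new_scheme=True)

print(text_query_matches(old_doc, "rdfs:label", "Rupert"))
print(text_query_matches(new_doc, "rdfs:label", "Rupert", new_scheme=True))
```

With the old scheme the xsd:string label is not found by the query for "Rupert", because the "str" field only holds the single token "Rupert Westenthaler". With the new scheme it is found via the "_!@" field, while an exact lookup in the "str" field still works as before.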
