Thanks for this detailed explanation.
Both use cases (with or without a tokeniser) are justified, depending
on the situation.
To work around this:
- I first tried a regex request like "^Apache$", but this doesn't work
because, if I remember correctly, SolrYard doesn't accept regex
requests. Is that true?
- I then set up a loop that tests each retrieved entity and selects
only the exact match.
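The loop I mean looks roughly like the sketch below. It assumes the labels of the retrieved entities are already available as plain strings; the class and method names are hypothetical and not part of the Entityhub API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExactLabelFilter {

    /**
     * Keeps only the labels equal to the searched term,
     * ignoring case and surrounding whitespace.
     */
    public static List<String> exactMatches(List<String> labels, String term) {
        String wanted = term.trim();
        List<String> result = new ArrayList<>();
        for (String label : labels) {
            if (label != null && label.trim().equalsIgnoreCase(wanted)) {
                result.add(label);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Labels as they might come back from a tokenised search for "Apache".
        List<String> retrieved =
                Arrays.asList("Apache", "Apache fondation", "Apache bylaw");
        System.out.println(exactMatches(retrieved, "Apache"));
    }
}
```

The drawback is that every tokenised hit still has to be fetched before it can be filtered out.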
IMO, the choice between the untokenised and the tokenised version would
be better made in the code rather than at indexing time.
For example, one could use
if (getUntokenisedResults("Westenthaler") == null) {
    getTokenisedResults("Westenthaler");
}
Just an idea...
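To make the idea a bit more concrete, here is a minimal sketch. The two lookups are passed in as functions because no such methods exist in SolrYard today; all names here are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

class FallbackLookup {

    /**
     * Runs the strict (untokenised) lookup first and falls back to the
     * broader (tokenised) one only when the strict lookup returns
     * null or an empty list.
     */
    public static List<String> lookup(String term,
                                      Function<String, List<String>> untokenised,
                                      Function<String, List<String>> tokenised) {
        List<String> exact = untokenised.apply(term);
        if (exact != null && !exact.isEmpty()) {
            return exact;
        }
        List<String> broad = tokenised.apply(term);
        return broad != null ? broad : Collections.<String>emptyList();
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the two kinds of index lookups.
        Function<String, List<String>> untokenised =
                t -> Collections.<String>emptyList();
        Function<String, List<String>> tokenised =
                t -> Arrays.asList("Rupert Westenthaler");
        System.out.println(lookup("Westenthaler", untokenised, tokenised));
    }
}
```

This still needs both versions of the label in the index, so it only moves the decision to query time.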
I'm not sure I understand your last sentence correctly:
> However I could imagine that this would require a lot of changes to
> the current code, because currently the code assumes that only one of
> language and data type is present at the same time.
For now, the SKOS I use is only in FR, but we plan to add EN
information to it.
Would it be possible to index and retrieve entities for both
languages?
Thanks for this really great enhancement.
Using SolrYard is so much faster than using a D2RQ link. For the same
(pretty big) document:
- 22 seconds for SolrYard
- 2/3 min for D2RQ
++
On 06/10/2011 07:44 PM, Rupert Westenthaler wrote:
Hi
Good Question ...
Text Fields are indexed by using tokenizers in Solr. Therefore a
search for "Apache" will find all documents (entities) that have this
token for the skos:prefLabel field. This is the reason why you also
get "Apache fondation", "Apache bylaw", etc... even if the PatternType
is set to "none".
As far as I know the only way around this is to deactivate any
tokenizers for such a field. However, without a tokenizer a query for
"Westenthaler" would not return "Rupert Westenthaler", which would also
seem strange to a lot of users.
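The difference between the two behaviours can be illustrated with a small sketch. It is only a simplified stand-in (lowercasing plus whitespace splitting), not the analysis chain Solr actually runs, and the class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class TokenMatchDemo {

    /** Tokenized matching: any whitespace-separated token may equal the term. */
    public static boolean tokenizedMatch(String label, String term) {
        List<String> tokens =
                Arrays.asList(label.toLowerCase(Locale.ROOT).trim().split("\\s+"));
        return tokens.contains(term.toLowerCase(Locale.ROOT).trim());
    }

    /** Untokenized matching: the whole lowercased label must equal the term. */
    public static boolean untokenizedMatch(String label, String term) {
        return label.toLowerCase(Locale.ROOT).trim()
                .equals(term.toLowerCase(Locale.ROOT).trim());
    }

    public static void main(String[] args) {
        // With a tokenizer, "Apache" also hits the composed label.
        System.out.println(tokenizedMatch("Apache bylaw", "Apache"));
        // Without one, a single word no longer hits a full name.
        System.out.println(untokenizedMatch("Rupert Westenthaler", "Westenthaler"));
    }
}
```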
To deactivate Tokenizers for a natural language field one needs to
modify the solr schema (schema.xml). Having both (tokenized and
un-tokenized) versions is currently not possible.
Here are the necessary additions to schema.xml to deactivate
tokenizing for skos:prefLabel:
<!-- one field for each language -->
<field name="@en/skos:prefLabel/" type="lowercase" indexed="true"
stored="true" multiValued="true"/>
<field name="@de/skos:prefLabel/" type="lowercase" indexed="true"
stored="true" multiValued="true"/>
<field name="@it/skos:prefLabel/" type="lowercase" indexed="true"
stored="true" multiValued="true"/>
<field name="@fr/skos:prefLabel/" type="lowercase" indexed="true"
stored="true" multiValued="true"/>
<field name="@/skos:prefLabel/" type="lowercase" indexed="true"
stored="true" multiValued="true"/>
<!-- used for multi lingual searches -->
<field name="_!@/skos:prefLabel/" type="lowercase" indexed="true"
stored="false" multiValued="true"/>
If this is a frequently needed feature I could modify the SolrYard to
use suffixes for languages. This would allow indexing multiple versions
of natural language texts with different prefixes. The prefixes would
then indicate whether a tokenizer should be used or not.
However I could imagine that this would require a lot of changes to
the current code, because currently the code assumes that only one of
language and data type is present at the same time.
best
Rupert Westenthaler
On Fri, Jun 10, 2011 at 11:07 AM, florent andré
<[email protected]> wrote:
Hi Rupert, *,
As promised in Berlin, I have a question for you! :)
I have this query:
FieldQuery query = site.getQueryFactory().createFieldQuery();
query.setConstraint(NamespaceEnum.skos + "prefLabel",
new TextConstraint(signToFind));
query.addSelectedField(NamespaceEnum.skos + "related");
query.addSelectedField(NamespaceEnum.skos + "narrower");
query.addSelectedField(NamespaceEnum.skos + "broader");
query.addSelectedField(NamespaceEnum.skos + "inScheme");
query.setLimit(this.numSuggestions);
When signToFind is a one-word term, e.g. "Apache", I get all composed
terms that contain this word, e.g. "Apache fondation", "Apache bylaw", etc...
That could be interesting in some cases, but not always.
As I read in your documentation, there is:
- patternType: one of "wildcard", "regex" or "none" (default is "none")
As I don't define a pattern type, it should be "none", so it should be a
strict match, right?
So, in this case I should get only the entity matching the one-word term,
or am I missing something?
Thanks
++