[ 
https://issues.apache.org/jira/browse/STANBOL-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058554#comment-13058554
 ] 

Rupert Westenthaler commented on STANBOL-246:
---------------------------------------------

This is also related to the use of Tokenizers: if an untokenized field were 
used, then a query for "united states" would not even return the other 
Entities. See also the discussion in [1] on a similar topic.

Currently, tokenized fields are used to store any kind of natural language 
text. There is no way to distinguish between, let's say, "labels" and 
"longer texts". Requests like this one result in queries for values containing 
all the tokens ("united" AND "states" in the given example). This is the only 
reason we need to rank multiple results even though there is also an exact 
match.
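
As a minimal sketch of the behaviour described above (not the actual SolrYard 
code): the tokenizer below is a simplified stand-in that only lowercases and 
splits on whitespace, and the function names are made up for illustration. The 
three labels are the ones from the issue description.

```python
def tokenize(text):
    # Simplified stand-in for a Solr tokenizer (assumption):
    # lowercase the text and split it on whitespace.
    return text.lower().split()


def matches_tokenized(label, query):
    # Tokenized field: the label matches if it contains every query token,
    # i.e. "united" AND "states" for the query "united states".
    label_tokens = set(tokenize(label))
    return all(token in label_tokens for token in tokenize(query))


def matches_untokenized(label, query):
    # Untokenized field: the whole value has to be equal to the query.
    return label.lower() == query.lower()


labels = ["United States Navy", "United States Air Force", "United States"]
query = "united states"

tokenized_hits = [l for l in labels if matches_tokenized(l, query)]
exact_hits = [l for l in labels if matches_untokenized(l, query)]

print(tokenized_hits)  # all three labels match and need to be ranked
print(exact_hits)      # only "United States" matches
```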

I do not think that FuzzyQuery is a solution for this. For example, take the 
case where one searches for persons by family name. I would expect a fuzzy 
query to rank the person with the shortest given name highest, because that 
label has the smallest distance to the search term. Page Rank would instead 
return the most popular person as the first result, and currently the Entity 
with the most incoming links is ranked first.
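
A rough sketch of that expectation, under two assumptions: the edit distance 
is computed against the whole label (as it would be for an untokenized field), 
and the person names and link counts are invented for illustration only. It 
uses a plain Levenshtein distance, not Lucene's FuzzyQuery implementation.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert, delete, substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical entities: label -> number of incoming links (made-up values).
persons = {"Al Smith": 3, "Maria Smith": 40, "Alexander Smith": 500}
query = "smith"

# Ranking by edit distance to the query: the shortest given name wins.
by_distance = sorted(persons, key=lambda label: levenshtein(query, label.lower()))

# Ranking by incoming links: the most linked person wins.
by_links = sorted(persons, key=lambda label: persons[label], reverse=True)

print(by_distance)  # ['Al Smith', 'Maria Smith', 'Alexander Smith']
print(by_links)     # ['Alexander Smith', 'Maria Smith', 'Al Smith']
```

With these values the two rankings are exactly reversed, which is why I doubt 
that the distance to the search term is a useful signal here.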

Maybe there is also a possibility to tell Solr to search only for exact matches 
or to rank exact matches first.
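
One way this could look, as a sketch only: combine a boosted clause on an 
untokenized copy of the label with the usual token query, using the standard 
Lucene query syntax ("^" for boosts). The field names "label_exact" and 
"label" and the helper build_query are hypothetical; the SolrYard does not 
necessarily have such fields, and whether the stored values are lowercased 
depends on the schema.

```python
def build_query(name, exact_field="label_exact", text_field="label", boost=10):
    # Escape backslashes and double quotes so the name is a valid phrase.
    phrase = name.replace("\\", "\\\\").replace('"', '\\"')
    # Token clause as it is generated today: all tokens must be present.
    # Note: other query-syntax special characters are not escaped here.
    tokens = " AND ".join(name.split())
    # Exact matches on the untokenized field get a higher score via the boost;
    # all other documents still match through the tokenized field.
    return f'{exact_field}:"{phrase}"^{boost} OR {text_field}:({tokens})'


print(build_query("united states"))
# label_exact:"united states"^10 OR label:(united AND states)
```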

Regardless of that, if FuzzyQuery is available in Solr we should add it as an 
option to the TextConstraint!

[1] 
http://markmail.org/thread/gyuyxo4sn33dknph#query:+page:1+mid:b5mr2cx4oiutkvli+state:results

> Exact name match should get boosted in the entity hub SolrYard indices
> ----------------------------------------------------------------------
>
>                 Key: STANBOL-246
>                 URL: https://issues.apache.org/jira/browse/STANBOL-246
>             Project: Stanbol
>          Issue Type: Bug
>            Reporter: Olivier Grisel
>            Assignee: Rupert Westenthaler
>         Attachments: united_states_dbpedia_solrindex.json
>
>
> For instance, using the default embedded solryard index:
> {code}
> curl -X POST -d "name=United States&limit=10&offset=0" http://localhost:8080/entityhub/site/dbpedia/find
> {code}
> The first results are "United States Navy" and "United States Air Force" and 
> finally "United States" comes in the third position. See the attached JSON 
> output.
> Exact name match (or close to exact matches) should get a score boost. This 
> can probably be implemented with FuzzyQuery and minSimilarity of 0.8f for 
> instance.
> https://lucene.apache.org/java/3_3_0/api/all/org/apache/lucene/search/FuzzyQuery.html
> Maybe in this case the popularity boosts are bad because of the naive 
> incoming link counts. Using a Page Rank style centrality score might work 
> better in this case:
> https://github.com/julienledem/Pig-scripting-examples/tree/master/Page%20Rank
> https://github.com/mesos/spark/blob/master/bagel/src/main/scala/spark/bagel/examples/WikipediaPageRank.scala

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
