OK, further to my email below, I've been testing with q=radioh?*

Basically the problem is that when searching artists, even though Radiohead has a big boost, documents with a smaller boost such as "Radiohead+Ani Di Franco" or "Radiohead+Michael Stipe" are returned ahead of it.

The debug output is below, but basically, for Radiohead and one of the others we get this:

radiohead+ani - 655391.5  * 0.046359334
radiohead     - 1150991.9 * 0.025442434

So it's fairly clear where the difference is. Looking at the numbers, the cause seems to be this line, which Radiohead gets:

8.781371 = idf(docFreq=4096)

While Radiohead+Ani is getting

16.000769 = idf(docFreq=2)

If I can alter this I think I'm sorted. What are idf and docFreq?
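
For reference, a small sketch of where those two idf values could come from. It assumes the index uses Lucene's default similarity, where idf = 1 + ln(numDocs / (docFreq + 1)); docFreq is the number of documents containing the term. The thread never states numDocs, so the sketch infers it from the debug numbers, and the function names are made up for illustration.

```python
import math

def idf(doc_freq, num_docs):
    """Assumed default-similarity formula: rarer terms get a larger idf."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def implied_num_docs(idf_value, doc_freq):
    """Invert the formula above to estimate the index size."""
    return (doc_freq + 1) * math.exp(idf_value - 1.0)

# Values copied from the debug output.
n_rare = implied_num_docs(16.000769, 2)       # term "radiohead+ani"
n_common = implied_num_docs(8.781371, 4096)   # term "radiohead"

# Both estimates should agree if the assumed formula is the one in use
# (they come out at roughly 9.8 million documents).
print(round(n_rare), round(n_common))
```

If the two estimates agree, that suggests the higher idf is just a consequence of "radiohead+ani" appearing in only 2 documents while "radiohead" appears in 4096.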


  <str name="id=1200360,internal_docid=159496">
30383.514 = (MATCH) sum of:
  30383.514 = (MATCH) weight(text:radiohead+ani in 159496), product of:
    0.046359334 = queryWeight(text:radiohead+ani), product of:
      16.000769 = idf(docFreq=2)
      0.0028973192 = queryNorm
    655391.5 = (MATCH) fieldWeight(text:radiohead+ani in 159496), product of:
      1.0 = tf(termFreq(text:radiohead+ani)=1)
      16.000769 = idf(docFreq=2)
      40960.0 = fieldNorm(field=text, doc=159496)
</str>
  <str name="id=979,internal_docid=9799640">
29284.035 = (MATCH) sum of:
  29284.035 = (MATCH) weight(text:radiohead in 9799640), product of:
    0.025442434 = queryWeight(text:radiohead), product of:
      8.781371 = idf(docFreq=4096)
      0.0028973192 = queryNorm
    1150991.9 = (MATCH) fieldWeight(text:radiohead in 9799640), product of:
      1.0 = tf(termFreq(text:radiohead)=1)
      8.781371 = idf(docFreq=4096)
      131072.0 = fieldNorm(field=text, doc=9799640)
</str>
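
To make the arithmetic in the two explanations explicit, here is a short sketch that recomputes both scores from the numbers in the debug output. It only multiplies the factors exactly as the explain tree nests them; the function and variable names are illustrative.

```python
def explain_score(idf, query_norm, tf, field_norm):
    """Recompute a score the way the explain output nests the factors."""
    query_weight = idf * query_norm          # queryWeight = idf * queryNorm
    field_weight = tf * idf * field_norm     # fieldWeight = tf * idf * fieldNorm
    return query_weight * field_weight       # note that idf enters twice

QUERY_NORM = 0.0028973192

# Factors copied from the two <str> entries.
score_ani = explain_score(16.000769, QUERY_NORM, 1.0, 40960.0)
score_radiohead = explain_score(8.781371, QUERY_NORM, 1.0, 131072.0)

# Radiohead's fieldNorm (which carries the boost) is 3.2 times larger ...
norm_ratio = 131072.0 / 40960.0
# ... but idf is squared in the score, and that ratio is slightly bigger.
idf_squared_ratio = (16.000769 / 8.781371) ** 2

print(score_ani, score_radiohead)     # should land near 30383.5 and 29284.0
print(norm_ratio, idf_squared_ratio)
```

Since idf appears in both queryWeight and fieldWeight, the rare term's advantage is squared, which is just enough here to outweigh the larger fieldNorm on the plain Radiohead document.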

Thanks a lot,

galo


galo wrote:
I was doing a different trick, basically searching q=radioh*+radioh~, and the results are slightly better than with ?*, but not great. By the way, the case sensitivity of wildcard terms of course matters here.

I'd like to have a look at that DisMax you have if you can post it, at least to compare results. The way I'm doing scoring, as I say, is far from perfect.

By the way, I'm seeing that the highlighting disappears when using these wildcards. Is that normal?

Thanks for your help,

galo

At 4:40 PM +0100 6/6/07, galo wrote:
>1. I want to use Solr for some sort of live search, querying with incomplete terms + wildcard and getting any similar results. Radioh* would return anything containing that string. The DisMax request handler doesn't accept wildcards in the q param, so I'm trying the simple one and still have problems, as all my results are coming back with score = 1 and I need them sorted by relevance. Is there a way of doing this? Why doesn't * work in DisMax (nor ~, by the way)?

DisMax was written with the intent of supporting a simple search box in which one could type or paste some text, e.g. a title like

    Santa Clause: Is he Real (and if so, what is "real")?

and get meaningful results. To do that it pre-processes the query string by removing unbalanced quotation marks and escaping characters that would otherwise be treated by the query parser as operators:

    \ ! ( ) : ^ [ ] { } ~ * ?
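
As an illustration only, here is a rough Python sketch of that kind of pre-processing. It is not Solr's actual Java code: the function names are made up, and the rule for unbalanced quotes (drop every quote when the count is odd) is an assumption.

```python
# The characters listed above, which get a backslash put in front of them.
OPERATOR_CHARS = set('\\!():^[]{}~*?')

def strip_unbalanced_quotes(query):
    """Assumed rule: if the quotes don't pair up, remove all of them."""
    if query.count('"') % 2 == 0:
        return query
    return query.replace('"', '')

def partial_escape(query):
    """Backslash-escape each operator character so it is matched literally."""
    return ''.join('\\' + ch if ch in OPERATOR_CHARS else ch for ch in query)

def preprocess(query):
    return partial_escape(strip_unbalanced_quotes(query))

print(preprocess('radioh*'))   # the * is escaped, so it is no longer a wildcard
print(preprocess('Santa Clause: Is he Real (and if so, what is "real")?'))
```

This is why * and ~ have no effect when typed into a DisMax query: by the time the query parser sees them they have been escaped into literal characters.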

I have a local version of DisMax which parameterizes the escaping so certain operators can be allowed through, which I'd be happy to contribute to you or the codebase, but I expect SimpleRH may be a better tool for your application than DisMaxRH, as long as you get it to score as you wish.

Both Standard and DisMax request handlers use SolrQueryParser, an extension of the Lucene query parser which introduces a small number of changes, one of which is that prefix queries e.g. Radioh* are evaluated with ConstantScorePrefixQuery rather than the standard PrefixQuery.

In issue SOLR-218 developers have been discussing per-field control of query parser options (some of it Solr's, some of it Lucene's). When that is implemented there should additionally be a property useConstantScorePrefixQuery analogous to the unfortunately-named QueryParser useOldRangeQuery, but handled by SolrQueryParser (until CSPQs are implemented as an option in Lucene QP).

Until that time, well, Chris H. posted a clever and rather timely workaround on the solr-dev list:

>one workaround people may want to consider ... is to force the use of a WildcardQuery in what would otherwise be interpreted as a PrefixQuery by putting a "?" before the "*"
>
>ie: auto?* instead of auto*
>
>(yes, this does require that at least one character follow the prefix)

Perhaps that would help in your case?
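
If the query string is built on the client side, the rewrite could be applied automatically. A minimal sketch, assuming the term is a single token typed by the user; the helper name is hypothetical and not part of Solr.

```python
def force_wildcard(term):
    """Turn a plain prefix term like 'auto*' into 'auto?*'.

    Terms that already contain another wildcard are left alone, since
    they would not be parsed as a simple prefix query in the first place.
    """
    prefix = term[:-1]
    is_plain_prefix = (
        term.endswith('*')
        and prefix != ''
        and '*' not in prefix
        and '?' not in prefix
    )
    return prefix + '?*' if is_plain_prefix else term

print(force_wildcard('radioh*'))   # radioh?*
print(force_wildcard('radioh?*'))  # unchanged
```

As the quoted note says, the rewritten term only matches when at least one more character follows the prefix, so "radioh?*" would miss a document whose term is exactly "radioh".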

- J.J.





