OK, further to my email below, I've been testing with q=radioh?*
Basically the problem is that when searching artists, even though
Radiohead has a big boost, documents with a smaller boost are returned
ahead of it, such as "Radiohead+Ani Di Franco" or "Radiohead+Michael Stipe"
The debug output is below, but basically, for Radiohead and one of the
others we get this:
radiohead+ani - 655391.5 * 0.046359334
radiohead - 1150991.9 * 0.025442434
So it's fairly clear where the difference is. Looking at the numbers,
the cause seems to be in this line:
8.781371 = idf(docFreq=4096)
While Radiohead+Ani is getting
16.000769 = idf(docFreq=2)
If I can alter this I think it's sorted. What are idf and docFreq?
<str name="id=1200360,internal_docid=159496">
30383.514 = (MATCH) sum of:
30383.514 = (MATCH) weight(text:radiohead+ani in 159496), product of:
0.046359334 = queryWeight(text:radiohead+ani), product of:
16.000769 = idf(docFreq=2)
0.0028973192 = queryNorm
655391.5 = (MATCH) fieldWeight(text:radiohead+ani in 159496), product of:
1.0 = tf(termFreq(text:radiohead+ani)=1)
16.000769 = idf(docFreq=2)
40960.0 = fieldNorm(field=text, doc=159496)
</str>
<str name="id=979,internal_docid=9799640">
29284.035 = (MATCH) sum of:
29284.035 = (MATCH) weight(text:radiohead in 9799640), product of:
0.025442434 = queryWeight(text:radiohead), product of:
8.781371 = idf(docFreq=4096)
0.0028973192 = queryNorm
1150991.9 = (MATCH) fieldWeight(text:radiohead in 9799640), product of:
1.0 = tf(termFreq(text:radiohead)=1)
8.781371 = idf(docFreq=4096)
131072.0 = fieldNorm(field=text, doc=9799640)
</str>
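The explain output above can be checked by hand: each score is queryWeight times fieldWeight, so idf enters the product twice. The sketch below redoes that arithmetic with the values copied from the debug output. The idf formula, 1 + ln(numDocs / (docFreq + 1)), is the one used by Lucene's DefaultSimilarity; the index size is not stated in the thread, so the code only infers it from the printed idf values. The function names are illustrative, not part of Solr or Lucene.

```python
import math

QUERY_NORM = 0.0028973192  # copied from the explain output


def score(idf, tf, field_norm, query_norm=QUERY_NORM):
    """Single-term score laid out the way the explain output shows it."""
    query_weight = idf * query_norm        # queryWeight = idf * queryNorm
    field_weight = tf * idf * field_norm   # fieldWeight = tf * idf * fieldNorm
    return query_weight * field_weight


def implied_num_docs(idf, doc_freq):
    """Invert idf = 1 + ln(numDocs / (docFreq + 1)) to estimate numDocs."""
    return (doc_freq + 1) * math.exp(idf - 1.0)


radiohead_ani = score(idf=16.000769, tf=1.0, field_norm=40960.0)
radiohead = score(idf=8.781371, tf=1.0, field_norm=131072.0)

print("radiohead+ani:", radiohead_ani)
print("radiohead:    ", radiohead)
# idf is squared in the score, fieldNorm (which carries the boost) is not.
print("idf^2 ratio:     ", (16.000769 / 8.781371) ** 2)
print("fieldNorm ratio: ", 131072.0 / 40960.0)
# Both idf values should imply roughly the same index size.
print("implied numDocs: ", implied_num_docs(16.000769, 2),
      implied_num_docs(8.781371, 4096))
```

Because idf is squared, the rare term radiohead+ani gains a factor of about 3.32 from idf alone, which is just enough to outweigh the 3.2 times larger fieldNorm on the Radiohead document.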
Thanks a lot,
galo
galo wrote:
I was doing a different trick, basically searching q=radioh*+radioh~,
and the results are slightly better than with ?*, but not great. By the
way, the case sensitivity of wildcard terms matters here, of course.
I'd like to have a look at that DisMax you have if you can post it, at
least to compare results. As I said, the way I'm doing scoring is far
from perfect.
By the way, I'm seeing that highlighting disappears when using these
wildcards. Is that normal?
Thanks for your help,
galo
At 4:40 PM +0100 6/6/07, galo wrote:
>1. I want to use solr for some sort of live search, querying with
>incomplete terms + wildcard and getting any similar results. Radioh*
>would return anything containing that string. The DisMax request handler
>doesn't accept wildcards in the q param so I'm trying the simple one
>and still have problems as all my results are coming back with score =
>1 and I need them sorted by relevance. Is there a way of doing this?
>Why doesn't * work in dismax (nor ~ by the way)?
DisMax was written with the intent of supporting a simple search box
in which one could type or paste some text, e.g. a title like
Santa Clause: Is he Real (and if so, what is "real")?
and get meaningful results. To do that it pre-processes the query
string by removing unbalanced quotation marks and escaping characters
that would otherwise be treated by the query parser as operators:
\ ! ( ) : ^ [ ] { } ~ * ?
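As an illustration of the pre-processing described above, here is a minimal sketch in Python. It is not Solr's code: the function names are hypothetical, and the policy of dropping every double quote when the count is odd is an assumption about what "removing unbalanced quotation marks" means. The set of escaped characters is taken from the list above.

```python
# Operator characters listed in the message above.
ESCAPED_CHARS = set('\\!():^[]{}~*?')


def strip_unbalanced_quotes(text):
    """Drop all double quotes when their count is odd (assumed policy)."""
    if text.count('"') % 2 == 1:
        return text.replace('"', '')
    return text


def escape_operators(text):
    """Prefix each listed operator character with a backslash."""
    return ''.join('\\' + ch if ch in ESCAPED_CHARS else ch for ch in text)


def preprocess(text):
    """Apply both steps to a raw search-box string."""
    return escape_operators(strip_unbalanced_quotes(text))


print(preprocess('Santa Clause: Is he Real (and if so, what is "real")?'))
print(preprocess('radioh*'))
```

With this kind of escaping, the trailing * in radioh* reaches the query parser as a literal character, which is why wildcards typed into the q param have no effect.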
I have a local version of DisMax which parameterizes the escaping so
certain operators can be allowed through, which I'd be happy to
contribute to you or the codebase, but I expect SimpleRH may be a
better tool for your application than DisMaxRH, as long as you get it
to score as you wish.
Both Standard and DisMax request handlers use SolrQueryParser, an
extension of the Lucene query parser which introduces a small number
of changes, one of which is that prefix queries e.g. Radioh* are
evaluated with ConstantScorePrefixQuery rather than the standard
PrefixQuery.
In issue SOLR-218 developers have been discussing per-field control of
query parser options (some of it Solr's, some of it Lucene's). When
that is implemented there should additionally be a property
useConstantScorePrefixQuery analogous to the unfortunately-named
QueryParser useOldRangeQuery, but handled by SolrQueryParser (until
CSPQs are implemented as an option in Lucene QP).
Until that time, well, Chris H. posted a clever and rather timely
workaround on the solr-dev list:
>one workaround people may want to consider ... is to force the use
>of a WildcardQuery in what would otherwise be interpreted as a
>PrefixQuery by putting a "?" before the "*"
>
>ie: auto?* instead of auto*
>
>(yes, this does require that at least one character follow the prefix)
Perhaps that would help in your case?
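A small sketch of that rewrite, for building the query string on the client side. The helper name force_wildcard is hypothetical, and Python's fnmatchcase is used only as a stand-in for wildcard matching, on the assumption that ? matches exactly one character and * matches zero or more; it is not what Lucene runs internally.

```python
from fnmatch import fnmatchcase


def force_wildcard(term):
    """Rewrite a trailing-* prefix term (auto*) as auto?*; leave others alone."""
    if len(term) > 1 and term.endswith('*') and not term.endswith('?*'):
        return term[:-1] + '?*'
    return term


pattern = force_wildcard('radioh*')
print(pattern)

# The extra "?" means the bare prefix itself no longer matches.
for candidate in ['radioh', 'radiohead', 'radiohead+ani']:
    print(candidate, fnmatchcase(candidate, pattern))
```

The loop shows the caveat from the quoted note: a term equal to the prefix, with nothing following it, is not matched by the rewritten pattern.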
- J.J.