On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:
> Hi Grant,
> It's true, I may have an X-Y problem here. =)
> My basic need is to sacrifice recall to achieve greater precision. Rather
> than always presenting the user with the top N documents, I need to return
> *only* the documents that seem relevant. For some searches this may be 3
> documents, for some it may be none.
Therein lies the rub. How are you determining what is relevant? In
some sense, you are asking Lucene to determine what is relevant and
then turning around and telling it you are not happy with it doing
what you told it to do (I'm exaggerating a bit, I know), namely tell
you what the relevant documents are for a given query and a set of
documents based on its scoring model. As an alternate tack, I
usually look at this type of thing and try to figure out a way to make
my queries more precise (e.g. replace OR with AND, introduce phrase
queries, filter or add NOT clauses or some other qualifiers) or to apply
some other relevance tricks [1], [2].
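As a rough illustration of tightening a query, here is a small sketch that builds a stricter query string in Lucene's classic query parser syntax, where `+` marks a required clause, `-` a prohibited one, and double quotes a phrase. The helper name `tighten_query` and the sample terms are made up for this example, and the sketch does not escape special characters in the terms.

```python
def tighten_query(required_terms, phrase=None, excluded_terms=()):
    """Build a stricter query string in Lucene classic query syntax.

    Hypothetical helper: every term becomes a required (+) clause,
    an optional phrase is added as a required phrase query, and
    excluded terms become prohibited (-) clauses.
    """
    clauses = [f"+{term}" for term in required_terms]
    if phrase:
        clauses.append(f'+"{phrase}"')
    clauses.extend(f"-{term}" for term in excluded_terms)
    return " ".join(clauses)


# A loose query such as "apache lucene scoring" matches any of the terms
# when the parser's default operator is OR. The stricter form requires
# both terms and the phrase, and rejects documents mentioning "solr".
strict = tighten_query(
    ["apache", "lucene"],
    phrase="scoring model",
    excluded_terms=["solr"],
)
print(strict)  # +apache +lucene +"scoring model" -solr
```

The resulting string would then be handed to whatever query parser the application already uses; fewer documents match, which trades recall for precision before any score cutoff is considered.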
That being said, I could see maybe determining a delta value such that
if the gap between two adjacent scores in the ranked list is more than
the delta, you cut off the rest of the docs. This takes into account the
relative state of the scores rather than some arbitrary absolute value
(although the delta itself is, of course, arbitrary).
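A minimal sketch of that delta idea, assuming the scores have already been pulled out of the search results as a plain list sorted from best to worst. The function name `cut_at_score_gap` and the sample scores are invented for illustration, and "distance between two scores" is read here as the gap between neighbours in the ranked list.

```python
def cut_at_score_gap(scores, delta):
    """Keep the leading scores up to the first drop larger than delta.

    Assumes `scores` is sorted in descending order, as a ranked hit
    list would be. Everything after the first large gap is discarded.
    """
    if not scores:
        return []
    kept = [scores[0]]
    for previous, current in zip(scores, scores[1:]):
        if previous - current > delta:
            break  # first gap wider than delta: drop the rest
        kept.append(current)
    return kept


scores = [9.1, 8.7, 8.5, 4.2, 4.0, 1.3]  # made-up scores
print(cut_at_score_gap(scores, delta=2.0))  # [9.1, 8.7, 8.5]
```

Note that a gap rule by itself always keeps at least the top hit whenever anything matched, so it cannot produce the "none" case on its own; that would need some additional test on the top score or on which query terms matched.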
Since you are allowing the user to "explore", it may be more
reasonable to cut off at some point, too, but I still don't know of a
good way to determine what that point is in a generic way. Maybe with
some specific knowledge about how you are creating your queries and
what query terms matched you could come up with something, but still,
I am uncertain.
The other thing that strikes me is that you could add in some type of
learning/memory component that tracks your click-through information
and feeds it back into the system as a relevance signal.
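One way such a feedback component might look, purely as a sketch: count clicks per (query, document) pair and use the count to boost the engine's score before re-sorting. The class name `ClickFeedback`, the `weight` parameter, and the logarithmic boost formula are all assumptions chosen for illustration, not anything Lucene provides.

```python
import math
from collections import Counter


class ClickFeedback:
    """Hypothetical click-through memory that re-ranks search hits."""

    def __init__(self, weight=0.5):
        self.weight = weight     # how strongly clicks influence the score
        self.clicks = Counter()  # (query, doc_id) -> number of clicks

    def record_click(self, query, doc_id):
        self.clicks[(query, doc_id)] += 1

    def rerank(self, query, hits):
        """Boost each (doc_id, score) hit by its click count and re-sort.

        log1p dampens the effect so a heavily clicked document cannot
        dominate without limit; unclicked documents keep their score.
        """
        boosted = [
            (doc_id,
             score * (1.0 + self.weight * math.log1p(self.clicks[(query, doc_id)])))
            for doc_id, score in hits
        ]
        return sorted(boosted, key=lambda hit: hit[1], reverse=True)


feedback = ClickFeedback()
for _ in range(3):
    feedback.record_click("lucene scoring", "doc-b")

hits = [("doc-a", 2.0), ("doc-b", 1.8)]  # made-up engine scores
reranked = feedback.rerank("lucene scoring", hits)
print([doc_id for doc_id, _ in reranked])  # ['doc-b', 'doc-a']
```

In a real system the click log would need to be persisted and probably decayed over time, and clicks are a noisy signal since position in the list influences what gets clicked.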
> My user interface in this case isn't the standard "type words in a box and
> we'll show you the best docs" - I'm using Lucene as a tool in the background
> to do some exploration about how I could augment a set of traditional
> results with a few alternative results gleaned from a different path.
> Not sure if this helps with the X-Y problem, but that's my task at hand.
Yes.
Also, keep in mind there are other techniques for encouraging
exploration: clustering, faceting, info extraction (identifying named
entities, etc., and presenting them).
Just throwing out some food for thought.
> Also, while perusing the threads you refer to below, I saw a reference to
> the following link, which seems to have gone dead:
> https://issues.apache.org/bugzilla/show_bug.cgi?id=31841
Hmm, Bugzilla has moved to JIRA. I'm not sure where the mapping is
anymore. There used to be a Bugzilla ID in JIRA, I think. Sorry.
-Grant
[1] http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-in-Search/
[2] http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-in-Lucene-and-Solr/