On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:

Hi Grant,

It's true, I may have an X-Y problem here. =)

My basic need is to sacrifice recall to achieve greater precision. Rather than always presenting the user with the top N documents, I need to return *only* the documents that seem relevant. For some searches this may be 3 documents, for some it may be none.

Therein lies the rub. How are you determining what is relevant? In some sense, you are asking Lucene to determine what is relevant, namely which documents best match a given query according to its scoring model, and then turning around and telling it you are not happy with the answer (I'm exaggerating a bit, I know). As an alternate tack, I usually look at this type of thing and try to make my queries more precise (e.g. replace OR with AND, introduce phrase queries, add filters or NOT clauses or other qualifiers) or apply some other relevance tricks [1], [2].

That being said, I could see determining a delta value such that if the gap between two adjacent scores in the ranked list is more than the delta, you cut off the rest of the docs. This takes into account the relative state of the scores rather than some arbitrary absolute value (although the delta itself is, of course, arbitrary).
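
A minimal sketch of that delta idea, written in Python for brevity rather than against the Lucene API. It assumes the hits are already available as (doc id, score) pairs sorted by descending score; the function name `cut_at_score_gap` and the sample scores are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Hit = Tuple[str, float]  # (doc id, score); an assumed shape, not a Lucene type


def cut_at_score_gap(hits: Sequence[Hit], delta: float) -> List[Hit]:
    """Keep hits up to the first gap between adjacent scores larger than delta.

    `hits` must already be sorted by descending score.
    """
    kept: List[Hit] = []
    previous_score = None
    for doc_id, score in hits:
        # Stop at the first drop that exceeds the allowed gap.
        if previous_score is not None and previous_score - score > delta:
            break
        kept.append((doc_id, score))
        previous_score = score
    return kept


# Made-up scores: the drop from 1.8 to 0.6 exceeds delta, so "d" and "e" go.
sample_hits = [("a", 2.1), ("b", 1.9), ("c", 1.8), ("d", 0.6), ("e", 0.5)]
print(cut_at_score_gap(sample_hits, delta=0.5))
```

Note that a gap rule like this always keeps the top hit when there is one, so returning no documents at all would need some additional criterion.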

Since you are allowing the user to "explore", it may be more reasonable to cut off at some point, too, but I still don't know of a good generic way to determine what that point is. Maybe with some specific knowledge about how you are creating your queries and which query terms matched you could come up with something, but still, I am uncertain.

The other thing that strikes me is that you could add some type of learning/memory component that tracks click-through information and feeds it back into the system as a relevance signal.
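
One way such a feedback component might look, as a rough Python sketch. Everything here is an assumption made for illustration: the `ClickFeedback` class, the in-memory click counts keyed by (query, doc id), and the logarithmic boost formula are not part of Lucene and are only one of many possible choices.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Hit = Tuple[str, float]  # (doc id, score); an assumed shape, not a Lucene type


class ClickFeedback:
    """Hypothetical in-memory click-through store used to re-rank hits."""

    def __init__(self) -> None:
        self._clicks: Dict[Tuple[str, str], int] = defaultdict(int)

    def record_click(self, query: str, doc_id: str) -> None:
        # Remember that a user clicked this document for this query.
        self._clicks[(query, doc_id)] += 1

    def rerank(self, query: str, hits: Sequence[Hit]) -> List[Hit]:
        # Assumed boost: original score scaled by 1 + ln(1 + clicks), so
        # documents with no clicks keep their score unchanged.
        def adjusted(hit: Hit) -> float:
            doc_id, score = hit
            clicks = self._clicks.get((query, doc_id), 0)
            return score * (1.0 + math.log1p(clicks))

        return sorted(hits, key=adjusted, reverse=True)
```

A real system would also need to persist the counts and guard against position bias (top results get clicked more simply because they are on top), which this sketch ignores.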



My user interface in this case isn't the standard "type words in a box and we'll show you the best docs" - I'm using Lucene as a tool in the background to do some exploration about how I could augment a set of traditional results with a few alternative results gleaned from a different path.

Not sure if this helps with the X-Y problem, but that's my task at hand.

Yes.

Also, keep in mind there are other techniques for encouraging exploration: clustering, faceting, and info extraction (identifying named entities, etc., and presenting them).

Just throwing out some food for thought.



Also, while perusing the threads you refer to below, I saw a reference to the following link, which seems to have gone dead:

 https://issues.apache.org/bugzilla/show_bug.cgi?id=31841

Hmm, Bugzilla has moved to JIRA. I'm not sure where the mapping is anymore. There used to be a Bugzilla ID in JIRA, I think. Sorry.

-Grant

[1] http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-in-Search/
[2] http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-in-Lucene-and-Solr/
