On Sat, 2006-05-27 at 01:17 +0200, karl wettin wrote:
> Will report back with results in a month or so.

Here is a report on my very simple SiteGeist:

I have about 200,000 documents in my corpus. All search results are
passed on to the SiteGeist class, which contains a Map<Content, Double>
where the double represents the total score of the top n (5 in my case,
quite unscientific) results. This value is updated by a secondary thread
to avoid synchronization and loss of data.
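
A minimal sketch of this design, assuming the secondary thread is a
single-threaded executor and that each hit in the top n adds its own
score to the total for its content. The names Hit, report, snapshot and
close are hypothetical, as is the generic parameter standing in for
Content; the real SiteGeist class may well differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch of the idea, not the actual SiteGeist class. */
final class SiteGeist<C> {

    /** One search hit: the content and the score the search gave it. */
    static final class Hit<T> {
        final T content;
        final double score;

        Hit(T content, double score) {
            this.content = content;
            this.score = score;
        }
    }

    private final int topN;

    // Read and written only on the worker thread, so a plain HashMap suffices.
    private final Map<C, Double> totals = new HashMap<>();

    // The "secondary thread": one daemon thread that applies updates in order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "sitegeist-worker");
        t.setDaemon(true);
        return t;
    });

    SiteGeist(int topN) {
        this.topN = topN;
    }

    /** Called by search threads. Hits are assumed to be sorted best-first. */
    void report(List<Hit<C>> hits) {
        // Copy the top n so later changes to the caller's list cannot race with us.
        List<Hit<C>> top = new ArrayList<>(hits.subList(0, Math.min(topN, hits.size())));
        worker.execute(() -> {
            for (Hit<C> hit : top) {
                totals.merge(hit.content, hit.score, Double::sum);
            }
        });
    }

    /** Returns a copy of the totals, taken on the worker thread after queued updates. */
    Map<C, Double> snapshot() {
        Callable<Map<C, Double>> copy = () -> new HashMap<>(totals);
        try {
            return worker.submit(copy).get();
        } catch (Exception e) {
            throw new IllegalStateException("could not read totals", e);
        }
    }

    /** Stops the worker; updates that are already queued are still applied. */
    void close() {
        worker.shutdown();
    }
}
```

Search threads only enqueue work, so they never wait on a lock, and
because a single thread applies every update in order, none is lost.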

It's not too bad. Most important, it is fast.

However, since I use the results and not the query, I have to mine the
data (currently with my eyes) in order to see what people have been
searching for. An example of this is the results of the query "vista".
Does it mean they are looking for "buena vista social club" or
"microsoft vista"? My current strategy with a "mean document result
score" will not work too well here, so I'll have to consider a mean
classification of the results before I choose what document to boost.

Another way to go is to track which result people choose to click on.
This is probably much better. It is also worth considering whether the
user found what they were looking for. Given that a corpus is supposed
to contain everything, a failed search would mean something is missing.
Or that the user has bad query skills. I would then have to analyze the
query and keep track of whether and how the query is refined. What
should go on the "wish list" and what should be considered a bad query?
This might consume too many resources for my application.

Perhaps something as simple as mining on a time axis will help me. I'll
be adding some new dimensions to the statistics and will report back in
a few weeks.

