On Sat, 2006-05-27 at 01:17 +0200, karl wettin wrote:
> Will report back with results in a month or so.

Here is a report on my very simple SiteGeist:

I have about 200,000 documents in my corpus. All search results are passed on to the SiteGeist class, which contains a Map<Content, Double> where the double represents the total score of the top n results (n is 5 in my case, quite unscientific). This value is updated by a secondary thread to avoid synchronization and loss of data.

It's not too bad. Most important, it is fast. However, since I use the results and not the query, I have to mine the data (currently with my eyes) in order to see what people have been searching for.

An example of this could be the results of the query "vista". Does it mean they are looking for "buena vista social club" or "microsoft vista"? My current strategy of using a "mean document result score" will not work too well here, so I'll have to consider a mean classification of the results before I choose what document to boost.

Another way to go is to track which result people choose to click on. This is probably much better.

It is also worth considering whether the user found what they were looking for. If they did not, then, given that a corpus is supposed to contain everything, something is missing. Or the user has bad query skills. I would then have to analyze the query and keep track of how and if the query is refined. What should go on the "wish list", and what should be considered a bad query? This might consume too many resources for my application.

Perhaps something as simple as mining along a time axis will help me.

I'll be adding some new dimensions to the statistics and report back in a few weeks.
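
A minimal sketch of the score bookkeeping described above, not the actual SiteGeist code. It assumes that each of the top n hits of a search adds its score to a running total for that document, and that hits arrive sorted by descending score. The names Hit, record, snapshot and close are hypothetical, and the key type is left generic in place of Content. All map updates run on one background thread, so the search threads never take a lock on the map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: running total of top-n hit scores per document. */
final class SiteGeist<K> {

    /** One search hit: a document key and its score (illustrative type). */
    static final class Hit<K> {
        final K doc;
        final double score;
        Hit(K doc, double score) { this.doc = doc; this.score = score; }
    }

    private final int topN;

    // Read and written only by the single worker thread, so a plain HashMap suffices.
    private final Map<K, Double> totals = new HashMap<>();

    // Daemon thread so that a forgotten close() does not keep the JVM alive.
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "sitegeist-updater");
        t.setDaemon(true);
        return t;
    });

    SiteGeist(int topN) { this.topN = topN; }

    /** Called from search threads; assumes hits are sorted by descending score. */
    void record(List<Hit<K>> hits) {
        // Copy the top n so the caller can reuse its list after this call returns.
        List<Hit<K>> top = new ArrayList<>(hits.subList(0, Math.min(topN, hits.size())));
        worker.execute(() -> {
            for (Hit<K> h : top) {
                totals.merge(h.doc, h.score, Double::sum);
            }
        });
    }

    /** Copies the totals on the worker thread, after all previously queued updates. */
    Map<K, Double> snapshot() throws InterruptedException, ExecutionException {
        Callable<Map<K, Double>> copy = () -> new HashMap<>(totals);
        return worker.submit(copy).get();
    }

    /** Stops accepting new updates; already queued ones still run. */
    void close() { worker.shutdown(); }
}
```

Because the executor has a single thread, updates are applied in submission order and none are dropped, at the cost of the totals lagging slightly behind the searches. Tracking clicks instead of result scores would only change what is passed to record.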