To add to what Mark is saying, it's very important to watch out for the "first N results" effect. If you showed users a random set of documents with crap relevance, I'll bet you that a good number would still click on the first result (call it user laziness or the Google "I'm Feeling Lucky" effect :)). You can A/B test results with some entropy added to the ordering, or try to determine your own result position normalizers.
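As a rough sketch of what a result position normalizer could look like (one possible approach, not anything Lucene ships): estimate a baseline click-through rate per rank from your own logs, then weight each click by how much rarer clicks are at that rank than at the top rank. The class name PositionNormalizer, its methods, and the add-one smoothing are all made up for illustration.

```java
/** Hypothetical helper: weights a click by how unlikely a click at that rank is. */
final class PositionNormalizer {
    // baselineCtr[i] = smoothed clicks / impressions observed at rank i (0-based)
    private final double[] baselineCtr;

    PositionNormalizer(long[] impressionsByRank, long[] clicksByRank) {
        if (impressionsByRank.length == 0
                || impressionsByRank.length != clicksByRank.length) {
            throw new IllegalArgumentException("need one non-empty entry per rank");
        }
        baselineCtr = new double[impressionsByRank.length];
        for (int i = 0; i < baselineCtr.length; i++) {
            // Add-one smoothing (an assumption) keeps the ratio defined
            // for ranks that have little or no data.
            baselineCtr[i] = (clicksByRank[i] + 1.0) / (impressionsByRank[i] + 2.0);
        }
    }

    /** Weight of one click at the given rank, relative to a click at rank 0. */
    double clickWeight(int rank) {
        // Ranks outside the observed range are clamped to the nearest known rank.
        int r = Math.min(Math.max(rank, 0), baselineCtr.length - 1);
        return baselineCtr[0] / baselineCtr[r];
    }
}
```

You would then add clickWeight(rank) to a query-to-doc click total instead of adding 1, so a click far down the list counts for more than a lazy click on the first result. Where the baseline comes from (ideally traffic with randomized ordering) matters more than the exact formula.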
You could also keep your own stable doc ID and mark documents with it, maybe an MD5 of the title, and then have an external boost file that maps queries to docs. Then at query time you boost the result documents accordingly.

-M

On Sun, Dec 23, 2007 at 2:15 AM, mark harwood <[EMAIL PROTECTED]> wrote:

> Thanks for the context - much more useful.
> The challenge here is similar to that posed by offering end-user tagging
> of content (see here
> http://www.mail-archive.com/java-user@lucene.apache.org/msg17580.html ).
> The main difference here being that words are added to docs implicitly by
> search click-throughs rather than any explicit tagging action.
>
> In both cases the challenge is that the user data around documents is
> likely to be updated very often while the documents remain relatively
> static.
> I suspect some additional things to think about are:
> 1) Cancelling out the "human laziness" bias that favours clicking results
> on page 1. Are clicks on page 2 worth more?
> 2) Spam clicks - detecting deliberate gaming of your re-ranking algorithm.
> 3) Lucene doc IDs are not stable - how will you associate query
> terms/click data with documents and join them at speed?
> 4) Are individual words or phrases the unit of boost? "Paris" means
> different things in "Paris Hilton" and "Paris, France".
>
> A simple approach might be to re-index your content with all of the
> additional search terms from clicks added to the associated document in a
> "searchClicks" field - the more clicks, the more repetitions of the same
> search words in the document to help with tf (Term Frequency). This
> additional content would need to be capped, to avoid huge documents. This
> has the disadvantage of requiring a re-index though.
> Another option to avoid reindexing everything is to wrap IndexReader (See
> FilterIndexReader) and implement TermEnum/TermDocs for a fake field called
> "searchClicks". The idea is Lucene looks after the usual, static document
> content while your implementation goes off to your more volatile storage
> (e.g. database/parallel index, custom file structure) to retrieve lists of
> doc ids, term frequencies etc. for this "searchClicks" field. All of the
> Lucene queries you might want to throw at this e.g. PhraseQueries can then
> test both the static Lucene fields and your new volatile "click" fields
> without being aware of this low-level trickery.
>
> I'm sure there will be other ways of doing this too but this seems like a
> conceptually clean way of modelling it - just seeing search terms as
> extensions to the document content.
>
> Cheers
> Mark
>
>
> ----- Original Message ----
> From: sumittyagi <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Sunday, 23 December, 2007 5:30:55 AM
> Subject: Re: Which file in the lucene package is used to manipulate
> results..
>
>
> Actually what i have to do is...
> 1.) for every query(keyword), among the results obtained, the keyword will
> be mapped with the page clicked, along with the no. of clicks for that
> keyword on that page
> 2.) next time for the same query(keyword), the mapped pages will be ranked
> higher considering the no. of clicks too..
> 3.) for every new query these steps will be repeated...
> this was a very high level view , i have made algorithms for these modules
> and trying to incorporate with lucene but dont know , on which files i have
> to do edition to make it work...
> please help me regarding this, if you need some more explanation, please let
> me know...
> thanks
> Sumit Tyagi
>
>
> Erick Erickson wrote:
> >
> > You still haven't explained *why* you want to rerank results. What
> > is the use-case you're trying to implement? Quite often it's turned
> > out for me that when I let folks on the list know what the use
> > case I'm trying to support is, they come up with much more elegant
> > solutions than I was thinking about.
> >
> > For instance, does the CustomScoreQuery class have any relevance
> > to your problem?
> >
> > If you're thinking of modifying the core Lucene code for your
> > special purpose, I'd advise against it unless and until you'd exhausted
> > all the other options. It's always a maintenance headache to do this.
> >
> > Best
> > Erick
> >
> > On Dec 21, 2007 10:09 AM, sumittyagi <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> actually i am writing a module to rerank the results, so i want to edit
> >> the file which arrange the results and give them ranks,
> >> or is there any other way i can use my module to rerank the results
> >>
> >>
> >> markharw00d wrote:
> >> >
> >> > I think you need to describe your "factors" in more detail. Exactly
> >> > what do you want to achieve for your users?
> >> > We could be talking about any number of Lucene functions here.
> >> >
> >> > ----- Original Message ----
> >> > From: sumittyagi <[EMAIL PROTECTED]>
> >> > To: java-user@lucene.apache.org
> >> > Sent: Friday, 21 December, 2007 4:51:09 AM
> >> > Subject: Which file in the lucene package is used to manipulate
> >> > results..
> >> >
> >> >
> >> > hi, i am using lucene for the very first time and want to manipulate
> >> > the results, by adding some more factors to it, which file should i
> >> > edit to manipulate the search results....
> >> >
> >> > Thanks
> >> > Sumit Tyagi
> >> > --
> >> > View this message in context:
> >> > http://www.nabble.com/Which-file-in-the-lucene-package-is-used-to-manipulate-results..-tp14450335p14450335.html
> >> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >> >
> >> > __________________________________________________________
> >> > Sent from Yahoo! Mail - a smarter inbox http://uk.mail.yahoo.com
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> > For additional commands, e-mail: [EMAIL PROTECTED]
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Which-file-in-the-lucene-package-is-used-to-manipulate-results..-tp14450335p14456938.html
> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >>
> >
>
> --
> View this message in context:
> http://www.nabble.com/Which-file-in-the-lucene-package-is-used-to-manipulate-results..-tp14450335p14476062.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.