Re: Study Group (WAS Re: Normalized Scoring)

2005-02-07 Thread mark harwood
There is a series of good course notes from the Stanford course on IR: http://www.stanford.edu/class/cs276/handouts/lecture1.pdf to http://www.stanford.edu/class/cs276/handouts/lecture16.pdf These are from the course by Hinrich Schutze who co-authored "Foundations of Statistical Natural Languag

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-14 Thread mark harwood
>>For example, the RPC made by its rewrite() implementation could also return the docFreq() of each term in the rewritten query
I haven't been following the remoting conversation in detail but this may be relevant: Using the associated docFreq of each expanded term is not particularly be

More fuzzy issues - encouraging bad spelling?

2004-12-23 Thread mark harwood
Another thought on fuzzy scoring: shouldn't all these queries which automatically expand terms favour common words over rare ones? The default scoring behaviour at the moment favours rare words. As a user aren't I more likely to be looking for the most common expansions? If I'm not sure how to sp
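To illustrate the point about rare words being favoured, here is a minimal sketch of an inverse-document-frequency weight. It assumes the formula 1 + ln(numDocs / (docFreq + 1)), which is my recollection of the default similarity of that era rather than something stated in this thread. The terms and document frequencies are made up for illustration.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """IDF weight, assumed to be 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical expansions of a fuzzy query, with invented document frequencies.
num_docs = 10_000
expansions = {"lucene": 900, "lucine": 3, "lucent": 40}

# Weight each expanded term; the rarer the term, the larger the weight.
weights = {term: classic_idf(df, num_docs) for term, df in expansions.items()}

# Order the expansions from most to least heavily weighted.
ranked = sorted(weights, key=weights.get, reverse=True)

for term in ranked:
    print(f"{term}: docFreq={expansions[term]} idf={weights[term]:.2f}")
```

Under these assumptions the rare (and probably misspelt) variant receives the largest weight, which is the behaviour the message questions.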

FuzzyQuery scoring

2004-12-23 Thread mark harwood
Should we change the scoring behaviour of FuzzyQuery? The current approach of turning Foo~ into a large boolean query means that result scores are heavily diluted for matches. In my tests a search for Foo returns documents containing Foo with a score of 1. A search for Foo~ returns documents con

Re: potential new Lucene logo

2004-12-13 Thread mark harwood
I just tried closing the loop on the "e"s in the new logo and I think it looks a lot better for it - it looks a lot less like the "c"

Re: sandbox -> core ?

2004-10-08 Thread mark harwood
I have an updated version of that MoreLikeThis class with a couple of bug fixes and an optimisation. Where do you want me to put it? As for the Highlighter I'd personally be happy for it to move into core because it would avoid a lot of the "how do I get it/build it" questions that routinely crop

Re: Term highlighting and Term vector patch

2004-09-16 Thread mark harwood
>>You always have to maintain two versions of the highlighter
This shouldn't be necessary: the highlighter code works with any TokenStream, which should offer a suitable abstraction from the source of data (reanalysis or stored offsets). The only thing you would need to do would be to prov

Re: Notes on distributed searching with Lucene

2002-03-26 Thread Mark Harwood
Good to see that this has promoted some discussion :) Here is some feedback on some of the questions this has raised: 1) My application is able to partition the indexes according to some application-specific data. Each application would have to have its own scheme for partitioning. 2) The stubs

Notes on distributed searching with Lucene

2002-03-25 Thread Mark Harwood
I have written up some of my experiences with creating a distributed system with Lucene here: http://home.clara.net/markharwood/lucene/ It includes some UML interaction diagrams that I found useful in understanding the Lucene codebase. Cheers Mark