Thanks for the help guys, but unfortunately I am still stuck. Let me reiterate what I would like to do and then explain what I have tried.
I would like to know that query y appeared n times in document x. For example:

    query = "Bank"
    Bank found in doc number 1, 3 times

Understandably, this is a bit tricky when query y consists of more than one word, but for the moment I would be satisfied if I knew how many times query y appeared in its entirety. In the end, however, it would be great if I could get a result like this:

    query = "Hells Bells"
    Hells found in doc number 2, 3 times; Bells found 0 times

As per Erik's idea, I tried the BitSet as follows:

    QueryFilter qf = new QueryFilter(query);
    IndexReader ir = IndexReader.open(indexPath);
    Searcher searcher2 = new IndexSearcher(ir);

    // get the bit set for the query
    BitSet bits = qf.bits(ir);
    last = bits.nextSetBit(offset);
    offset = last + 1;
    System.out.println("First bit is: " + last);
    System.out.println("Bits " + bits.toString());

    // clear all the bits, then set only the first matching document again
    bits.clear();
    System.out.println("Bits after " + bits.toString());
    bits.set(last);

    // just to see the effect
    BitSet bits2 = qf.bits(ir);
    System.out.println("Bits now " + bits2.toString());

    Hits hits2 = searcher2.search(query, qf);
    // this value is always one
    System.out.println("raw hits : " + hits2.length());

However, I always get a result of 1, which I suppose has to do with this overlap thingy.

As per Ype's idea, I tried to implement a Similarity object (the methods are at the end of my message, just before the quoted reply), but I believe two things are wrong: a) I am doing something fundamentally wrong with the maths, and b) I have a sneaking suspicion that this is the wrong way to go about it. Is there not a simple way to just get some word statistics out of a file?

Once again, thanks for the input, and I look forward to a long fight.

    /** Constant, so the number of terms in the field is ignored. */
    public float lengthNorm(String fieldName, int numTerms) {
      return (float) 1.0;
    }

    /** Linear in freq; DefaultSimilarity uses <code>sqrt(freq)</code>. */
    public float tf(float freq) {
      return (float) (freq);
    }

    /** Constant; DefaultSimilarity uses <code>log(numDocs/(docFreq+1)) + 1</code>. */
    public float idf(int docFreq, int numDocs) {
      return (float) 1.0;
    }

--- Ype Kingma <[EMAIL PROTECTED]> wrote:
> Kent, Erik,
>
> On Saturday 29 November 2003 17:20, Erik Hatcher wrote:
> > I enjoy at least attempting to answer questions here, even if I'm half
> > wrong, so by all means correct me if I misspeak....
>
> Me too, :)
>
> > On Saturday, November 29, 2003, at 06:37 PM, Kent Gibson wrote:
> > > All I would like to know is how many times a query was
> > > found in a particular document. I have no problems
> > > getting the score from hits.score(). hits.length is
> > > the number of times in total that the query was found,
> > > however I want the number of times the query was
> > > found on a document by document basis. is this
> > > possible?
>
> Could you be a bit more precise on what you mean by 'the number of
> times the query was found'? For a single query term, it is
> straightforward, but what about eg. a query for three optional terms?
>
> > The 'coord' factor used in computing the score is exactly this. See
> > the javadoc for it:
> >
> > http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#coord(int,%20int)
>
> AFAIK, this overlap is the number of terms the document and the query
> have in common.
> For a query consisting of a single term, the overlap is always one,
> and the number of times the query occurs in a document is the term
> frequency in the document.
>
> > You could implement a custom Similarity to capture the "overlap" or
> > adjust the factor depending on what you're trying to accomplish.
> >
> > > The only idea I have is to rerun the search,
> > > but I can't even see how to run a search on only one
> > > document!
> >
> > You could always rerun a search with a Filter with only one bit enabled
> > and see if zero or one document is returned - that would be quite
> > trivial and fast.
>
> You could also implement a Similarity that ignores the total number
> of terms in the searched document field, see lengthNorm() in
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html
> As lengthNorm() is applied at indexing time, you will have to reindex
> for this to work for you.
> At query time you can then use a tf() implementation that is linear,
> instead of the default square root in DefaultSimilarity, and a constant
> idf(), instead of the default log of the inverse document frequency.
> You should then get a document score that is proportional
> to the number of query terms in the document.
>
> Kind regards,
> Ype
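
For reference, here is a minimal plain-Java sketch of the per-term, per-document counts asked for at the top of this message. It does not use Lucene at all. The names TermCounts, tokenize and countTerms are made up for illustration, and the letter/digit tokenizer is only an assumed stand-in for whatever Analyzer the index really uses, so the counts may differ from what an index would report.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts how often each query
// term occurs in the text of a single document.
class TermCounts {

    // Assumed tokenizer: lowercase, then split on anything that is not a
    // letter or digit. A real index would use its own Analyzer instead.
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // Returns one entry per query term, in query order, with the number of
    // times that term appears in docText (0 if it never appears).
    static Map<String, Integer> countTerms(String docText, String query) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : tokenize(query)) {
            if (!term.isEmpty()) {
                counts.put(term, 0);
            }
        }
        for (String token : tokenize(docText)) {
            Integer seen = counts.get(token);
            if (seen != null) {
                counts.put(token, seen + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String doc2 = "Hells hells, HELLS! No chimes here.";
        // prints {hells=3, bells=0}
        System.out.println(countTerms(doc2, "Hells Bells"));
    }
}
```

With lengthNorm() and idf() fixed at 1 and a linear tf(), as in the quoted suggestion, the sum of these per-term counts is the quantity the score should be proportional to. The per-term breakdown itself is not recoverable from that single score, which is why the sketch keeps a separate counter per term.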