Re: raw hit count

Kent Gibson Sun, 30 Nov 2003 09:17:20 -0800

Thanks for the help guys, but unfortunately I am still
stuck. Let me reiterate what I would like to do and
then explain what I have tried.


I would like to know that in document x the query y
appeared n times. 

For example:

query = "Bank": Bank found in doc number 1, 3 times

Understandably this is a bit tricky when query y is
composed of more than one word, but for the moment I
would be satisfied if I knew how many times query y
appeared in its entirety.

However, in the end it would be great if I could get a
result as follows:
query = "Hells Bells": Hells found in doc number 2, 3
times, and Bells found 0 times

As per Erik's idea, I tried with the BitSet as follows:

// needs java.util.BitSet, org.apache.lucene.index.IndexReader and
// org.apache.lucene.search.{Hits, IndexSearcher, QueryFilter, Searcher}
QueryFilter qf = new QueryFilter(query);
IndexReader ir = IndexReader.open(indexPath);
Searcher searcher2 = new IndexSearcher(ir);

// get the bit set for the query: one bit per matching document
BitSet bits = qf.bits(ir);
int offset = 0;
int last = bits.nextSetBit(offset);   // number of the first matching document
offset = last + 1;

System.out.println("First bit is: " + last);
System.out.println("Bits " + bits);

// clear all the bits, then switch the first matching document back on
bits.clear();
System.out.println("Bits after " + bits);
bits.set(last);

// just to see the effect
BitSet bits2 = qf.bits(ir);
System.out.println("Bits now " + bits2);

Hits hits2 = searcher2.search(query, qf);
// this value is always one
System.out.println("raw hits : " + hits2.length());

However, I always get a result of 1, which I suppose
has to do with this overlap thingy.
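
A result of 1 is also what a filter with a single bit enabled would
give if the hit count is a count of matching documents rather than of
occurrences, which is how Erik describes it in the quoted reply ("see
if zero or one document is returned"). The following is only a sketch
of that reasoning using plain java.util.BitSet. It does not call
Lucene, and the class and method names are made up for illustration.

```java
import java.util.BitSet;

// Hypothetical stand-in for "document matches the query AND passes the filter".
class OneBitFilterSketch {

    /** Number of documents that both match the query and are enabled in the filter. */
    static int filteredHitCount(BitSet queryMatches, BitSet filter) {
        BitSet both = (BitSet) queryMatches.clone(); // leave the inputs untouched
        both.and(filter);
        return both.cardinality();
    }

    public static void main(String[] args) {
        // Assume the query matches documents 1, 3 and 7.
        BitSet queryMatches = new BitSet();
        queryMatches.set(1);
        queryMatches.set(3);
        queryMatches.set(7);

        // Filter with only the first matching document enabled.
        BitSet filter = new BitSet();
        filter.set(queryMatches.nextSetBit(0));

        // One document survives, however often the term occurs inside it.
        System.out.println("raw hits : " + filteredHitCount(queryMatches, filter));
    }
}
```

Under that reading, the filtered search answers "does this one document
match?" and says nothing about how many times the term occurs in it.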

As per Ype's idea I tried to implement a Similarity
object, but I believe two things are wrong: a) I am
doing something fundamentally wrong with the maths, and
b) I have a sneaking suspicion this is the wrong way to
go about it.

Is there not a simple way to just get some word
statistics out of a file?
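
To pin down the statistic being asked for, here is a minimal sketch in
plain Java with no Lucene involved. The class name, the lower-casing
and the split on non-letter characters are all assumptions made for
the example; a real index would use whatever Analyzer was applied at
indexing time.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: per-word occurrence counts of a query in one document's text.
class TermCountSketch {

    /** Counts how often each query word occurs in the text, ignoring case. */
    static Map<String, Integer> countTerms(String query, String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Start every query word at zero so that absent words are reported too.
        for (String word : query.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                counts.put(word, 0);
            }
        }
        // Walk the document once and bump the counter of each query word seen.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            Integer n = counts.get(token);
            if (n != null) {
                counts.put(token, n + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String doc2 = "Hells kitchen, hells angels, and more hells.";
        System.out.println(countTerms("Hells Bells", doc2)); // {hells=3, bells=0}
    }
}
```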

Once again, thanks for the input, and I look forward to
a long fight.

/** Constant, so that the field length does not affect the score. */
public float lengthNorm(String fieldName, int numTerms)
{
    return 1.0f;
}

/** Linear in freq, instead of the default <code>sqrt(freq)</code>. */
public float tf(float freq)
{
    return freq;
}

/** Constant, instead of the default
 *  <code>log(numDocs/(docFreq+1)) + 1</code>. */
public float idf(int docFreq, int numDocs)
{
    return 1.0f;
}
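
The effect these overrides are meant to have can be sketched without
Lucene. The model below is deliberately simplified and is an
assumption, not Lucene's actual scoring formula: it fixes idf and
lengthNorm at 1 and ignores coord, boosts and any normalization of the
final score, so the score is just the sum of tf(freq) over the query
terms. All names are hypothetical.

```java
// Simplified scoring model, for illustration only.
class LinearTfSketch {

    /** Square-root term frequency, as in the default described in the quoted reply. */
    static float tfDefault(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Linear term frequency, as in the tf() override above. */
    static float tfLinear(float freq) {
        return freq;
    }

    /** Sum of tf(freq) over the query terms, with idf and lengthNorm fixed at 1. */
    static float score(int[] termFreqs, boolean linear) {
        float sum = 0f;
        for (int freq : termFreqs) {
            sum += linear ? tfLinear(freq) : tfDefault(freq);
        }
        return sum;
    }

    public static void main(String[] args) {
        // A two-term query where the first term occurs 4 times and the second never.
        int[] freqs = {4, 0};
        System.out.println("linear tf: " + score(freqs, true));  // 4.0
        System.out.println("sqrt tf:   " + score(freqs, false)); // 2.0
    }
}
```

In this model the linear variant returns the total number of
occurrences of the query terms, while the square-root variant
compresses it. Whether the score that comes back from a real search
keeps that proportionality is a separate question that this sketch
does not answer.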
--- Ype Kingma <[EMAIL PROTECTED]> wrote:
> Kent, Erik,
> 
> On Saturday 29 November 2003 17:20, Erik Hatcher wrote:
> > I enjoy at least attempting to answer questions here, even if I'm half
> > wrong, so by all means correct me if I misspeak....
> 
> Me too, :)
> 
> > On Saturday, November 29, 2003, at 06:37 PM, Kent Gibson wrote:
> > > All I would like to know is how many times a query was
> > > found in a particular document. I have no problems
> > > getting the score from hits.score(). hits.length is
> > > the number of times in total that the query was found,
> > > however I want the number of times the query was
> > > found on a document by document basis. Is this
> > > possible?
> 
> Could you be a bit more precise on what you mean
> by 'the number of times the query was found'? For a single
> query term, it is straightforward, but what about e.g. a query for
> three optional terms?
> 
> >
> > The 'coord' factor used in computing the score is exactly this.  See
> > the javadoc for it:
> >
> > http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#coord(int,%20int)
> 
> AFAIK, this overlap is the number of terms the document and the query
> have in common.
> For a query consisting of a single term, the overlap is always one,
> and the number of times the query occurs in a document is the term
> frequency in the document.
> 
> > You could implement a custom Similarity to capture the "overlap" or
> > adjust the factor depending on what you're trying to accomplish.
> >
> > > The only idea I have is to rerun the search,
> > > but I can't even see how to run a search on only one
> > > document!
> >
> > You could always rerun a search with a Filter with only one bit enabled
> > and see if zero or one document is returned - that would be quite
> > trivial and fast.
> 
> You could also implement a Similarity that ignores the total number
> of terms in the searched document field, see lengthNorm() in
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html
> As lengthNorm() is applied at indexing time, you will have to reindex
> for this to work for you.
> At query time you can then use a tf() implementation that is linear,
> instead of the default square root in DefaultSimilarity, and a constant
> idf(), instead of the default log of the inverse document frequency.
> You should then get a document score that is proportional
> to the number of query terms in the document.
> 
> Kind regards,
> Ype
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 



