Re: Getting count of documents matching a query?

Chris Hostetter Fri, 07 Apr 2006 11:15:03 -0700

first off: you should double check the correctness of your customized
similarity class.  I'm pretty sure it's resulting in a different set of
matches than the DefaultSimilarity because your tf function returns 0f
regardless of whether there is a match.  (when i said "every function
returns 0 or 1" i meant you actually have to have the "if (match) return 1
else 0" logic at a minimum)
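
A rough sketch of that "0 or 1" logic, kept free of Lucene imports so it
compiles on its own. The class name MatchOnlySimilarity is made up for
illustration, and the method names only mirror the Similarity methods as I
remember them; in a real application this would extend DefaultSimilarity
and override them instead.

```java
// Stand-in sketch, NOT a Lucene class: it only mirrors the shape of a
// Similarity so the "if (match) return 1 else 0" idea can be shown alone.
final class MatchOnlySimilarity {

    /** 1 when the term occurs in the document at all, otherwise 0. */
    float tf(float freq) {
        return freq > 0f ? 1f : 0f;
    }

    // The remaining factors are held at a neutral 1 so that they never
    // turn a matching document's score into 0.
    float idf(int docFreq, int numDocs) { return 1f; }

    float coord(int overlap, int maxOverlap) { return 1f; }

    float queryNorm(float sumOfSquaredWeights) { return 1f; }

    float lengthNorm(String fieldName, int numTokens) { return 1f; }

    float sloppyFreq(int distance) { return 1f; }
}
```

The point is only that a match must still produce a non-zero factor; a tf
that always returns 0f cannot tell a match from a non-match.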


second: these times really shocked me until i realized you were
reporting the sum of the execution times for all 10000 iterations (whew!)
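
To read those totals per search, divide by the iteration count: 14095 ms
over 10000 searches is roughly 1.4 ms each. A trivial helper for that
arithmetic (PerSearchTime and meanMillis are hypothetical names, not part
of the attached test class):

```java
// Hypothetical helper: turns a run's total time into a mean per-search time.
final class PerSearchTime {

    /** Mean time per search in milliseconds for one benchmark run. */
    static double meanMillis(long totalMillis, int iterations) {
        if (iterations <= 0) {
            throw new IllegalArgumentException("iterations must be positive");
        }
        // Cast first so the division is done in floating point.
        return (double) totalMillis / iterations;
    }
}
```

For example, PerSearchTime.meanMillis(14095, 10000) gives about 1.41, and
PerSearchTime.meanMillis(13768, 10000) about 1.38.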

third: for queries this simple, i don't think you're going to find much
difference in speed between the different models, but if you eliminate
the Similarity as a variable (compare only #1 and #3 for each query), you
can see that HitCollector was fairly consistently "a little faster" than
using Hits (but i'll admit, i thought it would be "more faster")

I think if you tried bigger, more complex queries (nested booleans, with
mandatory/required clauses, span queries, booleans with 20 or more
optional clauses) you might see a bigger discrepancy.

it's also not clear how big your test index is; as doug has pointed out
before, index size can make a big difference in benchmarking.


: For each query I ran 4 tests of 10,000 searches:
: 1) using hits.length to get the counts and the standard similarity
: 2) using hits.length to get the counts and a custom similarity
: 3) using HitCollector to get the counts and the standard similarity
: 4) using HitCollector to get the counts and a custom similarity
:
: The custom similarity returns 0 for all methods.
: The results are kind of surprising. It doesn't look like the speed up is
: enough to make the change to our application.
:
: Here are the results, the test class is also attached:
:
: time (mills) 14095, useHC=false, standardSimilarity=true, count=47,
: query=abstract_recent:(genetically modified organism)
: time (mills) 15406, useHC=false, standardSimilarity=false, count=0,
: query=abstract_recent:(genetically modified organism)
: time (mills) 13768, useHC=true, standardSimilarity=true, count=47,
: query=abstract_recent:(genetically modified organism)
: time (mills) 14404, useHC=true, standardSimilarity=false, count=47,
: query=abstract_recent:(genetically modified organism)
:
:
: time (mills) 6790, useHC=false, standardSimilarity=true, count=5776,
: query=lname:smith
: time (mills) 4901, useHC=false, standardSimilarity=false, count=0,
: query=lname:smith
: time (mills) 5209, useHC=true, standardSimilarity=true, count=5776,
: query=lname:smith
: time (mills) 5578, useHC=true, standardSimilarity=false, count=5776,
: query=lname:smith
:
:
: time (mills) 47, useHC=false, standardSimilarity=true, count=0,
: query=lname:dfdsalkfjdsalkjflsa
: time (mills) 37, useHC=false, standardSimilarity=false, count=0,
: query=lname:dfdsalkfjdsalkjflsa
: time (mills) 41, useHC=true, standardSimilarity=true, count=0,
: query=lname:dfdsalkfjdsalkjflsa
: time (mills) 198, useHC=true, standardSimilarity=false, count=0,
: query=lname:dfdsalkfjdsalkjflsa
:
:
:
:
: On Thursday 06 April 2006 15:19, Chris Hostetter wrote:
: > : I need the count, and don't need the docs at this point. If I had a
: > : simple query, (e.g. "book") I can use docFreq(), and it's lightning
: > : fast. If I just run it as a query it's much slower. I'm just
: > : wondering if I did a custom scorer / similarity / hitcollector, how
: > : much faster than a query could I get it? Or is there a better way?
: >
: > A custom HitCollector would be the first big win, something like this
: > would probably work...
: >
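: >    // one-element array: the anonymous class can only capture final locals
: >    final int[] count = new int[1];
: >    searcher.search(query, new HitCollector() {
: >        public void collect(int doc, float score) {
: >            count[0]++;   // called once per matching doc
: >        }
: >    });
: >    return count[0];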
: >
: > other ways you might be able to shave time would be...
: >
: >   * if your query can be represented as simple set logic (you
: >     don't seem to be concerned with score) then implementing it as a
: >     Filter may be faster because it won't do any score calculation, just a
: >     simple match/no-match (which is what you seem to want) ... but it will
: >     definitely take up more memory than a query
: >
: >   * if you customize your similarity so that every function returns 0 or 1
: >     you might shave a little bit of time off by skipping some of the math
: >     equations ... but i really doubt it.
: >
: >
: >
: >
: > -Hoss
: >
: >
: > ---------------------------------------------------------------------
: > To unsubscribe, e-mail: [EMAIL PROTECTED]
: > For additional commands, e-mail: [EMAIL PROTECTED]
:
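
The Filter suggestion quoted above comes down to set logic over doc ids.
A minimal sketch of that idea using plain java.util.BitSet, with no Lucene
involved: each clause is assumed to have already been turned into a bit set
with one bit per matching doc id, and the class and method names are made
up for illustration.

```java
import java.util.BitSet;

// Hypothetical helper: counts docs that match every clause, given one
// BitSet per clause with a bit set for each matching doc id.
final class BitSetCount {

    /** Number of doc ids present in all of the given sets. */
    static int countAll(BitSet... clauses) {
        if (clauses.length == 0) {
            return 0;
        }
        // Work on a copy so the caller's (possibly cached) set is untouched.
        BitSet acc = (BitSet) clauses[0].clone();
        for (int i = 1; i < clauses.length; i++) {
            acc.and(clauses[i]); // intersection = all clauses required
        }
        return acc.cardinality(); // match/no-match only, no scoring
    }
}
```

The memory cost mentioned in the quoted message shows up here as one bit per
document in the index for every set that is kept around.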



-Hoss

