Controlling Hits

2006-11-24 Thread Otis Gospodnetic
Hi,

Could we make Hits non-final, or at least expose something in Hits to control 
the number of Documents it reads from disk?
I'm looking at:
  getMoreDocs(50); // retrieve 100 initially
...
  int n = min * 2;// double # retrieved
(Why * 2?  To "read and cache ahead"?)

... and wondering how I could control those two numbers.  If reading Documents 
out of the index is the thing to minimize, then by changing/decreasing that 50 
and maybe eliminating that * 2, I would hit my disk a little less, no?  Maybe 
something as simple as this will do:

  public static void fetchDocs(int)

Or maybe the answer is: Use the search method that returns TopDocs if you want 
more control...?
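
To make the effect of those two numbers concrete, here is a minimal sketch in Java. It is a simplified model, not the actual Hits source: the class HitsGrowthSketch, the method searchSizes, and its parameters are hypothetical names, and it assumes that every cache miss reruns the search asking for (number of hits needed) * factor results.

```java
import java.util.ArrayList;
import java.util.List;

public class HitsGrowthSketch {

    /**
     * Simplified model (an assumption, not the real Hits code): returns the
     * result-list size requested by each search while a client reads hits
     * 0..lastIndex in order.
     */
    public static List<Integer> searchSizes(int initialMin, int factor,
                                            int lastIndex, int totalMatches) {
        if (initialMin < 1 || factor < 1) {
            throw new IllegalArgumentException("initialMin and factor must be >= 1");
        }
        List<Integer> sizes = new ArrayList<>();

        // Initial fetch: the analogue of getMoreDocs(50) with "* 2".
        int n = initialMin * factor;
        sizes.add(n);
        int cached = Math.min(n, totalMatches);

        for (int i = 0; i <= lastIndex && i < totalMatches; i++) {
            if (i >= cached) {
                // Cache miss: rerun the search with a larger result list.
                int needed = Math.max(i + 1, cached);
                n = needed * factor;
                sizes.add(n);
                cached = Math.min(n, totalMatches);
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Current values: start at 50, double on every fetch.
        System.out.println(searchSizes(50, 2, 199, 1000));
        // Smaller start, no doubling: count how many searches are run.
        System.out.println(searchSizes(10, 1, 199, 1000).size());
    }
}
```

Under this model, walking through the first 200 hits with the current values (50 and * 2) runs two searches, asking for 100 and then 202 results. Starting at 10 with no doubling asks for fewer results per search but runs 191 searches, so the two numbers trade result-list size against re-searching.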

Thanks,
Otis




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Controlling Hits

2006-11-24 Thread Marvin Humphrey


On Nov 24, 2006, at 10:38 PM, Otis Gospodnetic wrote:

> Could we make Hits non-final, or at least expose something in Hits
> to control the number of Documents it reads from disk?
>
> I'm looking at:
>   getMoreDocs(50); // retrieve 100 initially
> ...
>   int n = min * 2;// double # retrieved
> (Why * 2?  To "read and cache ahead"?)
>
> ... and wondering how I could control those two numbers.  If
> reading Documents out of the index is the thing to minimize, then
> by changing/decreasing that 50 and maybe eliminating that * 2, I
> would hit my disk a little less, no?  Maybe something as simple as
> this will do:
>
>   public static void fetchDocs(int)


FWIW, in KinoSearch, this is the interface:

  my $hits = $searcher->search( query => $query );
  $hits->seek( $offset, $num_wanted );
  while ( my $hit = $hits->fetch_hit_hashref ) {
   print "$hit->{title}\n";
  }

The actual scoring doesn't happen until seek() gets called.  That  
way, it's possible to know exactly how many hits need to be  
collected, by adding $offset and $num_wanted.
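
For readers on the Java side, here is a rough sketch of the same idea. It is not KinoSearch or Lucene code: SeekSketch, ScoredDoc, and seek are hypothetical names, the scores array stands in for whatever the scorer produces, and ties are assumed to be broken in favor of the lower document number. The point is only that knowing $offset + $num_wanted up front lets the collector keep a bounded queue.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SeekSketch {

    /** A (document number, score) pair; hypothetical, for illustration only. */
    static final class ScoredDoc {
        final int doc;
        final float score;
        ScoredDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /**
     * Returns the document numbers ranked offset .. offset+numWanted-1 by
     * descending score, never holding more than offset + numWanted candidates.
     * scores[i] is assumed to be the score of document i.
     */
    public static List<Integer> seek(float[] scores, int offset, int numWanted) {
        if (offset < 0 || numWanted < 1) {
            throw new IllegalArgumentException("offset must be >= 0, numWanted >= 1");
        }
        int limit = offset + numWanted;

        // Orders the weakest hit first: lowest score, then highest doc number.
        Comparator<ScoredDoc> weakestFirst = (a, b) -> {
            int byScore = Float.compare(a.score, b.score);
            return byScore != 0 ? byScore : Integer.compare(b.doc, a.doc);
        };
        PriorityQueue<ScoredDoc> heap = new PriorityQueue<>(limit, weakestFirst);

        for (int doc = 0; doc < scores.length; doc++) {
            ScoredDoc candidate = new ScoredDoc(doc, scores[doc]);
            if (heap.size() < limit) {
                heap.add(candidate);
            } else if (weakestFirst.compare(candidate, heap.peek()) > 0) {
                heap.poll();          // evict the weakest retained hit
                heap.add(candidate);
            }
        }

        // Rank the survivors best-first, then drop the first `offset` of them.
        List<ScoredDoc> ranked = new ArrayList<>(heap);
        ranked.sort(weakestFirst.reversed());
        List<Integer> page = new ArrayList<>();
        for (int i = offset; i < ranked.size(); i++) {
            page.add(ranked.get(i).doc);
        }
        return page;
    }

    public static void main(String[] args) {
        float[] scores = {0.1f, 0.9f, 0.5f, 0.7f, 0.3f, 0.8f};
        System.out.println(seek(scores, 2, 2)); // third and fourth best hits
    }
}
```

Every matching document is still scored; the saving is that the queue never grows past offset + numWanted entries.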


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






Re: Controlling Hits

2006-11-26 Thread Nadav Har'El
On Fri, Nov 24, 2006, Otis Gospodnetic wrote about "Controlling Hits":
> Hi,
> 
> Could we make Hits non-final, or at least expose something in Hits to control 
> the number of Documents it reads from disk?
>...
> Or maybe the answer is: Use the search method that returns TopDocs if you 
> want more control...?

In an application I was writing, I faced similar issues: "Hits" was fine
for a short demo of Lucene, but when it came to a real application, it didn't
give me enough control: it reran the search too many times when I wanted
to see, e.g., the 20th result page, and it wouldn't let me add a HitCollector,
which I needed. I started by modifying Hits (which wasn't just final; much
of its functionality was private), but then realized: there's simply no
reason to use Hits! IndexSearcher.search(), which returns TopDocs, already
gives you full control, and frankly isn't that much harder to use.

In fact, I fail to see a situation where Hits's concept of "random access"
to the results (you can ask for result #30 and then #70) even makes sense.
In all search applications I'm familiar with, at the time you call search(),
you already know how many results you want to display, and you don't need
someone to guess for you that you need 50 results, and if that's not enough
then 100 results, and then 200, and so on.
And since this concept of "random access" is what differentiates Hits from
TopDocs, perhaps we don't need Hits at all?

So, how about deprecating Hits altogether, and recommending the TopDocs
alternatives instead?

-- 
Nadav Har'El              | Sunday, Nov 26 2006, 5 Kislev 5767
IBM Haifa Research Lab    |------------------------------------------
                          | God created the world out of nothing, but
http://nadav.harel.org.il | the nothingness still shows through.




Re: Controlling Hits

2006-11-26 Thread Otis Gospodnetic
Heh, brave! :)  I haven't used TopDocs enough to feel this strongly against
Hits yet.
But while we are at it, what I think I really want is what Marvin does in 
KinoSearch:

   my $hits = $searcher->search( query => $query );
   $hits->seek( $offset, $num_wanted );

This suits my experience and typical use of Lucene.  I always know which "page" 
of results I want, and how many matches I want to show per page, so I always 
know the offset and always know how many matches after that offset I need. If I 
show 10 results per page, and want to get a third page of results, ideally I'd 
do as little work as possible for the first 20 matches, and just get the slice 
I need.  Of course, I'll still need to go through the first 20 and score them, 
but in the end I'll just throw them out.
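
The page arithmetic is small enough to write down. A minimal sketch in Java, where PageWindowSketch and window are hypothetical names and pages are assumed to be numbered from 1:

```java
public class PageWindowSketch {

    /**
     * Returns {offset, numWanted, hitsToCollect} for a 1-based page number.
     * hitsToCollect is how many top hits must be gathered before slicing.
     */
    public static int[] window(int page, int perPage) {
        if (page < 1 || perPage < 1) {
            throw new IllegalArgumentException("page and perPage must be >= 1");
        }
        int offset = (page - 1) * perPage;     // hits skipped before this page
        int hitsToCollect = offset + perPage;  // skipped hits plus the page itself
        return new int[] { offset, perPage, hitsToCollect };
    }

    public static void main(String[] args) {
        int[] w = window(3, 10);
        System.out.println("offset=" + w[0] + " wanted=" + w[1] + " collect=" + w[2]);
    }
}
```

For the third page at 10 per page this gives offset 20 and 30 hits to collect, of which the first 20 are scored and then discarded.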

Otis




Re: Controlling Hits

2006-11-27 Thread Mark Miller
The only thing Hits gives that I would like to have without the expense
(Hits is expensive to use this way) is that you can do a search and get
all of the results back with sorting. Sorting appears to be built into
TopDocs, so you don't get it with a HitCollector. If you try to use
TopDocs instead of Hits, then you need to know how many docs will
match. You do not have that info before doing the search, yet TopDocs
requires it (for sorting and non-sorting searches alike) to initialize
its priority queues to the correct size. Hits is also nice for
normalizing scores for you.
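
On the normalization point, here is a minimal sketch of what doing it by hand might look like. It assumes, from my reading of Hits, that scores are only scaled down when the top raw score exceeds 1.0; check the Hits source for your version before relying on that. ScoreNormSketch and normalize are hypothetical names.

```java
public class ScoreNormSketch {

    /**
     * Scales scores so the best one is at most 1.0. Scores are left
     * untouched when the maximum is already <= 1.0 (assumed behavior).
     */
    public static float[] normalize(float[] rawScores) {
        float max = 0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        float norm = (max > 1.0f) ? 1.0f / max : 1.0f;

        float[] out = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = rawScores[i] * norm;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] normalized = normalize(new float[] {4.0f, 2.0f, 1.0f});
        for (float s : normalized) {
            System.out.println(s);
        }
    }
}
```

With TopDocs you would apply something like this yourself to the scores you get back, if you want them on a 0..1 scale.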


- Mark




Re: Controlling Hits

2006-11-27 Thread Chris Hostetter

: In fact, I fail to see a situation where "Hits"'s concept of "random access"
: to the results (you can ask for result #30 and then #70) even makes sense.
: In all search applications I'm familiar with, at the time you call search(),
: you already know how many results you want to display - and you don't need
: someone to guess for you that you need 50 results, and if that's not enough
: then you need 100 results, and then 200, and so on.

Hits attempts to do the best it can for simple clients without forcing
the client to use an API that requires knowing exactly how it will get
used in advance of executing the search -- this can be practical in a
simple MVC framework where the Controller may not know exactly how the
View wants to display things; if the Controller is forced to tell the
Searcher how many results it wants and at what offset, then that's one
more place that information has to be used (instead of in just the View)

as for the random access: imagine you are outputting a multi-column listing
of N results per page, using an HTML table: you display # 1, then # N/2+1,
then a row boundary, then # 2, then # N/2+2, etc... likewise imagine you
want to prefetch some data about each item on your "page" so you iterate
over all of them once to get their image path (or something) then you
iterate over all of them again to display the results.
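
The two-column case can be sketched in a few lines of Java. TwoColumnOrderSketch and displayOrder are hypothetical names, and the choice to give the left column the extra item when N is odd is an assumption:

```java
import java.util.ArrayList;
import java.util.List;

public class TwoColumnOrderSketch {

    /**
     * Returns 1-based result numbers in the order a two-column table emits
     * them, row by row, when results run down the left column first.
     */
    public static List<Integer> displayOrder(int n) {
        List<Integer> order = new ArrayList<>();
        int rows = (n + 1) / 2; // left column takes the extra item if n is odd
        for (int row = 1; row <= rows; row++) {
            order.add(row);             // left cell
            int right = rows + row;     // right cell, if it exists
            if (right <= n) {
                order.add(right);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(displayOrder(10));
    }
}
```

For N = 10 the access order is 1, 6, 2, 7, 3, 8, 4, 9, 5, 10, which is the kind of non-sequential access that an index-based API like Hits makes trivial.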

...there are lots of possible use cases, and that's what Hits is all about:
making it trivially easy for people to do just about anything with a
really simplistic API -- without concern for performance.   once
performance becomes your concern, then you start reconsidering your API,
and have to understand more about the Lucene internals to make the right
choice for your application.

I hate the Hits class with a passion -- almost as much as I hate the
RangeQuery class -- i have yet to read/hear a question asked for which i
thought "Hits" was the right answer ... but i don't think we should ever
deprecate it unless it's because we've added an even *simpler* API ...
Like QueryParser, StandardAnalyzer, IndexModifier, and the IndexSearcher
constructor that takes a "String" argument, Hits provides a super low
barrier to entry that lets people start writing Lucene apps without
needing to understand the complexities of the underlying Lucene APIs



-Hoss

