Re: HitCollector or Hits

Carlos Pita Thu, 24 May 2007 09:50:31 -0700

Hi Erick,

I don't think that FieldSelector would be that valuable in my case because I
just need to access a few fields, and those are all fields that are in fact
stored (and indexed too). I was thinking of keeping this extra information
in memory, precisely into an array mapping doc ids to the data structure. I
see that this is done for ScoreDocComparator in a Lucene in Action example.
I'm still not sure how to achieve something similar with a HitCollector. I
mean, I could instantiate a maxDoc() size array and index it by the document
ids that are passed to the collector. But that said, I don't know how to
keep this array synchronized with the index. I've opened a new thread for
this subject, "maxDoc and arrays".


Thank you again.
Cheers,
Carlos

On 5/24/07, Erick Erickson <[EMAIL PROTECTED]> wrote:

You're on the right track. But that said, access to anything that's
indexed (stored or not) should be pretty quick. Things
stored, but not indexed, are costlier. This might drive your
decision on what to index .vs. store.....

Loading the document is anything like IndexReader.document(), or
Hits.doc().

Part of the difference is that if you load the document, you get
all the fields, whether you need them or not.

Also, you can use your own TermEnum/TermDocs lookup for
this kind of thing if the terms you're interested in are indexed...

I wrote a mail some time ago that detailed my experience, in my
situation with my peculiar data set that you may want to read,
see...

Lucene 2.1, using FieldSelector speeds up my app by a factor of 10+,

As I mentioned in that message, I suspect that my improvement was
*highly* dependent upon how the index is structured.....

All that said, your notion of benchmarking is a very good one. It lead
me to using FieldSelector in the first place...

Best
Erick

On 5/24/07, Carlos Pita <[EMAIL PROTECTED] > wrote:
>
> Hi Erick,
>
> thank you for your prompt answer. What do you mean by loading the
> document?
> Accessing one of the stored fields? In that case I'm afraid I would need

> to
> do it. For example, in the aforementioned case of a result of products,
I
> have to look at any product store_id, which is stored along the
document.
> Is
> this a performance killer? Maybe I should keep some tables in memory,
for
> example an array mapping from id to store_id in O(1). I will do some
> benchmarking before anyway.
>
> Cheers,
> Carlos
>
> On 5/24/07, Erick Erickson < [EMAIL PROTECTED]> wrote:
> >
> > I know of no way to alter the Hits behavior, I recommend using
> > a TopDocs/TopDocCollector.
> >
> > But be aware that if you load the document for each one, you may incur

> > a significant penalty, although the lazy-loading helped me a lot, see
> > FieldSelector.....
> >
> > On 5/23/07, Carlos Pita <[EMAIL PROTECTED] > wrote:
> > >
> > > Hi folks,
> > >
> > > I need to collect some global information from my first 1000 search
> > > results
> > > in order to build up some search refining components containing only

> > > relevant values (those which correspond to at least one of the first
> > 1000
> > > hits). For example, the results are products and there is a store
> filter
> > > component that shows only the stores that sells a product between
the
> > > first
> > > 1000 hits. So even if the user sees just the first 20, I would have
to
> > > inspect the first 1000. I've read that Hits mantains a cache of
about
> > 100
> > > or
> > > 200 hits. Is this configurable? If I could set this cache to 1000 I
> > would
> > > then use Hits to browse the search results. Another way, I should
use
> > > HitCollector. What's your advice?
> > >
> > > TIA
> > > Cheers,
> > > Carlos
> > >
> >
>

Re: HitCollector or Hits

Reply via email to