Re: index architectures

Erick Erickson Wed, 18 Oct 2006 06:24:14 -0700

No, you've got that right. But there's something I think you might be able
to try. Fair warning, I'm remembering things I've read on this list and my
memory isn't what it used to be <G>....


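I *think* that if you reduce your result set by, say, a filter, you might
drastically reduce what gets sorted. I'm thinking of something like this
(the "date" field name, lower, upper and userQuery are placeholders):

BooleanQuery bq = new BooleanQuery();
// Filter for the last N days, wrapped in a ConstantScoreQuery
Filter recent = new RangeFilter("date", lower, upper, true, true);
bq.add(new ConstantScoreQuery(recent), BooleanClause.Occur.MUST);
// all the rest of your stuff
bq.add(userQuery, BooleanClause.Occur.MUST);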

RangeFilter might work for you here.

Even if this works, you'll still have to deal with making the range big
enough to do what you want. Perhaps an iterative approach, say the first
time you run the query and you don't get your 25 (or whatever) results,
increase the range and try again.
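
A rough sketch of that widen-and-retry loop, in plain Java with no Lucene
calls in it. Everything here is hypothetical: WideningSearch, the
searchLastDays callback (which would wrap whatever filtered query you end up
with and return hits newest first), and the doubling step are assumptions
for illustration, not anything Lucene provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper, not part of Lucene: widens a "last N days" window
// until the search returns enough hits or the window reaches a ceiling.
class WideningSearch {

    // searchLastDays is assumed to run the filtered, sorted query for the
    // given window and return its hits, newest first.
    static <T> List<T> search(IntFunction<List<T>> searchLastDays,
                              int wanted, int startDays, int maxDays) {
        if (startDays < 1) {
            throw new IllegalArgumentException("startDays must be >= 1");
        }
        int days = startDays;
        while (true) {
            List<T> hits = searchLastDays.apply(days);
            // Stop when we have enough, or when the window cannot grow further.
            if (hits.size() >= wanted || days >= maxDays) {
                int n = Math.min(wanted, hits.size());
                return new ArrayList<>(hits.subList(0, n));
            }
            // Doubling is an arbitrary choice; any growth schedule would do.
            days = Math.min(days * 2, maxDays);
        }
    }
}
```

The cost is that a rare query may run several times before the window is
wide enough, so the ceiling (maxDays here) probably wants to be the age of
your oldest document.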

Again, I'm not entirely sure when the filter gets applied, before or after
the sort. Nor am I sure how to tell. I'd sure like you to do the work and
tell me how <G>.... I *am* sure that this has been discussed in this mailing
list, so a search there might settle this....

C'mon Chris, Erik and Yonik, can't you recognize a plea for help when you
read it<G>?

Although here's yet another thing that flitted through my mind. Is date
really the same as doc ID order? And would you be able to sort on DocID
instead? And would it matter <G>? If you're adding your documents as they
come in, this might work. Doc IDs change, but I *believe* if doc A is added
after doc B, the doc ID for A will always be greater than the doc ID for B,
although neither of them is guaranteed to be the same between index
optimizations. Again, not sure if this helps at all.....

Good luck!
Erick

On 10/18/06, Paul Waite <[EMAIL PROTECTED]> wrote:

Many thanks to Erik and Ollie for responding - a lot of ideas and I'll have
my work cut out grokking them properly and thinking about what to do.
I'll respond further as that develops.

One quick thing though - Erik wrote:

> So, I wonder if your out of memory issue is really related to the number
> of requests you're servicing. But only you will be able to figure that
> out <G>. These problems are...er...unpleasant to track down...

Indeed!

> I guess I wonder a bit about what large result sets is all about. That
> is, do your users really care about results 100-10,000 or do they just
> want to page through them on demand?

No they don't want that. They just want a small number. What happens is
they enter some silly query, like searching for all stories with a single
common non-stop-word in them, and with the usual sort criterion of by date
(ie. a field) descending, and a limit of, say 25.

So Lucene then presumably has to haul out a massive resultset, sort it, and
return the top 25 (out of 500,000 or whatever).

Isn't that how it goes? Or am I missing something horribly obvious.

Cheers,
Paul.
