I haven't really looked too much at the search code or done much work
with Lucene, but I would agree that it doesn't seem to make much sense
to try to update the index immediately every time someone posts a new
entry. I think a scheduled task that simply updates the entire index
periodically, maybe every 30-60 minutes, would be fine.
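For what it's worth, a scheduled rebuild along those lines could be sketched with JDK 5's java.util.concurrent (the class and method names here are made up, and the rebuild Runnable is a placeholder for re-running Roller's indexing over all entries):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal sketch of a periodic full-index rebuild. The rebuild task
// itself is a placeholder; the point is that only this one scheduler
// thread ever touches the index, on a fixed delay.
public class IndexRebuildScheduler {

    // Runs `rebuild` repeatedly, waiting `period` (in `unit`) between
    // the end of one run and the start of the next.
    public static ScheduledExecutorService start(Runnable rebuild,
                                                 long period, TimeUnit unit) {
        ScheduledExecutorService ses =
            Executors.newSingleThreadScheduledExecutor();
        ses.scheduleWithFixedDelay(rebuild, period, period, unit);
        return ses;
    }
}
```

At startup you'd call something like start(rebuildTask, 30, TimeUnit.MINUTES) and shut the executor down when the webapp is destroyed.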
-- Allen
Rudman Max wrote:
When you say "internal search engine" do you mean Lucene? I presume
by "external" you mean something like Google? At any rate, I think
the problem is either in Lucene or in that "edu.oswego" library you
are using to schedule index writes. I think what's happening is that
at a sustained high rate of posts, the threads wanting to write to the
Lucene index pile up and cause the problem. Sadly, I don't think the
concurrency level needed to expose this bug is very high at all. I've
been able to reproduce it with as few as 5 concurrent users making
posts.
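One way to keep those writer threads from piling up, sketched here with JDK 5's java.util.concurrent rather than the edu.oswego package (all names are hypothetical, and indexEntry stands in for the real Lucene write): funnel every write through a single-threaded executor, so request threads just drop off a task instead of contending for the index.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: at most one thread ever writes to the index; the rest
// return immediately after queueing a small task.
public class SerializedIndexWrites {
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private final AtomicInteger written = new AtomicInteger();

    // Called from request threads; cheap and non-blocking.
    public void submitWrite(final String entryId) {
        writer.submit(new Runnable() {
            public void run() {
                // Placeholder for the real Lucene write for entryId.
                written.incrementAndGet();
            }
        });
    }

    // Waits for queued writes to finish; returns how many were done.
    public int drain() throws InterruptedException {
        writer.shutdown();
        writer.awaitTermination(5, TimeUnit.SECONDS);
        return written.get();
    }
}
```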
My fear is that this is something fundamental to the way Lucene
manages its indexes, and I suspect it's not an easy fix. I mean, this
is the kind of problem that database vendors have to solve with
pretty sophisticated solutions. How receptive would you (and the
community) be to a proposal for changing real-time writes to some
sort of batched mode, where a periodically run process is responsible
for indexing any un-indexed entries?
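Concretely, the batched mode could be as simple as this sketch (the class and method names are made up, and indexEntry is a placeholder for the real Lucene write): posts only enqueue entry ids, and the periodic job drains the queue in one pass, so no request thread ever touches the index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed batched mode.
public class BatchIndexer {
    private final List<String> pending = new ArrayList<String>();

    // Called from the request thread when an entry is saved; cheap
    // apart from this short synchronized block.
    public synchronized void queue(String entryId) {
        pending.add(entryId);
    }

    // Called periodically by a scheduled task; indexes everything that
    // accumulated since the last run and returns how many were indexed.
    public synchronized int flush() {
        int count = pending.size();
        for (String id : pending) {
            indexEntry(id);  // would reuse one index writer for the batch
        }
        pending.clear();
        return count;
    }

    private void indexEntry(String entryId) {
        // Placeholder for the real Lucene write.
    }
}
```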
Max
On Jun 28, 2005, at 10:52 PM, Allen Gilliland wrote:
I'm not sure if we've ever isolated things down to exactly that
problem, but for blogs.sun.com we've definitely had a number of
problems with the built-in search engine. I believe a number of those
problems have been fixed, so if you aren't already using the latest
CVS, you could run your tests against the upcoming 1.2 release and
see what happens. Unfortunately we won't be the best help with search
problems, because we use an external search engine and currently have
Roller's built-in search disabled.
-- Allen
Rudman Max wrote:
We've been testing Roller under some pretty high load (about 500
concurrent users) running search transactions (which read from the
index) and post transactions (which write to it). After a while, we'd
run out of file handles, which froze Tomcat because it could no
longer open sockets to accept incoming connections. Our sysadmin told
me there were a bunch of orphaned file descriptors on files in the
/roller-index directory. I am not sure if that's the reason the
Tomcat process ran out of files, but it seems likely. Has anybody
ever experienced this problem with the Roller search index?
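Not Roller-specific, but one common source of orphaned descriptors on index files is a per-request reader or searcher that is opened and then never closed when the search throws. This toy sketch (the Handle class is made up) just counts "open handles" to show why close() has to sit in a finally block:

```java
// Toy model of the leak: every search opens a handle, and without the
// finally block every failed request would leak one descriptor.
public class HandleLeakDemo {
    static int openHandles = 0;

    static class Handle {
        Handle() { openHandles++; }
        void close() { openHandles--; }
    }

    // Closes the handle even when the search itself throws.
    static void searchSafely(boolean fail) {
        Handle h = new Handle();
        try {
            if (fail) throw new RuntimeException("search blew up");
        } catch (RuntimeException e) {
            // swallow for the demo; a real handler would log it
        } finally {
            h.close();  // runs on both the normal and the failure path
        }
    }
}
```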
Max