On Wed, 13 Nov 2002 04:01, Clemens Marschner wrote:
> > Do you "serve" the index in any way shape or form or is that done via
> > some other server or just put on a filesystem or whatever?
>
> A Lucene index is put into an abstract "Directory" which can be a
> RAMDirectory or a directory in a file system. Querying the index is not a
> big problem (it's easier than, say, querying a JDBC connection); the client
> directly accesses the index (which is thread safe). There exist some
> applications for that, e.g. a servlet that does querying, and from what I
> know Peter Donald 

I do ? You sure you got the right person there ? :)

> wants to work on that part; most of the work will go to
> the indexing part, though

> > There is the job server part
> > * schedules jobs to happen at certain times
> > * schedules jobs to happen in response to certain events
>
> exactly. A scheduler might be pretty dumb (indexing the whole server at
> 00:00 AM) or more intelligent (being notified of changes in the file system
> and reindexing them when the system is idle).

kool. There's a few people who have written that sort of thing. I will see if I 
can poke em to put it up somewhere. If not there is always the scheduler 
stuff in cornerstone that could be used as the base of this (it was 
originally written for precisely that).
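For the "dumb" case you don't even need cornerstone - something like this does the 
fixed-interval reindex with the plain JDK Timer (the reindex Runnable is made up 
here, just a placeholder for whatever the indexer exposes):

```java
import java.util.Timer;
import java.util.TimerTask;

// Minimal sketch of a dumb scheduler: kick off a full reindex on a fixed
// period. The task runs on the Timer's own background thread.
public class DumbScheduler {
    private final Timer timer = new Timer(true); // daemon thread

    /** Run the given reindex job after delayMs, then every periodMs. */
    public void scheduleReindex(final Runnable reindex,
                                long delayMs, long periodMs) {
        timer.schedule(new TimerTask() {
            public void run() {
                reindex.run();
            }
        }, delayMs, periodMs);
    }

    /** Cancel all scheduled jobs. */
    public void stop() {
        timer.cancel();
    }
}
```

The "intelligent" variant would replace the fixed period with whatever change 
notification the file system gives you, firing the same Runnable.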

> Since indexing can be compared with an assembly line, where you get some
> source documents at the start, do some processing, and save the results at
> some drain, one can think that this pipeline can well be made of one piece,
> or be separated into two or more pieces that are connected through "drains"
> and "sources" consisting of some IPC mechanism.

Yep. You may want to check out the event stuff in excalibur and the silk stuff 
in cornerstone that was designed for event-based programming. However it has 
not been officially released yet and I am not sure of its status. Ask on the 
dev list and hopefully someone can tell you how ready it is for widespread 
use.
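The source/drain idea you describe is basically a stage reading from one queue 
and writing to another. A rough sketch with the JDK's BlockingQueue (the 
process() body is illustrative - not anything from excalibur or silk):

```java
import java.util.concurrent.BlockingQueue;

// One "piece" of the pipeline: take documents from the source queue,
// process them, push results to the drain queue. Chaining stages through
// queues gives you exactly the source/drain decoupling - swap a local
// queue for an IPC-backed one and the stage code doesn't change.
public class Stage implements Runnable {
    private final BlockingQueue<String> source;
    private final BlockingQueue<String> drain;

    public Stage(BlockingQueue<String> source, BlockingQueue<String> drain) {
        this.source = source;
        this.drain = drain;
    }

    /** Hypothetical per-document work, e.g. link or text extraction. */
    protected String process(String doc) {
        return doc.toUpperCase();
    }

    public void run() {
        try {
            while (true) {
                drain.put(process(source.take()));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // shut the stage down
        }
    }
}
```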

> If you have a large web crawler it is likely that a crawler gets 100
> docs/second or more, and you end up with about 1500 extracted URLs and
> 1-2MB of documents per second. If you have a multi-threaded or
> multi-process system it means synchronization becomes an issue, and it is
> likely that you have to have several queues between the parts and exchange
> data in batch mode (e.g. every couple of seconds).

You may want to have a look at Matt Welsh's SEDA research at 

http://www.cs.berkeley.edu/~mdw/proj/sandstorm/

And then look at his Sandstorm work. He actually found that event based 
systems have better performance characteristics in highly concurrent 
scenarios and degrade more gracefully under load. 

QoS may not be an issue for much besides the retriever though, so not sure ;)

>
> We thought about using local queues for communication between threads, and
> JMS queues for communication between processes.
> In the end we want to provide different configurations for different needs:
> One file indexer that indexes a file system, an intranet crawler, or a
> large web crawler. We could also think of a database indexer that provides
> full-text indices for database tables.

kool.

-- 
Cheers,

Peter Donald
----------------------------------------
"Liberty means responsibility. That is 
      why most men dread it." - Locke
---------------------------------------- 


--
To unsubscribe, e-mail:   <mailto:avalon-users-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:avalon-users-help@jakarta.apache.org>
