> Do you "serve" the index in any way shape or form or is that done via some
> other server or just put on a filesystem or whatever?

A Lucene index is put into an abstract "Directory", which can be a
RAMDirectory or a directory in a file system. Querying the index is not a
big problem (it's easier than, say, querying a JDBC connection); the client
accesses the index directly (which is thread safe). Some applications for
that already exist, e.g. a servlet that does the querying, and from what I
know Peter Donald wants to work on that part; most of the work will go into
the indexing part, though.
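To illustrate the idea of the "Directory" abstraction (the names below are illustrative, not the real Lucene API): the index sits behind an interface, so the same querying code works whether the index lives in RAM or on disk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the abstraction: clients query through an interface,
// and the backing store can be RAM or a file system directory.
interface IndexDirectory {
    void store(String docId, String content);
    List<String> search(String term); // ids of matching documents
}

// RAM-backed variant, analogous in spirit to Lucene's RAMDirectory.
class RamIndexDirectory implements IndexDirectory {
    private final Map<String, String> docs = new HashMap<>();

    public void store(String docId, String content) {
        docs.put(docId, content);
    }

    public List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            if (e.getValue().toLowerCase().contains(term.toLowerCase())) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

A file-system-backed implementation would plug in behind the same interface, which is why serving queries needs no separate server process.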

> There is the job server part
> * schedules jobs to happen at certain times
> * schedules jobs to happen in response to certain events

exactly. A scheduler might be pretty dumb (indexing the whole server at
midnight) or more intelligent (being notified of changes in the file system
and reindexing the changed documents when the system is idle).
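The "dumb" variant could be as small as a timer that fires a full re-index every night at midnight. A minimal sketch (IndexJob is a hypothetical placeholder for whatever the indexing component exposes):

```java
import java.util.Calendar;
import java.util.Date;
import java.util.Timer;
import java.util.TimerTask;

// Sketch of the dumb scheduler: run a full re-index nightly at 00:00.
public class NightlyScheduler {

    // Compute the next midnight after the given instant.
    static Date nextMidnight(Date now) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(now);
        cal.add(Calendar.DAY_OF_MONTH, 1);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        return cal.getTime();
    }

    public static void scheduleFullReindex(final Runnable indexJob) {
        Timer timer = new Timer(true); // daemon thread
        timer.schedule(new TimerTask() {
            public void run() { indexJob.run(); }
        }, nextMidnight(new Date()), 24L * 60 * 60 * 1000); // repeat daily
    }
}
```

The intelligent variant would instead subscribe to file system change notifications and queue only the changed documents.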

> There is retrieval component
> * retrieves documents
> * caches documents?
> * converts documents?

Sure, if you save XML documents in the index, conversion steps may be
necessary.
But conversion will mostly be done at indexing time. Lucene stores documents
in "Fields" which consist of "Tokens". These form the inverted index that is
queried. The tokenization process defines which parts of a document form
tokens (e.g. words), which field they go into (e.g. "body" or "author"),
and may involve language-specific processing (e.g. "stemming").
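As a rough sketch of that tokenization step (illustrative code, not Lucene's actual analyzer classes): text is split into tokens, each token is tagged with its field, and a crude suffix-stripping routine stands in for a real language-specific stemmer.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of tokenization: text -> (field, token) pairs.
public class Tokenizer {

    // Very naive English "stemmer", for illustration only.
    static String stem(String word) {
        if (word.endsWith("ing") && word.length() > 5)
            return word.substring(0, word.length() - 3);
        if (word.endsWith("s") && word.length() > 3)
            return word.substring(0, word.length() - 1);
        return word;
    }

    // Split text into words, lowercase and stem them, and tag
    // each token with the field it belongs to.
    static List<String[]> tokenize(String field, String text) {
        List<String[]> tokens = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                tokens.add(new String[] { field, stem(word) });
            }
        }
        return tokens;
    }
}
```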

> And this all feeds into an indexer of some sort. That may or may not be
> exported via a webservice or something?

That's a possible extension, but it is currently not at the center of our
efforts. If you want to serve millions of documents, the interfaces must be
_very_ efficient.

Let me provide some more preliminary information on what we have thought about:

We want the crawler to scale up to millions of servers. That is, the
crawler itself must be distributable across different setups (say 3 crawlers
and 2 indexers). But it should also be able to run within a single server
that does both crawling and indexing, say for an intranet with some 100,000
documents.

Indexing can be compared to an assembly line: you get some source documents
at the start, do some processing, and save the results at some drain. This
pipeline can either be built as one piece, or be separated into two or more
pieces that are connected through "drains" and "sources" consisting of some
IPC mechanism.
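The assembly-line idea could be sketched like this (interface and class names are made up for illustration, not actual project code): a pipeline is a chain of stages between a source and a drain, and each stage may transform or drop a document.

```java
import java.util.ArrayList;
import java.util.List;

// One step in the assembly line; returning null drops the document.
interface Stage {
    String process(String doc);
}

// Sketch of the pipeline: run each document through all stages.
class Pipeline {
    private final List<Stage> stages = new ArrayList<>();

    Pipeline add(Stage s) { stages.add(s); return this; }

    // Feed every document from the source through the stages and
    // collect the survivors at the drain.
    List<String> run(List<String> source) {
        List<String> drain = new ArrayList<>();
        for (String doc : source) {
            for (Stage s : stages) {
                if (doc == null) break;
                doc = s.process(doc);
            }
            if (doc != null) drain.add(doc);
        }
        return drain;
    }
}
```

Splitting the line into several processes then just means replacing the in-memory source and drain with an IPC-backed one; the stages themselves stay unchanged.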

In the crawler you actually have two pipelines: one for URLs and one for the
crawled documents. The URLs are put into the pipeline, where they undergo
several filter steps (e.g. a URL that was already crawled is thrown away by
a "URLSeenFilter"). Once the documents are crawled, they are put into the
indexing pipeline, where they may be transformed (say, from HTML or PDF)
into an internal format and then indexed.
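A URLSeenFilter of the kind mentioned above could look roughly like this (a minimal sketch, assuming the filter just remembers every URL it has let through):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of a filter step in the URL pipeline: drop any URL
// that has already been seen.
class URLSeenFilter {
    private final Set<String> seen = new HashSet<>();

    // Returns the URL if it is new, or null if it was seen
    // before and should be thrown away.
    String filter(String url) {
        return seen.add(url) ? url : null;
    }
}
```

In a real distributed crawler the "seen" set itself becomes interesting, since it has to be shared or partitioned across crawler instances.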

What the processing pipelines look like should be configurable by the
user. My first thought was that each processing step would become a
component that can be configured through Avalon's configuration mechanism. I
am just not sure that's right, because in Phoenix this is the job of the
"application assembler", while in the crawler it may well be a user's task.
Components may also depend on global services, like a scheduler or a global
host manager.
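A user-editable pipeline configuration might look something like the following Avalon-style XML fragment (all element and class names here are made up for illustration; this is not an actual project schema):

```xml
<crawler>
  <!-- URL pipeline: filter steps applied to extracted URLs -->
  <url-pipeline>
    <filter class="URLSeenFilter"/>
    <filter class="RobotsTxtFilter"/>
  </url-pipeline>
  <!-- document pipeline: conversion and indexing steps -->
  <document-pipeline>
    <step class="HtmlToTextConverter"/>
    <step class="PdfToTextConverter"/>
    <step class="LuceneIndexer"/>
  </document-pipeline>
</crawler>
```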

If you have a large web crawler, it is likely that a crawler fetches 100
docs/second or more, so you end up with about 1,500 extracted URLs and
1-2 MB of documents per second. In a multi-threaded or multi-process
system this means synchronization becomes an issue, and you will probably
have to put several queues between the parts and exchange data in batch
mode (e.g. every couple of seconds).
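The batch-mode idea can be sketched with a standard blocking queue (a minimal illustration, not project code): producers enqueue URLs one by one, and the consumer drains everything currently queued in a single operation, paying the synchronization cost once per batch instead of once per URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of batched hand-over between pipeline parts.
public class BatchQueue {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Producers add items individually (offer never fails on an
    // unbounded LinkedBlockingQueue).
    public void add(String url) {
        queue.offer(url);
    }

    // Consumer takes everything currently queued in one call,
    // e.g. every couple of seconds.
    public List<String> drainBatch() {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch);
        return batch;
    }
}
```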

We thought about using local queues for communication between threads, and
JMS queues for communication between processes.
In the end we want to provide different configurations for different needs:
a file indexer that indexes a file system, an intranet crawler, or a large
web crawler. We could also think of a database indexer that provides
full-text indices for database tables.

--Clemens

--
To unsubscribe, e-mail: <mailto:avalon-users-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:avalon-users-help@jakarta.apache.org>