> Do you "serve" the index in any way shape or form or is that done via some
> other server or just put on a filesystem or whatever?
A Lucene index is put into an abstract "Directory", which can be a RAMDirectory or a directory in a file system. Querying the index is not a big problem (it's easier than, say, querying a JDBC connection); the client directly accesses the index (which is thread-safe). Some applications for that already exist, e.g. a servlet that does the querying, and from what I know Peter Donald wants to work on that part. Most of the work will go into the indexing part, though.

> There is the job server part
> * schedules jobs to happen at certain times
> * schedules jobs to happen in response to certain events

Exactly. A scheduler might be pretty dumb (indexing the whole server at 0:00) or more intelligent (being notified of changes in the file system and reindexing the changed files when the system is idle).

> There is retrieval component
> * retrieves documents
> * caches documents?
> * converts documents?

Sure, if you save XML documents in the index, conversion steps may be necessary. But conversion will mostly be done at indexing time. Lucene stores documents in "Fields", which consist of "Tokens". These form the inverted index that is queried. The tokenization process defines which parts of a document form tokens (e.g. words), which field they go into (e.g. "body" or "author"), and may involve language-specific processing (e.g. "stemming").

> And this all feeds into an indexer of some sort. That may or may not be
> exported via a webservice or something?

That's a possible extension, but it is currently not at the center of our efforts. If you want to serve millions of documents, the interfaces must be _very_ efficient.

Let me give some more preliminary information on what we have thought about: we want the crawler to scale up to millions of servers. That is, the crawler itself must be distributable in different setups (say 3 crawlers and 2 indexers).
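To make the Fields/Tokens idea above concrete, here is a minimal sketch of an inverted index in plain Java. This is *not* Lucene's actual implementation or API, just an illustration of the principle: tokenized field values map each token to the set of documents containing it, and a query is then a cheap lookup in that map.

```java
import java.util.*;

// Minimal inverted-index sketch (not Lucene's real implementation):
// each token maps to the ids of the documents whose field contained it.
public class InvertedIndexSketch {
    // token -> document ids ("postings")
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenize a field value into lower-cased words and index them.
    public void addDocument(int docId, String fieldValue) {
        for (String token : fieldValue.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Querying is just a lookup in the postings map -- which is why
    // reading an index is so much cheaper than building one.
    public Set<Integer> search(String token) {
        return postings.getOrDefault(token.toLowerCase(),
                                     Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "Avalon is a server framework");
        index.addDocument(2, "Lucene is a search engine");
        System.out.println(index.search("is"));     // prints [1, 2]
        System.out.println(index.search("lucene")); // prints [2]
    }
}
```

Language-specific steps such as stemming would slot into the tokenization loop, before the token is added to the postings map.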
But it should also be able to run within one server that does crawling and indexing at the same time, say for an intranet with some 100,000 documents.

Since indexing can be compared to an assembly line, where you get some source documents at the start, do some processing, and save the results at some drain, this pipeline can well be made of one piece, or be separated into two or more pieces that are connected through "drains" and "sources" built on some IPC mechanism.

In the crawler you actually have two pipelines: one for URLs and one for the crawled documents. The URLs are put into the pipeline, where they undergo several filter steps (e.g. a URL that was already crawled is thrown away by a "URLSeenFilter"). Once the documents are crawled, they are put into the indexing pipeline, where they may be transformed (say, from HTML or PDF) into an internal format, and then indexed.

What the processing pipelines look like should be configurable by the user. My first thought was that each processing step will become a component that can be configured through Avalon's configuration mechanism. I just don't know if that's right, because in Phoenix this is up to the "application assembler", while in the crawler this may well be a user's task. Components may depend on global services, like a scheduler or a global host manager.

If you have a large web crawler, it is likely that each crawler gets 100 docs/second or more, so you end up with about 1500 extracted URLs and 1-2 MB of documents per second. In a multi-threaded or multi-process system, synchronization then becomes an issue, and you will likely need several queues between the parts and have to exchange data in batch mode (e.g. every couple of seconds). We thought about using local queues for communication between threads, and JMS queues for communication between processes.
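The URL pipeline described above might look roughly like the following sketch. All names here are hypothetical except URLSeenFilter, which the text mentions; the chain-of-filters shape and the HostFilter are my assumption of how such configurable steps could fit together.

```java
import java.util.*;

// Hypothetical sketch of the URL pipeline: each URL passes through a
// chain of filter steps; a filter either passes the URL on or drops it
// by returning null (e.g. the URLSeenFilter drops already-seen URLs).
public class UrlPipelineSketch {
    interface UrlFilter {
        String filter(String url); // null = drop this URL
    }

    // Drops URLs that were already put into the pipeline before.
    static class URLSeenFilter implements UrlFilter {
        private final Set<String> seen = new HashSet<>();
        public String filter(String url) {
            return seen.add(url) ? url : null;
        }
    }

    // Hypothetical step: restrict crawling to one host (intranet case).
    static class HostFilter implements UrlFilter {
        private final String allowedHost;
        HostFilter(String allowedHost) { this.allowedHost = allowedHost; }
        public String filter(String url) {
            return url.contains(allowedHost) ? url : null;
        }
    }

    // Run every URL through the configured chain of filter steps.
    static List<String> run(List<String> urls, List<UrlFilter> filters) {
        List<String> out = new ArrayList<>();
        nextUrl:
        for (String url : urls) {
            for (UrlFilter f : filters) {
                url = f.filter(url);
                if (url == null) continue nextUrl;
            }
            out.add(url);
        }
        return out;
    }

    public static void main(String[] args) {
        List<UrlFilter> pipeline = List.of(
            new HostFilter("intranet.example.org"),
            new URLSeenFilter());
        List<String> in = List.of(
            "http://intranet.example.org/a",
            "http://intranet.example.org/a",  // duplicate, dropped
            "http://www.example.com/b");      // foreign host, dropped
        System.out.println(run(in, pipeline));
        // prints [http://intranet.example.org/a]
    }
}
```

In a distributed setup the `run` loop would not hold the URLs in a local list; the handoff between pipeline pieces would go through the local or JMS queues mentioned above, with URLs batched every couple of seconds rather than passed one at a time.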
In the end we want to provide different configurations for different needs: a file indexer that indexes a file system, an intranet crawler, or a large web crawler. We could also think of a database indexer that provides full-text indices for database tables.

--Clemens

--
To unsubscribe, e-mail: <mailto:avalon-users-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:avalon-users-help@jakarta.apache.org>
