On Fri Ted Dunning <[email protected]> wrote:
> The question that I don't see addressed is whether you choose to use
> a fully streaming approach as is done in Bixo or whether you will use
> a document repository approach as is more common in most search
> engines.

I guess that even with a streaming approach, a repository for intermediate results is necessary to decouple the stages that are expensive and hard to reproduce. For example, crawling into HBase and reading the results from there for further processing should keep a failure in post-processing from forcing a rerun of the crawl. Most likely there are more such points further down the processing chain as well.

Isabel
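
PS: To make the decoupling idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: a JSON file stands in for HBase, and the names crawl, post_process, demo and fake_fetch are hypothetical, not taken from Bixo or any other project. The point is that the expensive stage checkpoints its output, so a failure in the cheap stage can be retried without fetching anything again.

```python
import json
import tempfile
from pathlib import Path


def crawl(urls, fetch, store_path):
    """Expensive stage: fetch each URL once and persist the raw result.

    The JSON file is a stand-in for a real repository such as HBase.
    """
    store_path = Path(store_path)
    store = json.loads(store_path.read_text()) if store_path.exists() else {}
    for url in urls:
        if url in store:
            # Already crawled in an earlier run, so skip the expensive fetch.
            continue
        store[url] = fetch(url)
        # Checkpoint after every fetch so a crash loses at most one page.
        store_path.write_text(json.dumps(store))
    return store


def post_process(store_path, transform):
    """Cheap stage: reads only from the store and never triggers a crawl."""
    store = json.loads(Path(store_path).read_text())
    return {url: transform(body) for url, body in store.items()}


def demo():
    calls = []  # records every real fetch, to show that none is repeated

    def fake_fetch(url):
        # Hypothetical fetcher used instead of real network access.
        calls.append(url)
        return "content of " + url

    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "crawl_store.json"
        crawl(["a", "b"], fake_fetch, path)
        try:
            # Simulate a bug in post-processing.
            post_process(path, lambda body: 1 / 0)
        except ZeroDivisionError:
            pass
        # Rerun the whole pipeline: the crawl finds everything in the store.
        crawl(["a", "b"], fake_fetch, path)
        result = post_process(path, str.upper)
    return calls, result
```

Calling demo() runs the pipeline twice around a simulated post-processing failure; the calls list it returns records how many real fetches happened. The same pattern would apply at any later stage whose output is costly to recompute.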
