Re: Basic code organization questions + scheduling

2008-09-08 Thread Chris K Wensel
If you wrote a simple URL fetcher function for Cascading, you would have a very powerful web crawler that would dwarf Nutch in flexibility. That said, Nutch is optimized for storage, has supporting tools and ranking algorithms, and has been up against some nasty HTML and other document types

Re: Basic code organization questions + scheduling

2008-09-08 Thread tarjei
Hi Alex (and others).

> You should take a look at Nutch. It's a search engine built on Lucene,
> though it can be set up on top of Hadoop. Take a look:

This didn't help me much. Although the description I gave of the basic flow of the app seems to be

Re: Basic code organization questions + scheduling

2008-09-07 Thread Alex Loddengaard
Hi Tarjei,

You should take a look at Nutch. It's a search engine built on Lucene, though it can be set up on top of Hadoop. Take a look: -and-

Hope this helps!
Alex

On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse

Basic code organization questions + scheduling

2008-09-07 Thread Tarjei Huse
Hi,

I'm planning to use Hadoop for a set of typical crawler/indexer tasks. The basic flow is:

input: array of urls
actions:
  1. get pages
  2. extract new urls from pages -> start new job
     extract text -> index / filter (as n
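The loop Tarjei describes (fetch a frontier of URLs, extract new URLs, feed them into the next job) can be sketched in plain Python. This is only a hypothetical illustration of the job-chaining shape, not Hadoop code: `fetch` is an assumed stand-in for an HTTP GET, and the regex link extraction is deliberately naive.

```python
import re
from typing import Callable, Dict, Iterable, Set, Tuple

LINK_RE = re.compile(r'href="([^"]+)"')  # naive link extraction, for illustration

def crawl_round(frontier: Set[str],
                fetch: Callable[[str], str]) -> Tuple[Dict[str, str], Set[str]]:
    """One 'job': fetch every URL in the frontier, return (pages, new urls)."""
    pages: Dict[str, str] = {}
    new_urls: Set[str] = set()
    for url in frontier:
        html = fetch(url)              # stand-in for the real page fetch
        pages[url] = html
        new_urls.update(LINK_RE.findall(html))
    return pages, new_urls - frontier

def crawl(seeds: Iterable[str],
          fetch: Callable[[str], str],
          rounds: int = 3) -> Dict[str, str]:
    """Chain rounds: step 2's output URLs seed step 1 of the next round."""
    seen: Set[str] = set(seeds)
    frontier: Set[str] = set(seeds)
    corpus: Dict[str, str] = {}
    for _ in range(rounds):
        pages, new_urls = crawl_round(frontier, fetch)
        corpus.update(pages)
        frontier = new_urls - seen     # only crawl URLs we haven't seen
        seen |= frontier
        if not frontier:               # nothing new: stop early
            break
    return corpus
```

In a real Hadoop setup each `crawl_round` would be a MapReduce job, with the extracted URLs written out as the input of the next job; the sketch just makes the data flow concrete.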