Any other opinions on how smart/stupid it is to use the Nutch
crawler/fetcher exclusively, without the indexer/deduper etc.?  I.e.,
is it worth the trouble?  I just need a solid crawler to pull down
HTML pages.

I spent some time with wget today, including hacking in some missing
features (apparently there hasn't been a maintainer in a while) - it
seems pretty legit for mirroring.  However, unlike the Nutch crawler,
there's no JavaScript link "extraction" (I know, I know, it's just a
regex).  There's also no way to say "only grab 50,000 pages max" --
the only control is the depth level (although I'm sure I could hack
that in as well).  It's also missing any logic to avoid falling into a
recursive HTML trap.

Apparently Nutch and wget can both do frames and cookies as well ... a
tie there I guess.

If anyone wants to chime in ... what would you use?  The Nutch crawler
hacked up a bit, wget, or ... something else?  Again, I've got about
3500 domains to crawl, each one a large, dynamic site.

Thanks all,
John

On 4/25/07, John Kleven <[EMAIL PROTECTED]> wrote:
> Interesting idea, but a few negatives:
>
> 1) You have to roll your own "only hit one domain at a time" logic
> (i.e., politeness) into it
> 2) No PDF/Word file parsing
> 3) What about browser/spider traps, i.e., recursive loops?
> 4) Scalability on 3000+ large domains?  We're talking millions of URLs here.
> 5) No JS link extraction (although I'm not sure how solid that really
> is in Nutch anyway)
>
> Positives: wget is obviously simple ... I just assumed that the Nutch
> fetcher would be more advanced.  Am I mistaken?
>
> I'm assuming that Nutch can do cookies and frames as well?
>
> Thanks,
> John
>
> On 4/25/07, Briggs <[EMAIL PROTECTED]> wrote:
> > If you are just looking to have a seed list of domains, and would like
> > to mirror their content for indexing, why not just use the Unix tool
> > 'wget'?  It will mirror the sites on your system and then you can just
> > index that.
> >
> >
> >
> >
> > On 4/25/07, John Kleven <[EMAIL PROTECTED]> wrote:
> > > Hello,
> > >
> > > I am hoping to crawl about 3000 domains using the Nutch crawler +
> > > PrefixURLFilter; however, I have no need to actually index the HTML.
> > > Ideally, I would just like each domain's raw HTML pages saved into
> > > separate directories.  We already have a parser that converts the HTML
> > > into indexes for our particular application.
> > >
> > > Is there a clean way to accomplish this?
> > >
> > > My current idea is to create a Python script (similar to the one
> > > already on the wiki) that essentially loops through the fetch/update
> > > cycles until the depth is reached, and then simply never does the real
> > > Lucene indexing and merging.  Now, here's the "there must be a better
> > > way" part ... I would then execute the "bin/nutch readseg -dump" tool
> > > via Python to extract all the HTML and headers (for each segment), and
> > > then, via a regex, save each HTML output back into an HTML file and
> > > store it in a directory according to the domain it came from.
> > >
> > > How stupid/slow is this?  Any better ideas?  I saw someone previously
> > > mention something like what I want to do, and someone responded that it
> > > was better to just roll your own crawler or something?  I doubt that for
> > > some reason.  Also, in the future we'd like to take advantage of the
> > > Word/PDF downloading/parsing as well.
> > >
> > > Thanks for what appears to be a great crawler!
> > >
> > > Sincerely,
> > > John
> > >
> >
> >
> > --
> > "Conscious decisions by conscious minds are what make reality real"
> >
>
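P.S. For anyone who finds this thread later, here's a rough Python
sketch of the loop described in my first message (quoted above):
inject, then generate/fetch/updatedb a few rounds, then
"bin/nutch readseg -dump" on each segment, then split the dumps per
domain.  The paths, the depth value, and especially the "Recno::" /
"URL::" record markers in the dump output are assumptions -- check them
against your own Nutch version's output before trusting the splitting
step.

#!/usr/bin/env python
# Rough sketch: crawl with Nutch but never run indexing/dedup/merge,
# then dump the raw fetched content (plus headers) out of each segment
# and file it by domain.  Paths, depth, and dump markers are assumptions.
import os
import re
import subprocess
from urllib.parse import urlparse

NUTCH = "bin/nutch"
CRAWLDB = "crawl/crawldb"
SEGMENTS = "crawl/segments"
DUMPS = "crawl/dumps"
HTML_ROOT = "html"
DEPTH = 3               # number of generate/fetch/updatedb rounds

def nutch(*args):
    subprocess.check_call([NUTCH] + list(args))

def latest_segment():
    # Segment directories are timestamp-named, so they sort lexically.
    return os.path.join(SEGMENTS, sorted(os.listdir(SEGMENTS))[-1])

# 1. Seed the crawldb from a directory of seed-URL files (one URL per line).
nutch("inject", CRAWLDB, "urls")

# 2. Fetch to the desired depth -- and simply never index, dedup, or merge.
for _ in range(DEPTH):
    nutch("generate", CRAWLDB, SEGMENTS)
    segment = latest_segment()
    nutch("fetch", segment)
    nutch("updatedb", CRAWLDB, segment)

# 3. Dump each segment's records (headers + raw content) to plain text.
for seg in sorted(os.listdir(SEGMENTS)):
    nutch("readseg", "-dump", os.path.join(SEGMENTS, seg),
          os.path.join(DUMPS, seg))

# 4. Split each dump into one file per record, grouped by domain.  The
#    "Recno::"/"URL::" markers are an assumption about the dump format.
for seg in sorted(os.listdir(DUMPS)):
    seg_dir = os.path.join(DUMPS, seg)
    for name in sorted(os.listdir(seg_dir)):   # usually a single "dump" file
        path = os.path.join(seg_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, errors="replace") as f:
            text = f.read()
        records = re.split(r"^Recno:: \d+\s*$", text, flags=re.M)[1:]
        for i, record in enumerate(records):
            m = re.search(r"^URL:: (\S+)", record, flags=re.M)
            if not m:
                continue
            domain = urlparse(m.group(1)).netloc
            out_dir = os.path.join(HTML_ROOT, domain)
            os.makedirs(out_dir, exist_ok=True)
            out_name = "%s-%s-%06d.txt" % (seg, name, i)
            with open(os.path.join(out_dir, out_name), "w",
                      errors="replace") as out:
                out.write(record)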
