Any other opinions on how smart/stupid it is to use the Nutch crawler/fetcher exclusively without the indexer/deduper etc??? I.e., is it worth the trouble? I just need a solid crawler to pull down html pages.
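For concreteness, here's roughly the fetch-only loop I'm picturing -- drive the standard bin/nutch commands from a script and just skip the index/dedup/merge steps. Untested sketch: the crawl/ and urls/ paths, depth, and topN are placeholders, and I'm assuming the 0.8/0.9-style command names.

#!/usr/bin/env python
# Rough fetch-only driver for Nutch: inject seeds, then loop
# generate / fetch / updatedb until DEPTH is reached.  No indexing,
# no dedup, no merging -- the goal is just fetched segments on disk.
import os
import subprocess
import sys

NUTCH = "bin/nutch"          # assumes the script runs from the Nutch install dir
CRAWL_DB = "crawl/crawldb"   # placeholder paths -- adjust to taste
SEGMENTS = "crawl/segments"
SEEDS = "urls"               # directory holding the seed URL file(s)
DEPTH = 5                    # number of generate/fetch/updatedb rounds
TOP_N = "50000"              # cap on URLs generated per round

def run(*args):
    cmd = [NUTCH] + list(args)
    print(">>> " + " ".join(cmd))
    if subprocess.call(cmd) != 0:
        sys.exit("command failed: " + " ".join(cmd))

def latest_segment():
    # generate names segment dirs by timestamp, so the newest sorts last
    return os.path.join(SEGMENTS, sorted(os.listdir(SEGMENTS))[-1])

run("inject", CRAWL_DB, SEEDS)
for i in range(DEPTH):
    run("generate", CRAWL_DB, SEGMENTS, "-topN", TOP_N)
    segment = latest_segment()
    run("fetch", segment)
    run("updatedb", CRAWL_DB, segment)
    # and that's it for this round -- no indexing, no dedup, no merge

(The -topN cap at least gives a per-round limit on pages, which is the sort of control I couldn't find in wget.)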
I spent some time w/ wget today, including hacking in some missing features (apparently there hasn't been a maintainer in a while) -- it seems pretty legit for mirroring. However, unlike the Nutch crawler, there's no javascript link "extraction" (I know, I know, it's just a regex). There's also no way to say "only grab 50,000 pages max" -- the only control is depth level (although I'm sure I could hack that in as well). It's also missing any logic to not go down a recursive html trap. Apparently Nutch and wget can both do frames and cookies as well ... a tie there I guess.

If anyone wants to chime in ... what would you use? The Nutch crawler hacked up a bit, wget, or ... something else? Again, I've got about 3500 domains to crawl, and each one is a large/dynamic site.

Thanks all,
John

On 4/25/07, John Kleven <[EMAIL PROTECTED]> wrote:
> Interesting idea, a few negatives:
>
> 1) Have to roll your own "only hit one domain at a time" (i.e.,
> politeness) into it
> 2) No pdf/word file parsing
> 3) Support for browser/spider traps? i.e., recursive loops?
> 4) Scalability on 3000+ large domains? We're talking millions of URLs here.
> 5) No js link extraction (although I'm not sure how solid that really
> is on Nutch anyways)
>
> Positives are that wget is obviously simple ... I just assumed that the
> Nutch fetcher would be more advanced. Am I mistaken?
>
> I'm assuming that Nutch can do cookies and frames as well?
>
> Thanks,
> John
>
> On 4/25/07, Briggs <[EMAIL PROTECTED]> wrote:
> > If you are just looking to have a seed list of domains, and would like
> > to mirror their content for indexing, why not just use the unix tool
> > 'wget'? It will mirror the site on your system and then you can just
> > index that.
> >
> > On 4/25/07, John Kleven <[EMAIL PROTECTED]> wrote:
> > > Hello,
> > >
> > > I am hoping to crawl about 3000 domains using the Nutch crawler +
> > > PrefixURLFilter; however, I have no need to actually index the html.
> > > Ideally, I would just like each domain's raw html pages saved into
> > > separate directories. We already have a parser that converts the HTML
> > > into indexes for our particular application.
> > >
> > > Is there a clean way to accomplish this?
> > >
> > > My current idea is to create a python script (similar to the one already
> > > on the wiki) that essentially loops through the fetch/update cycles until
> > > depth is reached, and then simply never does the real lucene indexing
> > > and merging. Now, here's the "there must be a better way" part ...
> > > I would then execute the "bin/nutch readseg -dump" tool via python to
> > > extract all the html and headers (for each segment) and then, via a regex,
> > > save each html output back into an html file, and store it in a directory
> > > according to the domain it came from.
> > >
> > > How stupid/slow is this? Any better ideas? I saw someone previously
> > > mention something like what I want to do, and someone responded that it
> > > was better to just roll your own crawler or something? I doubt that for
> > > some reason. Also, in the future we'd like to take advantage of the
> > > word/pdf downloading/parsing as well.
> > >
> > > Thanks for what appears to be a great crawler!
> > >
> > > Sincerely,
> > > John
> >
> > --
> > "Conscious decisions by conscious minds are what make reality real"
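P.S. In case it helps the discussion, here is roughly what the dump-and-split step from my original mail would look like in python. Untested sketch: the readseg -no* switches and the Recno::/URL::/Content: record markers are written from memory, so check them against a real dump first; the paths match the placeholders from the crawl loop above.

#!/usr/bin/env python
# Rough dump-and-split sketch: run "bin/nutch readseg -dump" on each
# segment, then carve the text dump into one html file per page,
# grouped into a directory per domain.  Python 2-era code.
import os
import re
import subprocess
import sys
from hashlib import md5
from urlparse import urlparse   # urllib.parse on Python 3

SEGMENTS = "crawl/segments"     # same placeholder layout as the crawl loop
DUMP_ROOT = "dumps"             # where the readseg text dumps land
OUT_ROOT = "html"               # one subdirectory per domain goes here

def dump_segment(seg_name):
    seg = os.path.join(SEGMENTS, seg_name)
    out = os.path.join(DUMP_ROOT, seg_name)
    # keep the fetched content + headers, skip the parse/generate data
    cmd = ["bin/nutch", "readseg", "-dump", seg, out,
           "-nogenerate", "-noparse", "-noparsedata", "-noparsetext"]
    if subprocess.call(cmd) != 0:
        sys.exit("readseg failed for " + seg)
    # readseg should leave a plain-text file named "dump" in the output dir
    return os.path.join(out, "dump")

def split_dump(dump_file):
    text = open(dump_file).read()
    # one record per fetched URL; "Recno::" starts each record (assumption)
    for record in re.split(r"(?m)^Recno:: \d+\s*$", text)[1:]:
        m = re.search(r"(?m)^URL:: (\S+)", record)
        if not m:
            continue
        # everything after the "Content:" line is the raw page (assumption)
        parts = record.split("Content:\n", 1)
        if len(parts) == 2:
            save(m.group(1), parts[1])

def save(url, html):
    domain = urlparse(url).netloc or "unknown"
    dom_dir = os.path.join(OUT_ROOT, domain)
    if not os.path.isdir(dom_dir):
        os.makedirs(dom_dir)
    # hash the URL to get a safe, unique filename
    name = md5(url).hexdigest() + ".html"
    open(os.path.join(dom_dir, name), "w").write(html)

for seg_name in sorted(os.listdir(SEGMENTS)):
    split_dump(dump_segment(seg_name))

Hashing the URL for the filename is just to avoid fighting with characters that aren't filesystem-safe; keeping an index file that maps hash -> URL per domain would make it friendlier for our downstream parser.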
