[Nutch-general] Recrawl not following crawl-urlfilter.txt

Steve Kallestad Thu, 08 Feb 2007 01:18:33 -0800

Please oh please, don't shoot me for being a newbie.

I have set up a site search using nutch, and I have the
crawl-urlfilter.txt file configured so that everything works properly
when I call something similar to:


bin/nutch crawl urls -dir crawl -depth 3 -topN 100


I grabbed the Intranet Recrawl script from
http://wiki.apache.org/nutch/IntranetRecrawl

I noticed while it was running that nutch was actually grabbing files I
didn't want it to grab, and it was also going off site to get others.
Obviously I don't want it to do that.

On my site, unless I change the crawl-urlfilter.txt file, nutch tries
to fetch some nonexistent files, probably because of some JavaScript
that I have, so I really need my re-crawl to follow my original
guidelines.
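
To make clear what I mean by "guidelines": as far as I understand it, the
filter file is a list of regular expressions prefixed with + (accept) or
- (reject), checked top to bottom, where the first matching rule decides
and a URL that matches nothing is dropped. Below is a minimal sketch of
that behaviour in Python. It is only an illustration, not nutch's own
code; the rules and the www.example.com host are placeholders rather
than my real configuration, and Python's regex dialect differs slightly
from Java's.

```python
import re


def load_rules(text):
    """Parse filter rules: '+regex' accepts, '-regex' rejects.

    Blank lines and lines starting with '#' are skipped.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule that matches anywhere in the URL.

    A URL that matches no rule at all is rejected.
    """
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Hypothetical rules for a single-site crawl (placeholder host name).
EXAMPLE_RULES = r"""
# skip non-http schemes
-^(file|ftp|mailto):
# skip URLs containing characters that usually indicate queries
-[?*!@=]
# stay on the one host
+^http://www\.example\.com/
# reject everything else
-.
"""

if __name__ == "__main__":
    rules = load_rules(EXAMPLE_RULES)
    for url in ("http://www.example.com/about.html",
                "http://other.example.org/"):
        print(url, accepts(rules, url))
```

With rules like these, an off-site URL never matches the + line and falls
through to the final "-." rule, which is the behaviour I want the
re-crawl to keep.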

My question is - how can I modify the IntranetRecrawl script so that it
follows crawl-urlfilter.txt, or, barring that, where can I find a
documented list of steps to recrawl my site?


Thanks,
Steve

My nutch is at:
http://www.stevekallestad.com/search/
in case anybody wanted to check it out.  I have the directory proxied
through apache which I thought was pretty cool.
