I'd suggest "wget" for spidering sites. It can be told to ignore robots.txt files. It is good for mirroring sites which you suspect may be taken down. Windows and Unix versions are available.
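A minimal sketch of such a mirroring run (the target URL is a placeholder; `-e robots=off` is what makes wget ignore robots.txt):

```shell
# Mirror a site you expect may disappear, ignoring robots.txt.
# --mirror          : recursive download with timestamping, infinite depth
# -e robots=off     : do not honor robots.txt exclusions
# --page-requisites : also fetch images/CSS needed to render pages
# --convert-links   : rewrite links so the local copy browses offline
# --no-parent       : stay within the starting directory
wget --mirror -e robots=off --page-requisites --convert-links \
     --no-parent https://example.org/
```

The `--convert-links` pass runs after the download finishes, so the saved copy works from a local disk even if the original site goes away.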
- Major Variola (ret)