I use R a great deal, but its considerable web-crawling power isn't something 
I've used. I don't want to reinvent a cyberwheel and I suspect someone has 
already done what I want: a program that would run once a day (easy for me to 
set up as a cron task), crawl a single root of a web site (mine) and record 
the file size and a CRC or similar check value for each page as pulled off the 
site (obviously, I'd want it not to follow off-site links). The other key 
requirement is that it stores the values and URLs and can be run either in 
"create/update database" mode or in "check pages" mode, with the check mode 
run emailing me a warning if a page changes.  The reason I want this is 
that two of my sites have recently had content "disappear": neither I nor the 
ISP can see what's happened, and we lack the very useful diagnostic of 
the date when the change happened, which might have mapped it to some 
component of WordPress, plugins or themes having updated.
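
As an illustration of the "check pages" step described above, here is a 
minimal base-R sketch. It is not a crawler: it assumes the page bodies have 
already been fetched as character strings, and it leaves out the emailing. 
The function names (page_fingerprint, snapshot, compare_snapshots) are 
hypothetical, and MD5 stands in for "a CRC or similar check value".

```r
# Hypothetical helpers; base R only (the tools package ships with R).

# Size in bytes and MD5 checksum of one page body held as a single string.
page_fingerprint <- function(body) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(charToRaw(body), tmp)   # write the exact bytes, nothing appended
  data.frame(size = file.size(tmp),
             md5  = unname(tools::md5sum(tmp)),
             stringsAsFactors = FALSE)
}

# One row per page: url, size, md5.
snapshot <- function(urls, bodies) {
  cbind(data.frame(url = urls, stringsAsFactors = FALSE),
        do.call(rbind, lapply(bodies, page_fingerprint)))
}

# Compare a stored snapshot with a fresh one and report which URLs
# changed, disappeared or are new.
compare_snapshots <- function(old, new) {
  both <- merge(old, new, by = "url", suffixes = c(".old", ".new"))
  list(changed = both$url[both$md5.old != both$md5.new],
       missing = setdiff(old$url, new$url),
       added   = setdiff(new$url, old$url))
}

# Made-up example data, standing in for two daily crawls.
old <- snapshot(c("/", "/about", "/tools"),
                c("home", "about me", "utilities"))
new <- snapshot(c("/", "/about", "/new"),
                c("home", "about me, edited", "fresh"))
res <- compare_snapshots(old, new)
print(res)
```

In "create/update database" mode the snapshot could be stored with 
saveRDS() and read back with readRDS() on the next run. Where cron is set up 
to mail job output, printing only when something has changed may be enough 
to act as the warning.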

I have failed to find anything of the kind, and all the services that offer 
site checking of this sort are prohibitively expensive for me (my sites earn 
no income and are either personal or offer free utilities and information).

If anyone has done this, or something similar, I'd love to hear whether you 
would be willing to share it.  Failing that, I think I will have to create 
this myself, but I know it will take me days as this isn't my area of R 
expertise and, to be brutally honest, I'm a pretty poor programmer.  If I go 
that way, I hope people can point me to things I can (legitimately) recycle 
in parts to help construct this.

Thanks in advance,

Chris

-- 
Chris Evans <ch...@psyctc.org> Skype: chris-psyctc
Visiting Professor, University of Sheffield <chris.ev...@sheffield.ac.uk>
I do some consultation work for the University of Roehampton 
<chris.ev...@roehampton.ac.uk> and other places but this <ch...@psyctc.org> 
remains my main Email address.
I have "semigrated" to France, see: 
https://www.psyctc.org/pelerinage2016/semigrating-to-france/ if you want to 
book to talk, I am trying to keep that to Thursdays and my diary is now 
available at: https://www.psyctc.org/pelerinage2016/ecwd_calendar/calendar/
Beware: French time, generally an hour ahead of UK.  That page will also take 
you to my blog which started with earlier joys in France and Spain!
