Andy Rabagliati <[EMAIL PROTECTED]> writes:

>   I have configured wwwoffle to work across an intermittent
>   UUCP transport. This is for schools in South Africa, where
>   I am configuring them to use cheap overnight dialup for
>   email and WWW access.
> 
>   For WWW access, they have wwwoffle on the LAN - in offline
>   mode. At dialup time, I tar up all the outgoing requests,
>   and pass them over to my well-connected box, untarring the
>   requests into the outgoing folder.
> 
>   I perform the fetch, make a note of all the files downloaded,
>   tar up those files, and when the school calls again in the
>   morning I pass the results back to back-fill the school cache.
> 
>   All works great - Thanks !!
> 
>   However, I would like to limit the size of the scrape - in
>   case things get out of hand, with heavily recursive scrapes
>   or ftp directories.
> 
>   I could put a time limit on the scrape - killing wwwoffle
>   if it runs too long, but ideally I would like it to be a
>   config option.
> 
>   Would this be hard to add ?

Unfortunately it would be hard to add because of the way that WWWOFFLE
works.  During a fetch there is a main process that keeps the required
number of fetch processes running.  The child processes that actually
do the fetching of the URLs do not report back the amount of data that
they have fetched.

To make a change so that the total amount of data fetched can be
collated would require either a very big change or a kludge bolted
onto the existing system.
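That said, if runaway disk usage is the real worry, a crude external check of the spool size can approximate a size limit without touching WWWOFFLE itself.  The sketch below is untested and makes assumptions: the spool path (/var/spool/wwwoffle), the limit, and the polling interval are all placeholders to adjust for your installation, and this is my illustration rather than any WWWOFFLE feature:

```shell
#!/bin/sh
# Size-based cutoff sketch.  ASSUMPTIONS: the fetched data accumulates
# under $SPOOL (adjust to your install), and 50 MB is an arbitrary limit.

SPOOL=/var/spool/wwwoffle
LIMIT_KB=51200          # stop fetching beyond ~50 MB

# Succeeds (exit 0) when directory $1 occupies more than $2 kilobytes,
# as reported by 'du -sk'.
over_limit() {
    used=`du -sk "$1" | cut -f1`
    [ "$used" -gt "$2" ]
}

# Run this loop alongside 'wwwoffle -fetch'; going offline ends the fetch.
# (Commented out here so the sketch is safe to run standalone.)
# while sleep 30; do
#     over_limit "$SPOOL" "$LIMIT_KB" && { wwwoffle -offline; break; }
# done
```

This only measures the whole spool rather than the current fetch, so it is a blunt instrument, but it needs no changes to WWWOFFLE.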

Setting a time limit should be easy enough to do with a shell script
that calls 'wwwoffle -fetch' to start with and 'wwwoffle -offline' to
stop the fetching.

A script like the following (untested) one should do the trick:

-------------------- wwwoffle_fetch_timeout.sh --------------------
#!/bin/sh

# Maximum fetch time in seconds.
TIMEOUT=600

# Put WWWOFFLE online so that the fetch can run.
wwwoffle -online

# Start the fetch in the background and remember its pid.
wwwoffle -fetch > wwwoffle-fetch.log 2>&1 &
pid1=$!

# Timeout watchdog: after $TIMEOUT seconds, going offline ends the fetch.
( sleep "$TIMEOUT" ; wwwoffle -offline ) > /dev/null 2>&1 &
pid2=$!

# Wait for the fetch; once it finishes, cancel the pending watchdog.
wait $pid1 > /dev/null 2>&1 ; kill -KILL $pid2 > /dev/null 2>&1

# Make sure WWWOFFLE ends up offline either way.
wwwoffle -offline
-------------------- wwwoffle_fetch_timeout.sh --------------------

This will start the fetch process in the background.  In a second
background task it will sleep for a while and then set wwwoffle
offline (which will end the fetch process).  While this is going on
the shell script is waiting for the fetch to finish and when it does
it will kill the timeout.  So whichever finishes first, the timeout
or the fetch, will stop the other and allow the script to continue.

Unless your process IDs wrap around within the timeout, there should
be no problem.
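The same wait-and-kill race pattern can be demonstrated without WWWOFFLE at all.  In this sketch a 1-second 'sleep' stands in for 'wwwoffle -fetch' and the watchdog branch is a plain echo, so the timing behaviour can be tried safely:

```shell
#!/bin/sh
# Demonstration of the wait/kill race pattern from the script above,
# with 'sleep 1' standing in for the fetch so it is harmless to run.

TIMEOUT=3

sleep 1 &                       # the "fetch": finishes after 1 second
pid1=$!

( sleep "$TIMEOUT" ; echo "timeout fired" ) &
pid2=$!

wait $pid1                      # returns as soon as the "fetch" ends
kill -KILL $pid2 2>/dev/null    # cancel the pending watchdog subshell
RESULT=fetch
echo "$RESULT finished first"
```

Because the fetch stand-in ends long before the watchdog, the subshell is killed before its echo runs, and "fetch finished first" is printed.  Swap the two sleep durations to see the watchdog win instead.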

-- 
Andrew.
----------------------------------------------------------------------
Andrew M. Bishop                             [EMAIL PROTECTED]
                                      http://www.gedanken.demon.co.uk/

WWWOFFLE users page:
        http://www.gedanken.demon.co.uk/wwwoffle/version-2.7/user.html
