Andy Rabagliati <[EMAIL PROTECTED]> writes:
> I am passing the outgoing queue of requests from a disconnected LAN
> back to a well connected machine for wwwoffle -fetch.
>
> After fetching, I am packing up the fetched files for delivery back
> to the LAN machine for serving.
>
> In order to locate all the files created from this fetch session,
> including embedded images and recursive fetches, I am parsing the
> output of wwwoffle -fetch with a perl script.
>
> So far, so good.
>
> The problem that now arises is that if a second batch of requests
> arrives for another remote machine, the requests all pile up in
> the same outgoing directory, and my perl script can no longer
> differentiate between the batches.
>
> I could serialise all instances, but that slows things considerably.
>
> I am considering adding a --outgoing <dir> to the wwwoffle program,
> allowing me to use the same file cache but parse the batches separately.
>
> Is there a better way to do it ?
I think that serializing the requests is the best way of doing it
(even though you don't like the idea). I don't agree with your
objection to it, and serializing also solves another problem.
I don't see how it can be any slower to serialize. You will make the
same number of requests if you do it serially or in parallel. If you
get four requests at the same time then processing them serially
will mean that one request is completed after time T, one after time
2*T, one after time 3*T and one after time 4*T. When doing it in
parallel they all take 4*T to finish.
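The timing argument above can be sketched with a little arithmetic (a minimal illustration, assuming four equal-sized requests that each need time T of the link's full capacity):

```python
# Serial vs. parallel completion times for n equal requests on one link.
# Serially, request i finishes after (i+1)*T; in parallel the link is
# shared n ways, so every request finishes together at n*T.
T = 1.0
n = 4

serial_finish = [(i + 1) * T for i in range(n)]   # [1.0, 2.0, 3.0, 4.0]
parallel_finish = [n * T] * n                     # [4.0, 4.0, 4.0, 4.0]

print("serial average finish:  ", sum(serial_finish) / n)    # 2.5
print("parallel average finish:", sum(parallel_finish) / n)  # 4.0
```

The total time is the same either way; serializing only changes when each individual batch becomes available, and on average each batch is ready sooner.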
If you have more than one fetch occurring in parallel, another problem
that you get is that you end up with an inconsistent state. A second
fetch may be updating one of the pages while you are selecting the
modified pages from the cache. This could result in sending an
incomplete page back to the first requestor.
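One simple way to serialize the fetch sessions from a wrapper script is an exclusive lock file, so a second batch blocks until the first fetch (and the packing-up of its results) has finished. This is only a sketch; the lock path and the run_fetch_session() body are illustrative placeholders, not part of WWWOFFLE:

```python
# Serialize fetch sessions with an exclusive flock on a lock file.
# A second invocation blocks at LOCK_EX until the first releases it.
import fcntl

LOCKFILE = "/tmp/wwwoffle-fetch.lock"  # hypothetical location

def run_fetch_session(batch):
    # Placeholder for: run wwwoffle -fetch, parse its output, pack up
    # the fetched files for the remote LAN.
    return "fetched %s" % batch

def serialized_fetch(batch):
    with open(LOCKFILE, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # blocks until we own the lock
        try:
            return run_fetch_session(batch)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)

print(serialized_fetch("siteA"))
```

Because the lock covers both the fetch and the selection of modified pages, the inconsistent-state problem above goes away as well.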
Another problem is a more general one. If you don't keep exactly the
same cache locally as is stored remotely then you can send incomplete
updates. For example if the first requestor asks for a particular
page it will be fetched as will the images. The second requestor then
asks for the same page. The WWWOFFLE fetch process doesn't fetch it
again because it is already up to date (due to the first request), but
the page that is in the cache is newer than the one that the second
requestor has. This means that they don't receive the changed page.
If the page changes infrequently then this can delay the updating of
the page at the remote site.
I think that the best way of doing this (if you can afford the disk
space) is to run multiple WWWOFFLE servers on the well connected
machine. Each of these is a mirror of one of the remote machines.
When you want to find the changes you just need to find the files that
have been modified (possibly using your existing script). You can
then just send the new files back to the remote site.
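Finding the modified files can be as simple as recording a timestamp before the fetch and walking the cache afterwards (a sketch under the assumption that the cache is an ordinary directory tree; the root path is illustrative):

```python
# Collect every file under `root` whose mtime is at or after
# `start_time` -- i.e. everything a fetch session touched.
import os

def modified_since(root, start_time):
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) >= start_time:
                changed.append(path)
    return changed

# Typical use: start = time.time(); run the fetch;
# then modified_since("/var/spool/wwwoffle", start).
```

The same effect is available from the shell with find's -newer test against a marker file touched before the fetch.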
To reduce the total network bandwidth on the well connected site you
can make all of the WWWOFFLE proxies go through another proxy so that
you don't need to fetch common pages more than once.
--
Andrew.
----------------------------------------------------------------------
Andrew M. Bishop [EMAIL PROTECTED]
http://www.gedanken.demon.co.uk/
WWWOFFLE users page:
http://www.gedanken.demon.co.uk/wwwoffle/version-2.7/user.html