Hi James,

Wget2 is built on top of the libwget library, which uses asynchronous network calls. However, Wget2 is written such that it uses only one connection per thread. This is a design decision made to keep the codebase simple. If you want a more complex crawler, you can use libwget to write your own, as Tim suggested in his email; a rough sketch of what that looks like is below.
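To give you an idea of the starting point, here is a minimal, untested sketch against libwget's high-level API, loosely modelled on the examples shipped in the wget2 source tree. The option key and struct fields are my assumptions from the 1.99.x headers, so please check wget.h on your installation:

    #include <stdio.h>
    #include <wget.h>

    int main(void)
    {
        /* Single blocking GET via libwget's high-level API. The
         * WGET_HTTP_URL key and the terminating 0 follow the varargs
         * style used in the examples/ directory of the wget2 tree;
         * verify the exact keys against wget.h for your version. */
        wget_http_response_t *resp = wget_http_get(
            WGET_HTTP_URL, "https://example.com",
            0);

        if (resp) {
            /* resp->body holds the downloaded document. A crawler
             * would extract links here and feed them back into its
             * own URL queue, running one such loop per worker thread. */
            printf("HTTP %d, %zu bytes\n",
                resp->code, resp->body ? resp->body->length : 0);
            wget_http_free_response(&resp);
        }

        return 0;
    }

Compile with something like "cc crawl.c -lwget". From there, scaling up is mostly a matter of running several such fetch loops in parallel threads, which is essentially what Wget2 itself does.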
Instead of this kind of async behaviour, we rely on HTTP/2 multiplexed streams, which allow multiple requests to be sent in parallel over the same connection. So when crawling a website over HTTP/2, Wget2 gets the benefits of async access without needing all of those extra code paths.

* James Read <[email protected]> [180731 20:28]:
> Thanks,
>
> as I understand it though there is only so much you can do with threading.
> For more scalable solutions you need to go with async programming
> techniques. See http://www.kegel.com/c10k.html for a summary of the
> problem. I want to do large scale webcrawling and am not sure if wget2 is
> up to the job.
>
> On Tue, Jul 31, 2018 at 6:22 PM, Tim Rühsen <[email protected]> wrote:
>
> > On 31.07.2018 18:39, James Read wrote:
> > > Hi,
> > >
> > > how much work would it take to convert wget into a fully fledged
> > > asynchronous webcrawler?
> > >
> > > I was thinking something like using select. Ideally, I want to be able to
> > > supply wget with a list of starting point URLs and then for wget to crawl
> > > the web from those starting points in an asynchronous fashion.
> > >
> > > James
> >
> > Just use wget2. It is already packaged in Debian sid.
> > To build from git source, see https://gitlab.com/gnuwget/wget2.
> >
> > To build from tarball (much easier), download from
> > https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.
> >
> > Regards, Tim

-- 
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
