Hi James,

Wget2 is built on top of the libwget library, which uses asynchronous network calls. However, Wget2 is written such that it uses only one connection per thread. This is a design decision made to keep the codebase simple. If you want a more complex crawler, you can use libwget to write your own, as Tim suggested in his email; a rough sketch of what that looks like is below.
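To give you an idea of the starting point, here is a minimal, untested sketch against libwget's high-level API, loosely modelled on the examples shipped in the wget2 source tree. The option key and struct fields are my assumptions from the 1.99.x headers, so please check wget.h on your installation:

    #include <stdio.h>
    #include <wget.h>

    int main(void)
    {
        /* Single blocking GET via libwget's high-level API. The
         * WGET_HTTP_URL key and the terminating 0 follow the varargs
         * style used in the examples/ directory of the wget2 tree;
         * verify the exact keys against wget.h for your version. */
        wget_http_response_t *resp = wget_http_get(
            WGET_HTTP_URL, "https://example.com",
            0);

        if (resp) {
            /* resp->body holds the downloaded document. A crawler
             * would extract links here and feed them back into its
             * own URL queue, running one such loop per worker thread. */
            printf("HTTP %d, %zu bytes\n",
                resp->code, resp->body ? resp->body->length : 0);
            wget_http_free_response(&resp);
        }

        return 0;
    }

Compile with something like "cc crawl.c -lwget". From there, scaling up is mostly a matter of running several such fetch loops in parallel threads, which is essentially what Wget2 itself does.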
Instead of this kind of async behaviour, we rely on HTTP/2 multiplexed streams, which allow multiple requests to be sent in parallel over the same connection. So when crawling a website over HTTP/2, Wget2 gets the benefits of async access without needing all of those extra code paths.

* James Read <[email protected]> [180731 20:28]:
> Thanks,
>
> as I understand it though there is only so much you can do with threading.
> For more scalable solutions you need to go with async programming
> techniques. See http://www.kegel.com/c10k.html for a summary of the
> problem. I want to do large scale webcrawling and am not sure if wget2 is
> up to the job.
>
> On Tue, Jul 31, 2018 at 6:22 PM, Tim Rühsen <[email protected]> wrote:
>
> > On 31.07.2018 18:39, James Read wrote:
> > > Hi,
> > >
> > > how much work would it take to convert wget into a fully fledged
> > > asynchronous webcrawler?
> > >
> > > I was thinking something like using select. Ideally, I want to be able to
> > > supply wget with a list of starting point URLs and then for wget to crawl
> > > the web from those starting points in an asynchronous fashion.
> > >
> > > James
> >
> > Just use wget2. It is already packaged in Debian sid.
> > To build from git source, see https://gitlab.com/gnuwget/wget2.
> >
> > To build from tarball (much easier), download from
> > https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.
> >
> > Regards, Tim

-- 
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
