Hi James,

Wget2 is built on top of the libwget library, which uses asynchronous
network calls. However, Wget2 is written such that it uses only one
connection per thread. That is a deliberate design decision to keep the
codebase simple. If you want a more complex crawler, you can use libwget
to write your own, as Tim suggested in his email.
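
For a taste of what that looks like, here is a minimal sketch of a
single-page fetch with libwget, loosely based on the http_get example
shipped in the wget2 source tree. I am writing this from memory, so
take the exact type and option names with a grain of salt (they may
differ between releases), and example.com is of course a placeholder:

#include <stdio.h>
#include <stdlib.h>
#include <wget.h>

/* NOTE: identifiers below are from memory and may differ slightly
 * between libwget releases; check wget.h for the ones you have. */
int main(void)
{
    wget_http_response_t *resp;

    /* initialize the library; send log output to the usual streams */
    wget_global_init(
        WGET_DEBUG_STREAM, stderr,
        WGET_ERROR_STREAM, stderr,
        WGET_INFO_STREAM, stdout,
        0);

    /* synchronous convenience wrapper: one GET request,
     * following up to 5 redirections */
    resp = wget_http_get(
        WGET_HTTP_URL, "https://example.com",
        WGET_HTTP_MAX_REDIRECTIONS, 5,
        0);

    if (resp) {
        /* assume the body is text and print it */
        printf("%s\n", resp->body->data);
        wget_http_free_response(&resp);
    }

    wget_global_deinit();
    return EXIT_SUCCESS;
}

A crawler would then parse resp->body for links (libwget ships HTML
parsing helpers for this) and feed them into a queue of worker threads,
each holding one connection as described above.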

Instead of that kind of async behaviour, we rely on HTTP/2 multiplexed
streams, which let us send multiple requests over the same connection
in parallel. So, when crawling a website over HTTP/2, Wget2 gets the
benefits of async access without needing all those extra code paths.
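
You can see the multiplexing in action by limiting Wget2 to a single
thread, and hence a single connection. Assuming the server speaks
HTTP/2 (and again treating the URLs as placeholders):

  wget2 --max-threads=1 https://example.com/a https://example.com/b

Both requests are sent as parallel streams over that one connection
rather than one after the other.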


* James Read <[email protected]> [180731 20:28]:
> Thanks,
> 
> as I understand it though there is only so much you can do with threading.
> For more scalable solutions you need to go with async programming
> techniques. See http://www.kegel.com/c10k.html for a summary of the
> problem. I want to do large scale webcrawling and am not sure if wget2 is
> up to the job.
> 
> On Tue, Jul 31, 2018 at 6:22 PM, Tim Rühsen <[email protected]> wrote:
> 
> > On 31.07.2018 18:39, James Read wrote:
> > > Hi,
> > >
> > > how much work would it take to convert wget into a fully fledged
> > > asynchronous webcrawler?
> > >
> > > I was thinking something like using select. Ideally, I want to be able to
> > > supply wget with a list of starting point URLs and then for wget to crawl
> > > the web from those starting points in an asynchronous fashion.
> > >
> > > James
> > >
> >
> > Just use wget2. It is already packaged in Debian sid.
> > To build from git source, see https://gitlab.com/gnuwget/wget2.
> >
> > To build from tarball (much easier), download from
> > https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.
> >
> > Regards, Tim
> >
> >
> 

-- 
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
