Re: [Bug-wget] Concurrency and wget

Tim Ruehsen Tue, 10 Apr 2012 08:52:37 -0700

Meanwhile, I have written a simple proof of concept (parallel dummy downloads
using threads, dummy downloading of chunks, etc.).
I am now at the point where I want to implement Metalink in HTTP headers
(RFC 6249).
I just can't find any servers to test against... maybe you can help me out?
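
Without a test server, the header handling could at least be exercised against canned header lines. The following is a minimal Python sketch, not the proof of concept mentioned above. It assumes each Link header value carries exactly one link in the `<uri>; key=value; flag` form used by RFC 6249, that mirrors are marked `rel=duplicate`, and that a lower `pri` means a more preferred mirror. The names `parse_link` and `mirrors_by_priority` and the example.com URLs are made up for illustration.

```python
def parse_link(value):
    """Parse one Link header value '<uri>; k=v; flag' into (uri, params)."""
    target, _, rest = value.partition(">")
    uri = target.strip().lstrip("<")
    params = {}
    for part in rest.split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, val = part.partition("=")
        # A parameter without '=' (such as 'pref') is stored as a flag.
        params[key.strip().lower()] = val.strip().strip('"') if sep else True
    return uri, params


def mirrors_by_priority(link_values):
    """Return the rel=duplicate URIs, most preferred (lowest pri) first."""
    mirrors = []
    for value in link_values:
        uri, params = parse_link(value)
        if params.get("rel") == "duplicate":
            # Assumption: a link without 'pri' is treated as lowest priority.
            mirrors.append((int(params.get("pri", 999999)), uri))
    # sorted() is stable, so equal priorities keep their header order.
    return [uri for _, uri in sorted(mirrors, key=lambda m: m[0])]


# Canned Link header values standing in for a real server response.
SAMPLE_LINKS = [
    '<http://www2.example.com/example.ext>; rel=duplicate; pri=2',
    '<ftp://ftp.example.com/example.ext>; rel=duplicate; pri=1; pref',
    '<http://example.com/example.ext.meta4>; rel=describedby; '
    'type="application/metalink4+xml"',
]
```

Calling `mirrors_by_priority(SAMPLE_LINKS)` yields the two mirror URIs with the FTP one first; the `describedby` link is ignored by this sketch.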


Well, since there was no response to my previous post: is there any interest
in getting this done anyway?

    Tim

On Tuesday 03 April 2012, Tim Ruehsen wrote:
> Hi Giuseppe, hi Micah,
> 
> while I couldn't sleep last night, I thought about wget and concurrency...
> 
> I had the idea of using a top-down approach to outline what wget is doing,
> just to have an overview without struggling with implementation details.
> As a side effect, one would have a (textual? graphical?) starting point for
> contributors to get into the project, and a chance to have a clear and
> well-documented design.
> 
> Since maintaining a flowchart is time-consuming and requires some extra
> skills and tools, plain text in the form of a "programming language" seems
> to fit.
> 
> Here is just a beginning, let's say a basis for discussion.
> If you don't mind, I would like to take part in ongoing development.
> 
> Basic wget functionality (download given URI/IRI):
> 
> main (URI) {
>       put <URI> into <queue>
> 
>       while <queue> is not empty {
>               download_and_analyse(next <queue> entry)
>       }
> }
> 
> download_and_analyse (URI) {
>       download URI to FILE
>       add URI to <downloaded>
>       remove URI from <queue>
>       scan FILE and add URIs to <queue> if not already in <downloaded>
> }
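
The basic outline above can be tried out without any network access. This is a minimal Python sketch under stated assumptions: `FAKE_WEB` and `download` are hypothetical stand-ins for real downloading and link scanning. It also deviates from the outline in one respect: it tracks a `seen` set covering queued as well as downloaded URIs, so a URI that is linked twice before it has been fetched is not queued twice.

```python
from collections import deque

# Hypothetical stand-in for the network: URI -> URIs found in that document.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
}


def download(uri):
    """Dummy download: return the links 'found' in the document."""
    return FAKE_WEB.get(uri, [])


def crawl(start_uri):
    """Sequential version of main/download_and_analyse from the outline."""
    queue = deque([start_uri])
    downloaded = []        # URIs in the order they were fetched
    seen = {start_uri}     # everything already queued or downloaded
    while queue:
        uri = queue.popleft()
        links = download(uri)
        downloaded.append(uri)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return downloaded
```

Calling `crawl("http://example.com/")` visits each of the three dummy pages exactly once, starting with the root.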
> 
> 
> Extended for simple multitasking (threaded, multi-process or even
> distributed).
> This is just one possible design for concurrent downloads.
> Maybe you have a more elegant idea.
> 
> main (URI) {
>       create <N> downloaders
>       put <URI> into <queue>
> 
>       wait for status message from downloader {
>               print status
>               if <queue> is empty {
>                       stop downloaders
>                       we are done
>               }
>       }
> }
> 
> downloader {
>       wait for and allocate entry in <queue> {
>               download_and_analyse(entry)
>       }
> }
> 
> download_and_analyse (URI) {
>       download URI to FILE
>       add URI to <downloaded>
>       remove URI from <queue>
>       scan FILE and add URIs to <queue> if not already in <downloaded>
> }
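
One way the threaded variant could look, as a hedged Python sketch rather than a proposal for the actual implementation. `FAKE_WEB` is again a hypothetical stand-in for the network, and status messages are left out. Note one deliberate difference from the outline: instead of stopping when `<queue>` is empty, main waits on `Queue.join()`, because an empty queue does not by itself mean that no downloader is still busy and about to add new URIs.

```python
import queue
import threading

# Hypothetical stand-in for the network: URI -> URIs found in that document.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
}


def crawl_parallel(start_uri, num_workers=3):
    work = queue.Queue()
    seen = {start_uri}       # URIs already queued or downloaded
    downloaded = set()
    lock = threading.Lock()  # protects 'seen' and 'downloaded'

    def downloader():
        while True:
            uri = work.get()
            if uri is None:              # sentinel: main asks us to stop
                work.task_done()
                return
            links = FAKE_WEB.get(uri, [])    # dummy download and scan
            with lock:
                downloaded.add(uri)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        work.put(link)
            # New URIs are queued before this one is marked done, so the
            # count of unfinished tasks cannot drop to zero too early.
            work.task_done()

    threads = [threading.Thread(target=downloader) for _ in range(num_workers)]
    for t in threads:
        t.start()
    work.put(start_uri)
    work.join()              # returns once every queued URI was processed
    for _ in threads:
        work.put(None)       # stop the downloaders
    for t in threads:
        t.join()
    return downloaded
```

With the dummy data, `crawl_parallel("http://example.com/")` returns the set of all three URIs regardless of how the threads interleave.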
> 
> 
> Extended to download a URI from several sources in parallel.
> main and downloader stay the same, just download_and_analyse() is extended.
> 
> download_and_analyse (URI) {
>       /* download URI to FILE */
>       put <X> chunk entries into <chunk_queue>
>       create <X> chunk_loaders
>       wait for status message from chunk_loader {
>               send modified status message to main
>               if <chunk_queue> is empty {
>                       stop chunk_loaders
>                       end loop
>               }
>       }
> 
>       add URI to <downloaded>
>       remove URI from <queue>
>       scan FILE and add URIs to <queue> if not already in <downloaded>
> }
> 
> chunk_loader {
>       wait for and allocate entry in <chunk_queue> {
>               download(entry)
>               remove entry from <chunk_queue>
>       }
> }
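
The chunked variant can be sketched the same way. This is a dummy Python illustration under assumptions: each "source" is a hypothetical callable `fetch(start, end)` returning the bytes of that range (a real one would issue a ranged request to a mirror), one chunk loader is started per source, and status messages are omitted. The helper names `split_ranges`, `make_dummy_source` and `download_chunked` are made up for this example.

```python
import queue
import threading


def split_ranges(size, chunk_size):
    """Return (start, end) byte ranges, end exclusive, covering 'size' bytes."""
    return [(start, min(start + chunk_size, size))
            for start in range(0, size, chunk_size)]


def make_dummy_source(data):
    """Hypothetical mirror: serves byte ranges from an in-memory payload."""
    def fetch(start, end):
        return data[start:end]
    return fetch


def download_chunked(sources, size, chunk_size):
    """Fetch 'size' bytes in chunks, one chunk loader thread per source."""
    chunk_queue = queue.Queue()
    for byte_range in split_ranges(size, chunk_size):
        chunk_queue.put(byte_range)

    result = bytearray(size)
    lock = threading.Lock()      # protects writes into 'result'

    def chunk_loader(fetch):
        while True:
            try:
                start, end = chunk_queue.get_nowait()
            except queue.Empty:
                return           # no chunks left: this loader is done
            data = fetch(start, end)
            with lock:
                result[start:end] = data

    loaders = [threading.Thread(target=chunk_loader, args=(fetch,))
               for fetch in sources]
    for t in loaders:
        t.start()
    for t in loaders:
        t.join()
    return bytes(result)
```

For example, with `payload = bytes(range(256)) * 4` and three sources built by `make_dummy_source(payload)`, `download_chunked(sources, len(payload), 100)` reassembles the original payload from eleven chunks.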
> 
> After some iterations we should come to a point where we can make further
> decisions:
> - how to implement concurrency (threads, processes, distributed processes,
>   cloud)
> - how to implement communication between tasks
> - is a wget rewrite reasonable ?
> - which existing code to recycle ?
> - creating libraries from existing code (e.g. libwget) or use external
> libraries
>   (e.g. for network stuff, parsing and creating URI/IRIs, etc.)
> - create a list of test code, especially for the library code
> - ... etc etc ...
> 
> 
>     Tim
