That sounds awesome! You have my vote... :)

On Tue, Aug 9, 2011 at 4:49 AM, Gijs van Tulder <gvtul...@gmail.com> wrote:

> Hi,
>
> I'd like to propose a new feature that allows Wget to make WARC files.
>
> Perhaps you're already familiar with it, but in short: WARC is a file
> format for web archives. In a single WARC file, you can store every file of
> the website, plus the HTTP request and response headers and other metadata.
> This makes it a very useful format for web archivists: you keep everything
> together, in the most detailed and original form.
>
> The WARC format (an ISO standard, ISO 28500) has been developed by the
> International Internet Preservation Consortium, which includes the Internet
> Archive and many national libraries. It is supposed to become *the* standard
> file format for web archives. For example, it is used in the Internet
> Archive's Wayback Machine and its Heritrix crawler. There are several
> projects building tools to work with WARC files.
>
> It would be cool if Wget could become one of these tools. Already the Swiss
> army knife for mirroring websites, the one thing that Wget is missing is a
> good way to store these mirrors. The current output of --mirror is not
> sufficient for archival purposes:
>
> - it throws away the HTTP headers (of the request and response);
> - it doesn't keep 404 pages and redirects;
> - it doesn't store the original urls but mangles the filenames;
> - and, if you're not careful, it even rewrites the links inside
>   the documents that it has downloaded.
>
> The WARC format supports these things.
>
> With some help from others, I've added WARC functions to Wget. With the
> --warc-file option you can specify that the mirror should also be written to
> a WARC archive. Wget will then keep everything, including the HTTP request
> and response headers, redirects and 404 pages.
>
> Do you think this is something that could be included in the main Wget
> version? If that's the case, what should be the next step?
>
> Description, links to more information about WARC:
> http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
>
> Code:
> https://github.com/alard/wget-warc/
> https://github.com/downloads/alard/wget-warc/wget-warc-20110809.tar.bz2
>
> The implementation makes use of the open source WARC Tools library
> (Apache License 2.0):
> http://code.google.com/p/warc-tools/
>
> I look forward to your response.
>
> Kind regards,
>
> Gijs van Tulder

--
Patrick Steil | ChurchBuzz.org
Church Website Optimization <http://www.churchbuzz.org/>