On Tue, 26 Mar 2019 at 08:25, Bertrand Delacretaz
<bdelacre...@apache.org> wrote:
>
> On Mon, Mar 25, 2019 at 10:16 AM Bertrand Delacretaz
> <bdelacre...@apache.org> wrote:
> > ...I have saved the contents of https://wiki.apache.org/incubator/ at
> > https://svn.apache.org/repos/private/pmc/incubator/wiki-archive-march-2019/ 
> > ...
>
> FWIW, as someone was asking how that was done, I just used
>
>    wget -r -l5 -np https://wiki.apache.org/incubator/
>
> and then semi-manually removed the help pages based on their names,
> which are in several languages.

Unfortunately that does not seem to capture all the pages. For example,
the following is missing:

https://wiki.apache.org/incubator/WookieProposal

Something must have gone wrong with the download.

I did another download using the following command:

wget -r -np -l1 --reject-regex '(.*)\?(.*)' \
    http://wiki.apache.org/incubator/TitleIndex

The TitleIndex should link to every page, so there is no need to
follow links any deeper than one level (-l1).
The regex rejects any URL containing a '?', which stops wget from
requesting raw pages etc.
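
As a rough illustration of what that regex filters out, the pattern can
be tried on a few sample URLs with grep. This is only a sketch: the
query-string URLs are made-up examples, and grep -E merely approximates
how wget applies its POSIX regex to the full URL.

```shell
# Sample URLs (illustrative only, not taken from the actual crawl)
urls='https://wiki.apache.org/incubator/WookieProposal
https://wiki.apache.org/incubator/WookieProposal?action=raw
https://wiki.apache.org/incubator/TitleIndex
https://wiki.apache.org/incubator/TitleIndex?action=print'

# -v inverts the match: keep only URLs the reject pattern does NOT match,
# i.e. those without a literal '?'
kept=$(printf '%s\n' "$urls" | grep -Ev '(.*)\?(.*)')
printf '%s\n' "$kept"
```

Only the two plain page URLs survive; anything carrying a query string
is dropped.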

I also used the following .wgetrc:
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
header = Connection: keep-alive
referer = /
robots = off
random_wait = on
wait = 1
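
One way to find out which pages the first download missed is to compare
the file listings of the two download directories. A minimal sketch,
assuming each download sits in its own directory; the directory names
are hypothetical, and dummy files in a temp dir stand in for the real
downloads:

```shell
# Stand-in directories for the old and new downloads (hypothetical names)
work=$(mktemp -d)
old="$work/old/incubator"
new="$work/new/incubator"
mkdir -p "$old" "$new"
touch "$old/FrontPage" "$new/FrontPage" "$new/WookieProposal"

# Sorted listings of each directory
ls "$old" | sort > "$work/old.txt"
ls "$new" | sort > "$work/new.txt"

# comm -13 prints only the lines unique to the second file,
# i.e. pages present in the new download but absent from the old one
missing=$(comm -13 "$work/old.txt" "$work/new.txt")
printf '%s\n' "$missing"

rm -rf "$work"
```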

OK to add the missing pages to the archive?

S.



> -Bertrand
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>

