On Sat, May 6, 2017 at 9:12 PM, Abdulfattah Safa <fattah.s...@gmail.com> wrote:

> I'm trying to get all the page titles in Wikipedia in namespace 0 using the
> API, as follows:
>
> https://en.wikipedia.org/w/api.php?action=query&format=xml&list=allpages&apnamespace=0&apfilterredir=nonredirects&aplimit=max&$continue=-||$apcontinue=BASE_PAGE_TITLE
>
> I keep requesting this URL and checking whether the response contains a
> continue tag. If it does, I issue the same request again, but with
> BASE_PAGE_TITLE replaced by the value of the apcontinue attribute from the
> response.
> My application has been running for 3 days now and the number of retrieved
> titles exceeds 30M, whereas the dumps contain only about 13M.
> Any idea?
>
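
For the record, the runaway count is most likely a continuation bug: if any
of the returned continue parameters is dropped or mangled, the enumeration
can silently restart and you fetch the same pages again. Roughly, a correct
loop looks like this (a minimal Python sketch, assuming the third-party
requests library; note the literal parameter names are continue and
apcontinue, with no $ placeholders):

import requests

API = "https://en.wikipedia.org/w/api.php"

def all_titles():
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apnamespace": 0,
        "apfilterredir": "nonredirects",
        "aplimit": "max",
        "continue": "",  # opt in to the modern continuation format
    }
    while True:
        data = requests.get(API, params=params).json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break
        # Merge ALL continuation values back into the next request,
        # not just apcontinue.
        params.update(data["continue"])

Even done correctly, this crawls at typically 500 titles per request (5000
with apihighlimits), which is exactly why the dumps are the better route.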

That said, please do not scrape the wiki for this kind of request - it is a
waste of resources both for you and for the Wikimedia servers, given that
there is a faster and more reliable alternative.

Looking at https://dumps.wikimedia.org/enwiki/20170501/ you can find:

2017-05-03 07:26:20 done List of all page titles
https://dumps.wikimedia.org/enwiki/20170501/enwiki-20170501-all-titles.gz
(221.7 MB)
2017-05-03 07:22:02 done List of page titles in main namespace
https://dumps.wikimedia.org/enwiki/20170501/enwiki-20170501-all-titles-in-ns0.gz
(70.8 MB)

Use one of the above. Not only is it faster, you will also get consistent
results: by the time your loop finishes, pages will have been created and
deleted underneath it. These exports are generated to capture as consistent
a state as practically possible, and they are actively monitored by WMF
staff.
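
Once downloaded, reading the file is straightforward (a minimal Python
sketch; it assumes the file sits locally and, as with recent title dumps,
starts with a page_title header line):

import gzip

count = 0
with gzip.open("enwiki-20170501-all-titles-in-ns0.gz", "rt",
               encoding="utf-8") as f:
    next(f)  # skip the "page_title" header line
    for line in f:
        title = line.rstrip("\n")
        # ... process the title here ...
        count += 1

print(count)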
