On Mon, Apr 30, 2012 at 11:08:40PM +0200, Petr Onderka wrote:
> 
> In other words, if I find some results for some page on API page n and
> no results on API page n+1, can I be sure there will be no results on
> pages > n?

Not necessarily. In most cases that assumption should be true, but I see
a few cases offhand where it wouldn't be:

* If you're using prop=revisions&revids=...&rvprop=content with
  revisions big enough that the API response size limit comes into play,
  you could wind up in a situation where the initial query returns
  revision 1 from page A, the second returns revision 2 from page B, and
  the third returns revision 3 from page A again.
* Some modules, such as prop=extlinks, cannot use anything sane for the
  continue parameter (or else MySQL blows up), so they just use "offset
  into the arbitrarily-ordered set of results". It's possible that edits
  made to the wiki between your calls could change the result set so
  that values are repeated, skipped, or both.
* If you are using multiple modules, one might go through the pages in
  order by page_id while another goes by title, or something along those
  lines. In practice, all the modules that commonly continue seem to
  order by page_id, so the only way you're likely to run into this is if
  the API response size limit forces modules that usually don't
  continue, such as categoryinfo or imageinfo, to do so.
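
One way to cope with all three cases is to merge results by page ID
instead of relying on the order in which they arrive. A minimal sketch
in Python, assuming the JSON layout where pages sit under query.pages
keyed by page ID; merge_batches is a hypothetical helper, not part of
any library. It tolerates repeated values but cannot detect skipped
ones.

```python
def merge_batches(batches, prop="categories"):
    """Collect the values of one prop module per page ID.

    `batches` is any iterable of already-parsed API responses. Pages may
    show up in any order and more than once; values are accumulated
    under their page ID and exact duplicates are dropped.
    """
    merged = {}
    for batch in batches:
        pages = batch.get("query", {}).get("pages", {})
        for page_id, page in pages.items():
            values = merged.setdefault(page_id, [])
            for item in page.get(prop, []):
                # A linear scan is enough for a sketch; the items are
                # small dicts, so they cannot go into a set directly.
                if item not in values:
                    values.append(item)
    return merged
```

With this, a page that reappears in a later response (page A, then
page B, then page A again) simply has its new values appended to what
was already collected.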

I haven't checked any of the prop modules provided by extensions, BTW.
Chances are most of those are well-behaved and order by page_id, but
it's possible some of them may do things differently.

> I am writing a library to access the API and every collection in my
> library is lazy.
> 
> For example, a user requests to know categories of pages in
> Category:Query languages.
> 
> When he starts iterating over the result, I execute the query:
> http://en.wikipedia.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Query%20languages&prop=categories
> 
> When he then requests to know the categories of the third page in the
> result (Access query language),
> I will return to him the categories from the first query. If he
> requests more, I execute the query:
> http://en.wikipedia.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Query%20languages&prop=categories&clcontinue=494528|All%20pages%20needing%20cleanup

How do you determine that you should look at "Access query language"
first rather than one of the other pages?

In my bot code, I have something that behaves similarly: you give it a
query, and it gives back a series of result pages. But my version will
process clcontinue all the way to the end right away; the laziness is
only in handling gcmcontinue. That way I can be sure that the page nodes
returned by successive calls will have all the necessary data without
worrying about the ordering of the prop module results.
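
Roughly, that approach could look like the following Python sketch. It
is an illustration, not my actual bot code. Assumptions: `fetch` is a
hypothetical callable that takes a dict of query parameters and returns
the parsed JSON response, and continuation values arrive in a
"query-continue" element keyed by module name. The function names are
made up for the example.

```python
def flatten_continue(result):
    """Merge the per-module dicts under "query-continue" into one dict."""
    flat = {}
    for module_values in result.get("query-continue", {}).values():
        flat.update(module_values)
    return flat


def iter_complete_pages(fetch, base_params, prop="categories",
                        prop_continue="clcontinue",
                        gen_continue="gcmcontinue"):
    """Yield pages with complete prop data, one generator batch at a time.

    The prop continuation is followed to the end before anything is
    yielded; the generator continuation is only followed once the caller
    has consumed the current batch.
    """
    gen_params = {}
    while True:
        pages = {}
        prop_params = {}
        next_gen = None
        while True:
            result = fetch({**base_params, **gen_params, **prop_params})
            batch = result.get("query", {}).get("pages", {})
            for page_id, page in batch.items():
                entry = pages.setdefault(
                    page_id, {"title": page.get("title"), prop: []})
                entry[prop].extend(page.get(prop, []))
            cont = flatten_continue(result)
            if gen_continue in cont:
                # Remember where the generator resumes, but do not
                # follow it yet.
                next_gen = cont[gen_continue]
            if prop_continue not in cont:
                break
            prop_params = {prop_continue: cont[prop_continue]}
        # Every page in this batch now has all of its prop values.
        yield from pages.values()
        if next_gen is None:
            return
        gen_params = {gen_continue: next_gen}
```

The cost is a few extra requests up front for each batch; in exchange,
the caller never sees a page whose categories are only partly loaded.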

_______________________________________________
Mediawiki-api mailing list
Mediawiki-api@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
