I don't know if it's an option for you, but we make a full replica of the
production search indices available in WMF cloud. The full Elasticsearch
DSL can be used to query these instances. See:
https://wikitech.wikimedia.org/wiki/Help:CirrusSearch_elasticsearch_replicas
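
A minimal sketch of how paging past 10k results could look with the DSL,
using Elasticsearch's search_after instead of an offset. It only builds the
request bodies; sending them to a replica is left out, since the endpoint
and index names are documented on the page above. The field names page_id
and statement_keywords, the value "P180=Q870", and both helper names are
assumptions for illustration and should be checked against the actual
index mapping.

```python
import json


def build_search_body(query_clause, page_size=500, search_after=None):
    """Build an Elasticsearch search body that pages with search_after.

    query_clause is any Elasticsearch query DSL clause (a dict).
    The sort field 'page_id' is an ASSUMPTION about the index mapping;
    search_after needs a sort key that is unique per document.
    """
    body = {
        "size": page_size,
        "query": query_clause,
        "sort": [{"page_id": "asc"}],  # assumed unique numeric field
    }
    if search_after is not None:
        # Resume directly after the last hit of the previous page.
        body["search_after"] = list(search_after)
    return body


def next_search_after(response):
    """Return the sort values of the last hit, or None when no hits remain."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    return hits[-1]["sort"]


# Hypothetical query clause: files whose structured data depicts Q870.
# The field name and value format are assumptions, not confirmed here.
query = {"match": {"statement_keywords": "P180=Q870"}}

first_page = build_search_body(query)
print(json.dumps(first_page, indent=2))
```

Each response's last "sort" value is passed back as search_after for the
next request, and the loop ends when a page comes back with no hits.
Because no offset is involved, the 10000-result window that limits
from/size paging does not apply to this style of paging.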

Erik B.

On Mon, Nov 27, 2023 at 12:02 PM <rzisso...@gmail.com> wrote:

> Hello,
>
> I am currently gathering image data for my master's thesis. I am using the
> QLabels from Wikidata to crawl specific image classes (like axe, car etc.).
>
> I am using the Action API for my requests, and this is my problem:
>
> The QLabel Q870 (train) has around 21k images. I am using the sroffset
> parameter and the "continue" parameter from the response to search for 500
> images at a time. The script works until I reach the 10k limit (the
> message is something like "your request exceeded the limit of 10000
> items ..."). Is there any way to crawl more than 10k items/images from
> one search query?
>
> My search query looks like this:
> params = {
>     'action': 'query',
>     'format': 'json',
>     'list': 'search',
>     'srsearch': search_query,
>     # Namespace filter based on the provided URL
>     'srnamespace': '0|6|12|14|100|106',
>     'srlimit': batch_size,  # Number of images per batch
>     'sroffset': start,  # Offset for pagination
>     # Request additional information about the pages (images)
>     'prop': 'info|imageinfo',
>     'inprop': 'url'  # Include the URL information
> }
> The 'sroffset' parameter is updated on every request with the value of
> the "continue" param from the previous response.
>
> It would be great if somebody could help me!
>
> Thank you!
> Kind regards
> Ruben
> _______________________________________________
> Mediawiki-api mailing list -- mediawiki-api@lists.wikimedia.org
> To unsubscribe send an email to mediawiki-api-le...@lists.wikimedia.org
>