I don't know if it's an option for you, but we make a full replica of the production search indices available in WMF cloud. The full Elasticsearch query DSL can be used to query these instances. See: https://wikitech.wikimedia.org/wiki/Help:CirrusSearch_elasticsearch_replicas
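As a rough illustration of what paging past 10k results could look like with the Elasticsearch DSL, here is a minimal sketch using `search_after`, which continues from the sort values of the last hit instead of using an offset. Several things here are assumptions and not taken from the help page: the base URL and index name are placeholders, the sort field `page_id` is a guess at a unique per-document field, and whether the replicas accept this kind of request should be checked against the page linked above. The helper names (`build_search_body`, `iter_hits`, `make_http_sender`) are made up for this example.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, Iterator, List, Optional


def build_search_body(
    query_string: str,
    page_size: int = 500,
    search_after: Optional[List[Any]] = None,
    sort_field: str = "page_id",  # assumption: a unique, sortable field
) -> Dict[str, Any]:
    """Build one Elasticsearch _search request body for search_after paging."""
    body: Dict[str, Any] = {
        "size": page_size,
        "query": {"query_string": {"query": query_string}},
        # search_after needs a deterministic sort order.
        "sort": [{sort_field: "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body


def iter_hits(
    send: Callable[[Dict[str, Any]], Dict[str, Any]],
    query_string: str,
    page_size: int = 500,
    sort_field: str = "page_id",
) -> Iterator[Dict[str, Any]]:
    """Yield every hit, requesting pages until an empty page comes back."""
    search_after: Optional[List[Any]] = None
    while True:
        body = build_search_body(query_string, page_size, search_after, sort_field)
        hits = send(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Continue from the sort values of the last hit on this page.
        search_after = hits[-1]["sort"]


def make_http_sender(
    base_url: str, index: str
) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Return a send() that POSTs a body to <base_url>/<index>/_search.

    base_url and index are placeholders; take the real values from the
    help page.
    """
    def send(body: Dict[str, Any]) -> Dict[str, Any]:
        request = urllib.request.Request(
            f"{base_url}/{index}/_search",
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    return send
```

Usage would be along the lines of `for hit in iter_hits(make_http_sender(BASE_URL, INDEX), "train"): ...`, with `BASE_URL` and `INDEX` filled in from the documentation. Keeping the HTTP call behind `send` means the paging logic can be tried against canned responses before touching the real cluster.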
Erik B.

On Mon, Nov 27, 2023 at 12:02 PM <rzisso...@gmail.com> wrote:
> Hello,
>
> I am currently gathering image data for my master's thesis. I am using the
> QLabels from Wikidata to crawl specific image classes (like axe, car, etc.).
>
> I am using the Action API for my requests, and now my problem:
>
> The QLabel Q870 (train) has around 21k images. I am using the sroffset
> parameter and the "continue" parameter from the response to search for 500
> images at a time. The script works until I reach the 10k limit (the
> message is like: "you request exceeded the limit of 10000 items ..."). Is
> there any way to crawl more than 10k items/images from one
> search query?
>
> My search query looks like this:
> params = {
>     'action': 'query',
>     'format': 'json',
>     'list': 'search',
>     'srsearch': search_query,
>     'srnamespace': '0|6|12|14|100|106',  # Namespace filter based on the provided URL
>     'srlimit': batch_size,  # Number of images per batch
>     'sroffset': start,  # Offset for pagination
>     'prop': 'info|imageinfo',  # Request additional information about the pages (images)
>     'inprop': 'url'  # Include the URL information
> }
> The 'sroffset' parameter is always updated with the result from the
> "continue" param of the response I get.
>
> It would be great if somebody could help me!
>
> Thank you!
> Kind regards
> Ruben
> _______________________________________________
> Mediawiki-api mailing list -- mediawiki-api@lists.wikimedia.org
> To unsubscribe send an email to mediawiki-api-le...@lists.wikimedia.org