Hi Adam,

Thanks for reaching out.

Why this matters: Today, most researchers rely on indirect signals (social-media trends or Google's autosuggest API) to infer web users' interests. Direct, aggregate search-query data would be a primary source: a real-time window into topics that are gaining (or losing) attention both globally and within each language community.

With such a dataset we could:

+ map emerging interests across regions and languages;
+ study the life cycle of topics (how fast they spike and fade);
+ improve ranking algorithms by pairing queries with the results users actually click;
+ build applications that surface underserved information needs.

These are just a few ideas off the top of my head.

Privacy: To avoid any risk of personal data leakage, only queries that appear more than X times in a given day/week would be released, never unique or low-frequency strings. Releasing aggregated click-through pairs (query: clicked page, count) would add tremendous research value without compromising user anonymity.

I am happy to dive deeper or brainstorm implementation details whenever helpful.

Cheers,
--
Sérgio Nunes

On Thu, 24 Jul 2025 at 13:09, Adam Baso <[email protected]> wrote:

> Hi Sérgio, thanks for your message. Apologies for the delayed response.
>
> Speaking on behalf of the Data Platform Engineering (where the Search
> Platform team resides and where most of the crucial knowledge for this sort
> of dataset creation resides), we're not presently considering production of
> this sort of dataset, as the focus is on different problems. It would be
> difficult to prioritize this sort of dataset creation and maintenance.
>
> However, could you tell us a bit more here on the list about some of the
> intended use cases and end users (direct and indirect) for such a dataset?
>
> Would you like to be connected with product management to discuss more
> about your use cases? I wouldn't want to suggest that it means the type of
> work will be prioritized, but our product management folks are looking for
> themes in the various use cases as they help set the context for user needs
> for the roadmap.
>
> Thanks!
> -Adam
>
> On Thu, Jul 24, 2025 at 5:57 AM Sérgio Nunes <[email protected]>
> wrote:
>
> > Hi,
> >
> > What would be the best Wikimedia interface to try to get this moving?
> >
> > Thanks for any suggestions
> > --
> > Sérgio Nunes
> >
> >
> > On Mon, 7 Jul 2025 at 13:23, Sérgio Nunes <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I would like to suggest a new *highly valuable* data dump for
> > > Wikipedia: the release of aggregated search query logs. I am aware
> > > that a previous release of search data was retracted due to privacy
> > > concerns. However, I believe there is a privacy-preserving approach
> > > that could still provide great value to researchers.
> > >
> > > My proposal is to release only aggregated query data, specifically
> > > queries that have been observed more than X times within a given day
> > > or week. The dataset could follow a simple format such as:
> > >
> > > [day or week] [query text] [frequency]
> > >
> > > This method would eliminate the risk of exposing personal or unique
> > > search queries. The dataset would be especially useful if released
> > > regularly (e.g., monthly) and broken down by language-specific
> > > Wikipedias.
> > >
> > >
> > > Is this the best forum for posting this suggestion?
> > >
> > > If you have suggestions for where to direct this proposal, or ideas
> > > for an alternative approach, I would be grateful.
> > >
> > > Best regards,
> > > --
> > > Sérgio Nunes
> > >
> >
> > _______________________________________________
> > Wiki-research-l mailing list -- [email protected]
> > To unsubscribe send an email to [email protected]
