Miriam wrote a query to find images that are used more than N times on a wiki, which are probably placeholders or icons (https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/blob/main/image_suggestions/cassandra.py#L167). And here's the query that calculates this threshold (https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/blob/main/image_suggestions/cassandra.py#L143).
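
To make the idea concrete, here is a minimal Python sketch of that kind of usage-count filter. It is not the query in cassandra.py: the function name, the (wiki, page, image) row layout, the sample rows, and the threshold value are all made up for illustration, and the way the real threshold is calculated is not reproduced here.

```python
from collections import Counter


def find_overused_images(usages, threshold):
    """Return {(wiki, image): page_count} for images used on more than
    `threshold` distinct pages of the same wiki.

    `usages` is an iterable of (wiki, page, image) tuples. This layout is
    an assumption made for the sketch, not the schema of the real query.
    """
    counts = Counter()
    # De-duplicate first so an image repeated on one page counts once.
    for wiki, _page, image in set(usages):
        counts[(wiki, image)] += 1
    return {key: n for key, n in counts.items() if n > threshold}


# Hypothetical sample rows, invented for this example.
SAMPLE_USAGES = [
    ("enwiki", "Page A", "Placeholder.svg"),
    ("enwiki", "Page B", "Placeholder.svg"),
    ("enwiki", "Page C", "Placeholder.svg"),
    ("enwiki", "Page A", "Portrait.jpg"),
    ("dewiki", "Seite A", "Placeholder.svg"),
]

if __name__ == "__main__":
    # With a (hypothetical) threshold of 2, only the placeholder on enwiki
    # is flagged, because it appears on three distinct pages there.
    print(find_overused_images(SAMPLE_USAGES, threshold=2))
```

Counting per wiki rather than globally mirrors the "used over N times on a wiki" wording above; whether the real query also de-duplicates per page is an assumption of this sketch.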

On Fri, Nov 4, 2022 at 2:52 PM Neil Shah-Quinn <nshahqu...@wikimedia.org> wrote:

> I believe Connie Chen and Isaac Johnson did some work on distinguishing
> "real images" from icons as part of the image suggestion analytics (
> T292316 <https://phabricator.wikimedia.org/T292316>). I don't know the
> details, but perhaps one of them could chime in.
>
> -----
> Neil Shah-Quinn
> senior data scientist, Product Analytics
> <https://www.mediawiki.org/wiki/Product_Analytics>
> Wikimedia Foundation <https://wikimediafoundation.org/>
>
>
> On Fri, 4 Nov 2022 at 11:17, Dan Andreescu <dandree...@wikimedia.org>
> wrote:
>
>> hm, you know, maybe it's not such a great idea to show all these small
>> files in the mediarequests/top endpoint. I imagine everyone trying to use
>> it would have the same problems you are. Maybe we can brainstorm together
>> on a way to filter out results you might not want. If that top 1000 list
>> included only images you found interesting, would that solve your problem?
>> If so, let's brainstorm.
>>
>> So the schema of the data we have available is this
>> <https://github.com/wikimedia/analytics-refinery/blob/master/hql/mediarequest/create_mediarequest_table.hql>
>> .
>>
>> base_name string COMMENT 'Base name of media file',
>> media_classification string COMMENT 'General classification of media (image, video, audio, data, document or other)',
>> file_type string COMMENT 'Extension or suffix of the file (e.g. jpg, wav, pdf)',
>> total_bytes bigint COMMENT 'Total number of bytes',
>> request_count bigint COMMENT 'Total number of requests',
>> transcoding string COMMENT 'Transcoding that the file was requested with, e.g. resized photo or image preview of a video',
>> agent_type string COMMENT 'Agent accessing the media files, can be spider or user',
>> referer string COMMENT 'Wiki project that the request was refered from. If project is not available, it will be either internal, external, or unknown',
>> dt string COMMENT 'UTC timestamp in ISO 8601 format (e.g. 2019-08-27T14:00:00Z)'
>>
>> And here's some sample data (request count > 50000 for privacy).
>>
>> "/wikipedia/commons/c/ca/Wiki_Loves_Monuments_Logo_notext.svg","image","svg","486642310","119697","image_0_199","user","en.wikipedia","2022-09-09T06:00:00Z","2022","9","9","6"
>> "/wikipedia/commons/d/d4/Button_hide.png","image","png","26477640","93145","original","user","en.wikipedia","2022-09-09T23:00:00Z","2022","9","9","23"
>> "/wikipedia/commons/c/ca/Wiki_Loves_Monuments_Logo_notext.svg","image","svg","300264742","73620","image_0_199","user","en.wikipedia","2022-09-09T05:00:00Z","2022","9","9","5"
>> "/wikipedia/commons/2/23/Icons-mini-file_acrobat.gif","image","gif","27279795","93779","original","user","ja.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
>> "/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg","image","svg","86260254","130257","image_0_199","user","en.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
>> "/wikipedia/commons/f/fa/Wikiquote-logo.svg","image","svg","254832231","83127","image_0_199","user","en.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
>> "/wikipedia/en/a/a4/Flag_of_the_United_States.svg","image","svg","76327061","90739","image_0_199","user","en.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
>> "/wikipedia/commons/b/b6/Queen_Elizabeth_II_in_March_2015.jpg","image","jpeg","1156030104","58651","image_200_399","user","en.wikipedia","2022-09-09T05:00:00Z","2022","9","9","5"
>> "/wikipedia/commons/2/28/Aaj_tak_logo.png","image","png","57716837856","469335","original","user","external","2022-09-09T02:00:00Z","2022","9","9","2"
>> "/wikipedia/commons/c/ca/Wiki_Loves_Monuments_Logo_notext.svg","image","svg","682088336","168655","image_0_199","user","en.wikipedia","2022-09-09T22:00:00Z","2022","9","9","22"
>>
>> Can you do some poking around to see if there's a size in bytes that
>> would be a good threshold, or a standard transcoding that is most used on
>> articles, or anything that would allow us to filter to only the kinds of
>> images you're interested in? If we find that, my thought is we can just
>> update the data behind the top 1000 endpoint. Then, if people want it
>> unfiltered, they can download the dumps, but that seems like the
>> exceptional case.
>>
>> (note: you would divide total_bytes by request_count if you want the size
>> of the file)
>>
>>
>> On Fri, Nov 4, 2022 at 11:10 AM Michele Mauri via Analytics <
>> analytics@lists.wikimedia.org> wrote:
>>
>>> Hi! Yes, I already tested those two ways. I used the mediarequests API (
>>> https://wikimedia.org/api/rest_v1/metrics/mediarequests/top/en.wikipedia.org/image/2022/05/all-days)
>>> but since they are just the first 1000, the largest part is made up of
>>> icons, buttons, etc., while I’d like to focus on the images that
>>> illustrate an article.
>>>
>>> I wrote a script to download all the dumps, open, sort and filter them
>>> to get a longer list, but it’s very time consuming.
>>>
>>> In the past I used article popularity as a proxy, but I was looking for
>>> a more granular approach that also considers the usage of images across
>>> different language versions.
>>>
>>> Best
>>>
>>> Michele
>>>
>>>
>>> *From: *Dan Andreescu <dandree...@wikimedia.org>
>>> *Date: *Friday, 4 November 2022 at 15:17
>>> *To: *Michele Mauri <michele.ma...@polimi.it>
>>> *Cc: *A mailing list for the Analytics Team at WMF and everybody who
>>> has an interest in Wikipedia and analytics. <
>>> analytics@lists.wikimedia.org>
>>> *Subject: *[Analytics] Re: Mediacounts fields
>>>
>>> I see. In practice, the mediaviewer instrumentation also had some
>>> inaccuracies. For example, the code pre-fetched certain images when
>>> opening a gallery even if the viewer never ended up looking at them. I
>>> think they adjusted the instrumentation to account for that, but I don't
>>> remember the details.
>>>
>>> One thought I had is, have you checked the mediarequests API
>>> <https://wikitech.wikimedia.org/wiki/Analytics/AQS/Mediarequests>?
>>> It's used to power metrics like top media requests
>>> <https://stats.wikimedia.org/#/en.wikipedia.org/content/top-mediarequests>
>>> (per project per month). And you can query it directly
>>> <https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2F1%2F1a%2FFlag_of_Argentina.svg/monthly/2022010100/2022100100>
>>> for specific images. It's backed by the same mediacounts data, so you're
>>> right, it counts all transfers. But that's a pretty good proxy for what
>>> was seen by a user. If you look at the top 1000 files requested I linked,
>>> you'll see a lot of icons and flags at the top, which makes sense. But in
>>> between all that you'll see real images like Liz Truss's portrait and
>>> Socrates and all that. You could filter to only larger images by
>>> downloading the image and checking its size.
>>>
>>> Or you can go another way and look at the top 1000 articles
>>> <https://stats.wikimedia.org/#/en.wikipedia.org/reading/top-viewed-articles>
>>> on a wiki, find all their images, and analyze those.
>>>
>>> Take a look around at the APIs and see if there's a way forward through
>>> that data (the stats.wikimedia.org site queries the API directly on the
>>> client-side, so if you open up your browser's developer tools you can
>>> discover the API that way. You can of course also browse the dynamic
>>> docs <https://wikimedia.org/api/rest_v1/#/Mediarequests%20data> :))
>>>
>>>
>>> On Thu, Nov 3, 2022 at 5:52 PM Michele Mauri <michele.ma...@polimi.it>
>>> wrote:
>>>
>>> Thanks. My goal is to understand which images on Commons are the most
>>> viewed through Wikipedia. According to the mediacounts description, it is
>>> possible to get the number of transfers. But if I understand it
>>> correctly, it counts all the images transferred to the user, making it
>>> difficult to tell which have really been “seen” by the user.
>>> Furthermore, it includes all the interface images and icons, making it
>>> difficult to filter down to the images used to illustrate the article.
>>>
>>> Focusing only on media viewer clicks seemed like a possible way to solve
>>> those issues. If you have other suggestions, they are welcome!
>>>
>>> Best
>>>
>>> Michele
>>>
>>>
>>> *From: *Dan Andreescu <dandree...@wikimedia.org>
>>> *Date: *Thursday, 3 November 2022 at 22:30
>>> *To: *A mailing list for the Analytics Team at WMF and everybody who
>>> has an interest in Wikipedia and analytics. <
>>> analytics@lists.wikimedia.org>
>>> *Cc: *Michele Mauri <michele.ma...@polimi.it>
>>> *Subject: *Re: [Analytics] Mediacounts fields
>>>
>>> We don't have any public data on media viewer interactions
>>> specifically. We used to have instrumentation on that feature but we
>>> haven't tracked it since last year. To get access to some of the old
>>> sanitized data that was retained for research purposes, you'd have to file
>>> a formal research proposal, and it doesn't seem likely to get approved, but
>>> maybe tell us more about what you're trying to do?
>>>
>>> What questions are you hoping to answer? Maybe there's another way or
>>> another kind of dataset that would serve more use cases.
>>>
>>>
>>> On Thu, Nov 3, 2022 at 4:12 PM Michele Mauri via Analytics <
>>> analytics@lists.wikimedia.org> wrote:
>>>
>>> Hello,
>>>
>>> For academic research, I'd like to see which images are the most viewed
>>> through the "media viewer".
>>>
>>> Do you know if it’s possible to get this information? I looked on the
>>> wikitech portal, but I found just the mediacounts (
>>> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts)
>>> which is not what I’m looking for.
>>>
>>> Thank you
>>>
>>> Michele
>>>
>>> _______________________________________________
>>> Analytics mailing list -- analytics@lists.wikimedia.org
>>> To unsubscribe send an email to analytics-le...@lists.wikimedia.org
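
On Dan's note in the thread above that total_bytes divided by request_count gives the size of the file: a minimal Python sketch of that arithmetic, run on four of the sample rows he quoted. The column positions follow the schema in his message; the function names and the 10,000-byte cut-off are hypothetical, chosen only to show the mechanics, and are not a threshold anyone in the thread proposed.

```python
import csv
import io

# Four of the sample rows quoted in the thread, copied verbatim.
SAMPLE_ROWS = """\
"/wikipedia/commons/c/ca/Wiki_Loves_Monuments_Logo_notext.svg","image","svg","486642310","119697","image_0_199","user","en.wikipedia","2022-09-09T06:00:00Z","2022","9","9","6"
"/wikipedia/commons/d/d4/Button_hide.png","image","png","26477640","93145","original","user","en.wikipedia","2022-09-09T23:00:00Z","2022","9","9","23"
"/wikipedia/commons/b/b6/Queen_Elizabeth_II_in_March_2015.jpg","image","jpeg","1156030104","58651","image_200_399","user","en.wikipedia","2022-09-09T05:00:00Z","2022","9","9","5"
"/wikipedia/commons/2/28/Aaj_tak_logo.png","image","png","57716837856","469335","original","user","external","2022-09-09T02:00:00Z","2022","9","9","2"
"""

# Hypothetical cut-off in bytes; NOT a value suggested in the thread.
MIN_BYTES_PER_REQUEST = 10_000


def bytes_per_request(csv_text):
    """Return (base_name, transcoding, average bytes per request) per row."""
    results = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:  # skip blank lines
            continue
        # Schema order: base_name, media_classification, file_type,
        # total_bytes, request_count, transcoding, ...
        base_name, total_bytes, request_count, transcoding = (
            row[0], row[3], row[4], row[5]
        )
        size = int(total_bytes) / int(request_count)
        results.append((base_name, transcoding, size))
    return results


def keep_large(csv_text, min_bytes=MIN_BYTES_PER_REQUEST):
    """Keep base names whose average transfer size is at least min_bytes."""
    return [
        name
        for name, _transcoding, size in bytes_per_request(csv_text)
        if size >= min_bytes
    ]


if __name__ == "__main__":
    for name, transcoding, size in bytes_per_request(SAMPLE_ROWS):
        print(f"{size:>12.1f}  {transcoding:<14}  {name}")
```

Two caveats on reading the result. For rows with a thumbnail transcoding such as image_0_199, the quotient is presumably the size of the rendition that was served rather than of the original upload. And on these four rows the hypothetical cut-off keeps the Queen Elizabeth II photo but also Aaj_tak_logo.png, so a byte threshold alone may not separate illustrations from logos.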