Miriam wrote a query to find images that are used more than N times on a
wiki and are therefore probably placeholders or icons (
https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/blob/main/image_suggestions/cassandra.py#L167).
And here's the query that calculates this threshold (
https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/blob/main/image_suggestions/cassandra.py#L143
).


On Fri, Nov 4, 2022 at 2:52 PM Neil Shah-Quinn <nshahqu...@wikimedia.org>
wrote:

> I believe Connie Chen and Isaac Johnson did some work on distinguishing
> "real images" from icons as part of the image suggestion analytics (
> T292316 <https://phabricator.wikimedia.org/T292316>). I don't know the
> details, but perhaps one of them could chime in.
>
> -----
> Neil Shah-Quinn
> senior data scientist, Product Analytics
> <https://www.mediawiki.org/wiki/Product_Analytics>
> Wikimedia Foundation <https://wikimediafoundation.org/>
>
>
> On Fri, 4 Nov 2022 at 11:17, Dan Andreescu <dandree...@wikimedia.org>
> wrote:
>
>> Hm, you know, maybe it's not such a great idea to show all these small
>> files in the mediarequests/top endpoint.  I imagine everyone trying to use
>> it would have the same problems you're having.  Maybe we can brainstorm
>> together on a way to filter out results you might not want.  If that top
>> 1000 list included only images you found interesting, would that solve
>> your problem?  If so, let's brainstorm.
>>
>> So the schema of the data we have available is this
>> <https://github.com/wikimedia/analytics-refinery/blob/master/hql/mediarequest/create_mediarequest_table.hql>
>> .
>>
>> base_name            string COMMENT 'Base name of media file',
>> media_classification string COMMENT 'General classification of media
>> (image, video, audio, data, document or other)',
>> file_type            string COMMENT 'Extension or suffix of the file
>> (e.g. jpg, wav, pdf)',
>> total_bytes          bigint COMMENT 'Total number of bytes',
>> request_count        bigint COMMENT 'Total number of requests',
>> transcoding          string COMMENT 'Transcoding that the file was
>> requested with, e.g. resized photo or image preview of a video',
>> agent_type           string COMMENT 'Agent accessing the media files, can
>> be spider or user',
>> referer              string COMMENT 'Wiki project that the request was
>> refered from. If project is not available, it will be either internal,
>> external, or unknown',
>> dt                   string COMMENT 'UTC timestamp in ISO 8601 format
>> (e.g. 2019-08-27T14:00:00Z)'
>>
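>> And here's some sample data (request count > 50000 for privacy).  The
>> last four values in each row appear to be the year, month, day and hour
>> of the dt timestamp.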
>>
>>
>> "/wikipedia/commons/c/ca/Wiki_Loves_Monuments_Logo_notext.svg","image","svg","486642310","119697","image_0_199","user","en.wikipedia","2022-09-09T06:00:00Z","2022","9","9","6"
>>
>> "/wikipedia/commons/d/d4/Button_hide.png","image","png","26477640","93145","original","user","en.wikipedia","2022-09-09T23:00:00Z","2022","9","9","23"
>>
>> "/wikipedia/commons/c/ca/Wiki_Loves_Monuments_Logo_notext.svg","image","svg","300264742","73620","image_0_199","user","en.wikipedia","2022-09-09T05:00:00Z","2022","9","9","5"
>>
>> "/wikipedia/commons/2/23/Icons-mini-file_acrobat.gif","image","gif","27279795","93779","original","user","ja.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
>>
>> "/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg","image","svg","86260254","130257","image_0_199","user","en.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
>>
>> "/wikipedia/commons/f/fa/Wikiquote-logo.svg","image","svg","254832231","83127","image_0_199","user","en.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
>>
>> "/wikipedia/en/a/a4/Flag_of_the_United_States.svg","image","svg","76327061","90739","image_0_199","user","en.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
>>
>> "/wikipedia/commons/b/b6/Queen_Elizabeth_II_in_March_2015.jpg","image","jpeg","1156030104","58651","image_200_399","user","en.wikipedia","2022-09-09T05:00:00Z","2022","9","9","5"
>>
>> "/wikipedia/commons/2/28/Aaj_tak_logo.png","image","png","57716837856","469335","original","user","external","2022-09-09T02:00:00Z","2022","9","9","2"
>>
>> "/wikipedia/commons/c/ca/Wiki_Loves_Monuments_Logo_notext.svg","image","svg","682088336","168655","image_0_199","user","en.wikipedia","2022-09-09T22:00:00Z","2022","9","9","22"
>>
>> Can you do some poking around to see if there's a size in bytes that
>> would be a good threshold, or a standard transcoding that is most used on
>> articles, or anything that would allow us to filter to only the kinds of
>> images you're interested in?  If we find that, my thought is we can just
>> update the data behind the top 1000 endpoint.  Then, if people want it
>> unfiltered, they can download the dumps, but that seems like the
>> exceptional case.
>>
>> (note: you would divide total_bytes by request_count if you want the size
>> of the file)
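
A minimal sketch of the division described in the note above, applied to four of the sample rows. The column order is taken from the schema quoted earlier; the helper names and the 10,000-byte cutoff are assumptions made for illustration only, not a threshold anyone in this thread has proposed.

```python
import csv
import io

# Column names from the schema quoted above. Each sample row carries four
# extra trailing values, which zip() below simply drops.
FIELDS = [
    "base_name", "media_classification", "file_type", "total_bytes",
    "request_count", "transcoding", "agent_type", "referer", "dt",
]

# Four rows copied verbatim from the sample data in this thread.
SAMPLE = '''
"/wikipedia/commons/c/ca/Wiki_Loves_Monuments_Logo_notext.svg","image","svg","486642310","119697","image_0_199","user","en.wikipedia","2022-09-09T06:00:00Z","2022","9","9","6"
"/wikipedia/commons/d/d4/Button_hide.png","image","png","26477640","93145","original","user","en.wikipedia","2022-09-09T23:00:00Z","2022","9","9","23"
"/wikipedia/commons/b/b6/Queen_Elizabeth_II_in_March_2015.jpg","image","jpeg","1156030104","58651","image_200_399","user","en.wikipedia","2022-09-09T05:00:00Z","2022","9","9","5"
"/wikipedia/commons/2/28/Aaj_tak_logo.png","image","png","57716837856","469335","original","user","external","2022-09-09T02:00:00Z","2022","9","9","2"
'''

# Hypothetical cutoff, chosen only to make the example concrete.
MIN_BYTES = 10_000


def parse_rows(text):
    """Parse CSV text into dicts keyed by the schema's column names."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(FIELDS, record)) for record in reader if record]


def bytes_per_request(row):
    """Average bytes transferred per request, a proxy for the file size."""
    return int(row["total_bytes"]) / int(row["request_count"])


def filter_large(rows, min_bytes):
    """Keep rows whose average transfer size is at least min_bytes."""
    return [row for row in rows if bytes_per_request(row) >= min_bytes]


if __name__ == "__main__":
    for row in filter_large(parse_rows(SAMPLE), MIN_BYTES):
        print(row["base_name"], round(bytes_per_request(row)))
```

With this cutoff the two small interface files drop out and the portrait stays, but the hotlinked Aaj_tak_logo.png also passes, so a size threshold on its own would probably need to be combined with another signal such as the referer or the transcoding.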
>>
>>
>> On Fri, Nov 4, 2022 at 11:10 AM Michele Mauri via Analytics <
>> analytics@lists.wikimedia.org> wrote:
>>
>>> Hi! Yes, I already tested those two ways. I used the mediarequests API (
>>> https://wikimedia.org/api/rest_v1/metrics/mediarequests/top/en.wikipedia.org/image/2022/05/all-days),
>>> but since it returns only the first 1000 files, most of them are icons,
>>> buttons, etc., while I’d like to focus on the images that illustrate an
>>> article.
>>>
>>>
>>>
>>> I wrote a script to download all the dumps, then open, sort and filter
>>> them to get a longer list, but it’s very time-consuming.
>>>
>>> In the past I used article popularity as a proxy, but I am looking for a
>>> more granular approach that also considers how images are used across
>>> different language versions.
>>>
>>>
>>>
>>> Best
>>>
>>>
>>>
>>> Michele
>>>
>>>
>>>
>>> *From: *Dan Andreescu <dandree...@wikimedia.org>
>>> *Date: *Friday, 4 November 2022 at 15:17
>>> *To: *Michele Mauri <michele.ma...@polimi.it>
>>> *Cc: *A mailing list for the Analytics Team at WMF and everybody who
>>> has an interest in Wikipedia and analytics. <
>>> analytics@lists.wikimedia.org>
>>> *Subject: *[Analytics] Re: Mediacounts fields
>>>
>>> I see.  In practice, the mediaviewer instrumentation also had some
>>> inaccuracies.  For example, the code pre-fetched certain images when
>>> opening a gallery even if the viewer never ended up looking at them.  I
>>> think they adjusted the instrumentation to account for that, but I don't
>>> remember the details.
>>>
>>>
>>>
>>> One thought I had is, have you checked the mediarequests API
>>> <https://wikitech.wikimedia.org/wiki/Analytics/AQS/Mediarequests>?
>>> It's used to power metrics like top media requests
>>> <https://stats.wikimedia.org/#/en.wikipedia.org/content/top-mediarequests>
>>> (per project per month).  And you can query it directly
>>> <https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2F1%2F1a%2FFlag_of_Argentina.svg/monthly/2022010100/2022100100>
>>> for specific images.  It's backed by the same mediacounts data, so you're
>>> right, it counts all transfers.  But that's a pretty good proxy for what
>>> was seen by a user.  If you look at the top 1000 files requested I linked,
>>> you'll see a lot of icons and flags at the top, which makes sense.  But in
>>> between all that you'll see real images like Liz Truss's portrait and
>>> Socrates and all that.  You could filter to only larger images by
>>> downloading the image and checking its size.
>>>
>>>
>>>
>>> Or you can go another way and look at the top 1000 articles
>>> <https://stats.wikimedia.org/#/en.wikipedia.org/reading/top-viewed-articles>
>>> on a wiki, find all their images, and analyze those.
>>>
>>>
>>>
>>> Take a look around at the APIs and see if there's a way forward through
>>> that data (the stats.wikimedia.org site queries the API directly on the
>>> client-side, so if you open up your browser's developer tools you can
>>> discover the API that way.  You can of course also browse the dynamic
>>> docs <https://wikimedia.org/api/rest_v1/#/Mediarequests%20data> :))
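
As a rough sketch, the two mediarequests URLs that appear in this thread can be built like this. The URL layout is copied from those links; the function and parameter names are made up for illustration, and fetch_json and its User-Agent string are placeholders. The shape of the JSON response is not assumed here, so check it against the dynamic docs linked above.

```python
import json
import urllib.request
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/mediarequests"


def top_url(referer, media_type, year, month, day="all-days"):
    """URL of the top-files list for one referer, media type and month."""
    return f"{BASE}/top/{referer}/{media_type}/{year}/{month:02d}/{day}"


def per_file_url(file_path, start, end, referer="all-referers",
                 agent="all-agents", granularity="monthly"):
    """URL of the request counts for a single file.

    file_path is the upload path (e.g. /wikipedia/commons/...). It is fully
    percent-encoded, slashes included, as in the link earlier in this thread.
    """
    encoded = quote(file_path, safe="")
    return (f"{BASE}/per-file/{referer}/{agent}/{encoded}/"
            f"{granularity}/{start}/{end}")


def fetch_json(url, user_agent="example-script/0.1 (placeholder contact)"):
    """Fetch a URL and decode the JSON body (placeholder User-Agent)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    print(top_url("en.wikipedia.org", "image", 2022, 5))
    print(per_file_url("/wikipedia/commons/1/1a/Flag_of_Argentina.svg",
                       "2022010100", "2022100100"))
```

Printing the URLs first and only then calling fetch_json on them makes it easy to compare the result with what the browser's developer tools show on stats.wikimedia.org.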
>>>
>>>
>>>
>>> On Thu, Nov 3, 2022 at 5:52 PM Michele Mauri <michele.ma...@polimi.it>
>>> wrote:
>>>
>>> Thanks. My goal is to understand which images on Commons are the most
>>> viewed through Wikipedia. According to the mediacounts description, it is
>>> possible to get the number of transfers. But if I understood correctly, it
>>> counts all the images transferred to the user, which makes it difficult to
>>> tell which ones have really been “seen” by the user. Furthermore, it
>>> includes all the interface images and icons, which makes it difficult to
>>> filter for only the images used to illustrate an article.
>>>
>>> Focusing only on media viewer clicks seemed to be a possible solution to
>>> those issues. If you have other suggestions, they are welcome!
>>>
>>>
>>>
>>> Best
>>>
>>>
>>>
>>> Michele
>>>
>>>
>>>
>>> *From: *Dan Andreescu <dandree...@wikimedia.org>
>>> *Date: *Thursday, 3 November 2022 at 22:30
>>> *To: *A mailing list for the Analytics Team at WMF and everybody who
>>> has an interest in Wikipedia and analytics. <
>>> analytics@lists.wikimedia.org>
>>> *Cc: *Michele Mauri <michele.ma...@polimi.it>
>>> *Subject: *Re: [Analytics] Mediacounts fields
>>>
>>> We don't have any public data on media viewer interactions
>>> specifically.  We used to have instrumentation on that feature but we
>>> haven't tracked it since last year.  To get access to some of the old
>>> sanitized data that was retained for research purposes, you'd have to file
>>> a formal research proposal, and it doesn't seem likely to get approved, but
>>> maybe tell us more about what you're trying to do?
>>>
>>>
>>>
>>> What questions are you hoping to answer?  Maybe there's another way or
>>> another kind of dataset that would serve more use cases.
>>>
>>>
>>>
>>> On Thu, Nov 3, 2022 at 4:12 PM Michele Mauri via Analytics <
>>> analytics@lists.wikimedia.org> wrote:
>>>
>>> Hello,
>>>
>>>
>>>
>>> For an academic research project, I'd like to see which images are the
>>> most viewed through the "media viewer".
>>>
>>> Do you know if it’s possible to get this information? I looked on the
>>> wikitech portal, but I found only the mediacounts dataset (
>>> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts),
>>> which is not what I’m looking for.
>>>
>>>
>>>
>>> Thank you
>>>
>>>
>>>
>>> Michele
>>>
>>> _______________________________________________
>>> Analytics mailing list -- analytics@lists.wikimedia.org
>>> To unsubscribe send an email to analytics-le...@lists.wikimedia.org