On Fri, 31 Aug 2012 05:36:18 -0700, Ariel T. Glenn <ar...@wikimedia.org> wrote:

So it's time to have this discussion again.  At least, I think we're
having it again, though I could not find previous threads on this list
about the subject.

In short, scaled media is currently generated on the fly for any size
and for any user.  The resulting files are kept around forever or until
we run perilously short of space, at which point we make some guesses
about what we can toss and then do a mass purge. Last time we did so, we
had the rotation bug going at the same time, which made for a real fine
mess.

A little bit of crunching shows me that we have about 6 million images
in use on the projects, and yet we manage to have around 130 million
thumbnails.  Just for fun I checked to see how many thumbs each image
has, what sizes we are looking at, etc.  Here are the results.

Some "standard" sizes are most popular, with between 200K and 640K media
files having thumbs scaled to each of these widths:
75, 120, 150, 180, 200, 220, 320, 640, 800, 1024, and 1280 pixels

But there are plenty of "odd" sizes with lots of thumbs too. For example,
over 65K files have a thumb of width 181px, and 20K have one of width 138px.

As an experiment and before having this data, I purged from ms5 (no
longer in use for thumbs) 1/16 of the thumbs that were greater than
100px wide but not one of these widths:
120px, 200px, 220px, 250px, 320px, 640px, 800px
We got back over 300GB of space.
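
The selection rule used in that experiment can be sketched roughly as
below. This is only an illustration, not the script that was actually
run. It assumes that a thumb's file name starts with "<width>px-"; the
names thumb_width and is_purge_candidate are made up for the example,
and the 1/16 sampling is not modelled.

```python
import re

# Widths that were kept in the experiment described above.
KEEP_WIDTHS = {120, 200, 220, 250, 320, 640, 800}
# Only thumbs wider than this were candidates for purging.
MIN_WIDTH = 100

# Assumption: the thumb file name begins with "<width>px-".
WIDTH_RE = re.compile(r"^(\d+)px-")


def thumb_width(path):
    """Return the width encoded in a thumb path, or None if not found."""
    name = path.rsplit("/", 1)[-1]
    match = WIDTH_RE.match(name)
    return int(match.group(1)) if match else None


def is_purge_candidate(path):
    """True for thumbs wider than MIN_WIDTH whose width is not kept."""
    width = thumb_width(path)
    if width is None:
        # Unrecognised names are left alone rather than guessed at.
        return False
    return width > MIN_WIDTH and width not in KEEP_WIDTHS
```

Anything whose name does not match the assumed pattern is skipped, so
a wrong guess about naming errs on the side of keeping files.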

The other thing about delivering any scaled version on demand is that we
have some media files with several hundred different thumb sizes in
there. Here are a few of the top offenders, with the number of thumbs
for each, for your entertainment:

 2514  wikipedia/commons/thumb/f/f9/Orange_and_cross_section.jpg
 2285  wikipedia/commons/thumb/f/fb/Thrermal_grease.jpg
 2218  wikipedia/commons/thumb/f/fc/Blue_sport.jpg
 2071  wikipedia/commons/thumb/f/f3/Flag_of_Switzerland.svg
 2062  wikipedia/commons/thumb/f/f2/Flag_of_Costa_Rica.svg
 2034  wikipedia/commons/thumb/f/f8/Wiktionary-logo-en.svg
 1915  wikipedia/commons/thumb/f/f6/VeulesLesRoses.JPG
 1689  wikipedia/commons/thumb/f/fa/Wikibooks-logo.svg
 1447  wikipedia/commons/thumb/f/fa/Wikiquote-logo.svg
 1371  wikipedia/commons/thumb/f/f0/Mori_Uncanny_Valley.svg
 1249  wikipedia/commons/thumb/f/f5/Grand_prismatic_spring.jpg
 1246  wikipedia/commons/thumb/f/f3/Mature.jpg
 1191  wikipedia/commons/thumb/f/f7/Kirchdorf_in_Tirol.JPG
 1187  wikipedia/commons/thumb/f/f8/Camille_Cabral_pour_les_Trans.JPG
 1143  wikipedia/commons/thumb/f/f7/Profanity.svg
 1079  wikipedia/commons/thumb/f/f2/HSV_color_solid_cone.png
 1040  wikipedia/commons/thumb/f/f2/Carmen_Electra.jpg
 1032  wikipedia/commons/thumb/f/f1/Pink_eye.jpg
 1001  wikipedia/commons/thumb/f/f6/USNS_Medgar_Evers_announcement.jpg

I'd comment on some of those but I'd be too snarky.
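
A per-file tally like the one above could be produced along these
lines. This is a sketch under an assumption: each thumb is stored as
".../thumb/<x>/<xy>/<original name>/<thumb name>", so the parent
directory identifies the original. The function names are hypothetical
and the input is simply an iterable of thumb paths, however obtained.

```python
from collections import Counter


def thumbs_per_original(thumb_paths):
    """Count thumbs per original, keyed by the thumb's parent directory."""
    counts = Counter()
    for path in thumb_paths:
        # Assumption: the last path component is the thumb itself and
        # everything before it names the original file's thumb directory.
        original, sep, _thumb = path.rpartition("/")
        if sep:
            counts[original] += 1
    return counts


def top_offenders(thumb_paths, n=20):
    """Return the n originals with the most thumbs as (path, count) pairs."""
    return thumbs_per_original(thumb_paths).most_common(n)
```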

So there are some things we could change:

1.  We could generate and keep only certain sizes, tossing the rest.
2.  We could keep *nothing*, scaling all media as required.
3.  We could have a cron job that was clever about tossing thumbs every
day (not sure how easy it would be to be clever).
4.  ??

In any of these cases, the squids will have copies of recently requested
scaled media, so we won't be scaling the same file to the same size over
and over in a short time frame.

What do folks think about how to proceed?

Ariel

Another idea I've played with is developing an LRU filesystem, probably as a FUSE module. You would mount it at thumbs/ and unused files would periodically disappear.
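
To make the eviction idea concrete, here is a rough sketch of a much simpler stand-in: a periodic sweep that deletes the least recently accessed files until the tree fits a size budget. It is not a FUSE filesystem. The name lru_sweep and its parameters are hypothetical, and it relies on the filesystem recording access times, which mounts using noatime or relatime will not do reliably.

```python
import os


def lru_sweep(root, max_bytes, dry_run=True):
    """Evict least recently accessed files under root until the total
    size is at most max_bytes. Returns the paths chosen for removal."""
    entries = []
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished while we were walking
            entries.append((st.st_atime, st.st_size, path))
            total += st.st_size

    entries.sort()  # oldest access time first
    removed = []
    for _atime, size, path in entries:
        if total <= max_bytes:
            break
        if not dry_run:
            try:
                os.remove(path)
            except OSError:
                continue  # could not delete; leave it out of the tally
        removed.append(path)
        total -= size
    return removed
```

With dry_run left at its default the function only reports what it would delete, which seems the safer way to try a policy like this against a real thumbs/ tree.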

--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
